TL;DR
- 78% of marketplace messages are answerable by AI with grounded retrieval (order data + product specs + policy).
- Don't let LLMs answer: warranty disputes, A-to-z claims, custom-product questions, or anything implying a refund commitment.
- First response under 60 minutes is the seller-score threshold to optimize for, not "instant".
- Tone matters as much as accuracy. Warm boilerplate beats cold precision in marketplace messaging.
What works
1. Grounded order-status questions
"Where's my package?" is 31% of all buyer messages. With order tracking and carrier API access, an LLM gives a better answer than a human in less time.
2. Spec clarifications from product data
"Does this fit a 2017 Civic?" "Is this dishwasher safe?" Pull the answer from your product attributes and cite the attribute it came from.
3. Policy questions
Returns, shipping, payment options. Build a policy doc, retrieve from it, never invent.
4. Multilingual triage
An LLM that translates Polish or German into English while preserving context beats Google Translate by a wide margin. Even if a human writes the response, AI-translated triage shaves 40% off response time.
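The grounding rule behind items 1 to 3 (answer only from stored data, cite it, never invent) can be sketched roughly as below. This is a minimal illustration, not FeedPilot's implementation: the `PRODUCT_ATTRIBUTES` store, the `GroundedAnswer` type, and the `answer_spec_question` function are all hypothetical names, and a real setup would query your catalog database instead of a dict.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical attribute store (assumption); stands in for a real catalog database.
PRODUCT_ATTRIBUTES = {
    "SKU-1001": {"dishwasher_safe": "yes", "material": "stainless steel"},
}


@dataclass
class GroundedAnswer:
    text: Optional[str]    # None means no grounded answer is available
    source: Optional[str]  # citation for the fact that was used
    route_to_human: bool


def answer_spec_question(sku: str, attribute: str) -> GroundedAnswer:
    """Answer only from stored attributes; never guess a missing value."""
    value = PRODUCT_ATTRIBUTES.get(sku, {}).get(attribute)
    if value is None:
        # Not in the product data: hand off instead of letting a model improvise.
        return GroundedAnswer(text=None, source=None, route_to_human=True)
    label = attribute.replace("_", " ").capitalize()
    return GroundedAnswer(
        text=f"{label}: {value}.",
        source=f"product attributes for {sku}, field '{attribute}'",
        route_to_human=False,
    )
```

A known attribute such as `answer_spec_question("SKU-1001", "dishwasher_safe")` comes back with text and a source, while an attribute missing from the data comes back with `route_to_human=True` and no text.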
Where it backfires
1. Warranty and quality complaints
Anything implying a defect needs a human. LLMs over-apologize, over-promise refunds, and create commitments your policy can't honor.
2. A-to-z and dispute messages
Marketplace dispute messages are legal/financial communications. Always human, always documented.
3. Custom-product or out-of-catalog questions
If the answer isn't in the product database, an LLM will hallucinate plausibly. Route these to humans.
4. Anything emotional
"My grandmother received this and..." — even if the request is operationally simple, the emotional context needs human empathy and judgment. Sentiment classifiers catch most of these.
Operator's tip: Build the routing layer before the answering layer. The hardest engineering problem in AI support isn't generation — it's deciding what not to answer. We use a 3-class classifier: AI-handle, AI-draft-human-send, human-only.
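A routing layer with those three classes might look roughly like the sketch below. It swaps the classifier for plain keyword rules plus a confidence score so the example stays self-contained. The keyword list, the `grounding_confidence` input, and both thresholds are assumptions for illustration, not the rules described above.

```python
from enum import Enum


class Route(Enum):
    AI_HANDLE = "ai-handle"
    AI_DRAFT_HUMAN_SEND = "ai-draft-human-send"
    HUMAN_ONLY = "human-only"


# Illustrative terms only (assumption); a production system would use a trained
# classifier, plus a separate sentiment model for emotional messages.
SENSITIVE_TERMS = ("a-to-z", "dispute", "chargeback", "warranty", "defect", "broken", "refund")

# Hypothetical thresholds on a 0..1 retrieval confidence score.
AUTO_SEND_THRESHOLD = 0.8
DRAFT_THRESHOLD = 0.5


def route_message(message: str, grounding_confidence: float) -> Route:
    """Decide who answers before any reply is generated."""
    text = message.lower()
    if any(term in text for term in SENSITIVE_TERMS):
        return Route.HUMAN_ONLY
    if grounding_confidence >= AUTO_SEND_THRESHOLD:
        return Route.AI_HANDLE
    if grounding_confidence >= DRAFT_THRESHOLD:
        return Route.AI_DRAFT_HUMAN_SEND
    # Little or no supporting data: the model would have to guess.
    return Route.HUMAN_ONLY
```

The sensitive-term check runs first on purpose: a dispute or warranty message goes to a human even when retrieval confidence is high.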
Metrics that matter
- Resolution rate without escalation: 78% is a healthy baseline. Above 88% means you're letting AI take risks it shouldn't.
- First-response time: Aim for under 60 min, not under 60 sec. Buyers don't trust instant responses for sensitive issues.
- Human override rate: The percentage of AI drafts that a human edits before sending. Below 15% means your policy doc is good.
- CSAT post-resolution: Track the gap between AI-resolved and human-resolved tickets. If AI-resolved CSAT is more than 5 points below human-resolved CSAT, simplify your bot.
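For teams computing these numbers themselves, the four metrics can be derived from a ticket export along the lines of the sketch below. The `Ticket` fields and the `support_metrics` function are hypothetical, and the CSAT gap is returned in whatever scale your ratings use, so compare it to the threshold above in the same units.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Ticket:
    resolved_by: str                 # "ai" or "human"
    escalated: bool                  # True if the ticket was handed off after first contact
    ai_draft_edited: Optional[bool]  # None when no AI draft was produced
    csat: Optional[float]            # None when the buyer left no rating
    first_response_minutes: float


def support_metrics(tickets: List[Ticket]) -> dict:
    """Compute the four metrics; assumes a non-empty ticket list."""
    total = len(tickets)
    drafts = [t for t in tickets if t.ai_draft_edited is not None]
    ai_csat = [t.csat for t in tickets if t.resolved_by == "ai" and t.csat is not None]
    human_csat = [t.csat for t in tickets if t.resolved_by == "human" and t.csat is not None]
    return {
        "resolution_rate_without_escalation": sum(not t.escalated for t in tickets) / total,
        "first_response_under_60_min": sum(t.first_response_minutes < 60 for t in tickets) / total,
        # Share of AI drafts that a human changed before sending.
        "human_override_rate": sum(t.ai_draft_edited for t in drafts) / len(drafts) if drafts else None,
        # Positive value means human-resolved tickets scored higher.
        "csat_gap_human_minus_ai": mean(human_csat) - mean(ai_csat) if ai_csat and human_csat else None,
    }
```

Metrics that cannot be computed (no drafts, or no ratings on one side) come back as `None` instead of a misleading zero.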
"We turned the bot off for warranty claims after one week. CSAT on those went from 3.8 to 4.6 with a human in the loop. We left the bot on for 'where's my order?' and CSAT there is 4.7 — higher than humans, because it's faster." — Marek S., support lead at an electronics seller on Amazon DE
Want AI support that knows when to defer to humans? FeedPilot's CX engine handles all four categories above with the routing logic baked in. See how it works.