Intent Recognition — Definition, How It Works, and Chatbot Examples (2026)
What it is
Intent recognition is the most fundamental NLU task in conversational AI. The bot designer defines a set of intents (typically 5-50 for a production chatbot — more is hard to manage), provides training examples per intent, and the NLU engine learns to map user messages to the closest match.
Example intent definitions:
- book_appointment — training: "I want to schedule a meeting", "book me for tomorrow", "I'd like an appointment", "can we set up a time?"
- cancel_subscription — training: "I want to cancel", "please end my subscription", "remove my account", "I'm done — stop billing."
- check_order_status — training: "where's my order?", "order status update", "when will my package arrive?"
At runtime, a user message is matched to the closest intent (with a confidence score) and the bot triggers the appropriate flow.
How intent recognition works technically
Two main approaches:
1. Traditional ML classifier
Vectorize the message (TF-IDF, word embeddings) → train a classifier (logistic regression, SVM, or neural network) → output probabilities per intent. The platform (Dialogflow, Rasa, Botpress NLU) handles all this; the operator only provides training examples.
Strengths: cheap to run, predictable, easy to audit. Weaknesses: requires labeled training data, and accuracy drops sharply on utterances outside the training distribution.
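The vectorize-then-score pipeline can be sketched in a few lines. This is a minimal illustration, not what Dialogflow, Rasa, or Botpress actually run: it swaps the trained classifier for nearest-example matching by cosine similarity over TF-IDF vectors so that it needs only the standard library. The names `TRAINING`, `build_index`, and `classify` are hypothetical, and the returned score is a similarity in [0, 1], not a calibrated probability.

```python
import math
import re
from collections import Counter

# Hypothetical training data, based on the example intent definitions above.
TRAINING = {
    "book_appointment": ["I want to schedule a meeting", "book me for tomorrow",
                         "I'd like an appointment", "can we set up a time?"],
    "cancel_subscription": ["I want to cancel", "please end my subscription",
                            "remove my account", "I'm done, stop billing."],
    "check_order_status": ["where's my order?", "order status update",
                           "when will my package arrive?"],
}

def tokenize(text):
    # Lowercase word tokens; apostrophes are kept so "where's" stays one token.
    return re.findall(r"[a-z']+", text.lower())

def tfidf(tokens, idf):
    # L2-normalized TF-IDF vector as a sparse dict; unknown tokens are dropped.
    counts = Counter(tokens)
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def build_index(training):
    # One vector per training example, labeled with its intent.
    docs = [(intent, tokenize(ex)) for intent, exs in training.items() for ex in exs]
    n = len(docs)
    df = Counter(tok for _, toks in docs for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    return idf, [(intent, tfidf(toks, idf)) for intent, toks in docs]

def classify(message, idf, vectors):
    # Return the intent of the most similar training example and its cosine score.
    query = tfidf(tokenize(message), idf)
    best_intent, best_score = None, 0.0
    for intent, vec in vectors:
        score = sum(w * vec.get(t, 0.0) for t, w in query.items())
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent, best_score

IDF, VECTORS = build_index(TRAINING)
intent, score = classify("where is my order", IDF, VECTORS)
```

A message that shares no vocabulary with the training examples comes back as `(None, 0.0)`, which is the out-of-distribution weakness in its simplest form.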
2. LLM-based zero-shot or few-shot classification
Prompt an LLM with the list of intents and descriptions, and ask "which intent matches this message?" No explicit training required.
Strengths: works with zero examples, handles phrasing variation gracefully, multilingual out-of-the-box. Weaknesses: more expensive per call (LLM tokens), less deterministic.
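As a rough sketch of the zero-shot approach, the code below builds the prompt and validates the reply. The LLM call itself is left as a hypothetical `complete` callable (prompt string in, reply string out), since the provider and client library are not specified here. The intent descriptions and the `out_of_scope` label are illustrative assumptions.

```python
# Hypothetical intent descriptions; no training examples are needed.
INTENTS = {
    "book_appointment": "The user wants to schedule a meeting or appointment.",
    "cancel_subscription": "The user wants to end a subscription or stop billing.",
    "check_order_status": "The user asks where an order is or when it will arrive.",
}

def build_prompt(message, intents):
    # List every intent with its description and ask for a single label.
    lines = [f"- {name}: {desc}" for name, desc in intents.items()]
    return (
        "Classify the user message into exactly one of these intents.\n"
        + "\n".join(lines)
        + "\n- out_of_scope: none of the above.\n"
        + "Reply with the intent name only.\n\n"
        + f"User message: {message}"
    )

def parse_reply(reply, intents):
    # Normalize the reply; anything that is not a known intent becomes out_of_scope.
    label = reply.strip().strip(" \n\t\"'`.").lower()
    return label if label in intents else "out_of_scope"

def classify_with_llm(message, intents, complete):
    # `complete` is an assumed callable wrapping whatever LLM API is in use.
    return parse_reply(complete(build_prompt(message, intents)), intents)
```

Validating the reply against the known intent names is one way to contain the lower determinism noted above: a free-form or unexpected answer degrades to `out_of_scope` instead of triggering an arbitrary flow.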
Modern production systems often combine both — cheap classifier for high-confidence matches, LLM fallback for ambiguous cases.
Intent recognition in chatbot platforms
- Dialogflow — intent-first architecture. Operators define intents with training phrases, and Dialogflow trains a classifier.
- Botpress — supports both NLU intent classification and LLM-driven intent reasoning.
- Manychat — flow-driven; button taps bypass intent recognition, while the LLM-based "AI Replies" feature handles intent implicitly.
- Rasa — open-source with dedicated intent classification component.
Best practices
- Keep intent set small. 5-15 intents per domain. Too many causes overlap and misclassification.
- Provide 10-30 training examples per intent. Cover phrasing variation (formal, casual, typo-laden, multilingual if applicable).
- Include negative examples. Train an "out-of-scope" catchall intent for when nothing matches.
- Test against real user messages. Synthetic training data doesn't reflect actual user phrasing. Mine real chat logs for new training examples weekly.
- Set a confidence threshold. Below the threshold, fall back to a clarification question rather than guessing.
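The confidence-threshold practice can be sketched as a small decision function. The 0.7 threshold, the `decide` name, and the wording of the clarification question are all assumptions for illustration; a real threshold would be tuned on held-out data.

```python
def decide(scores, threshold=0.7):
    """Pick an action from per-intent confidence scores.

    scores: dict mapping intent name to confidence (assumed to be in [0, 1]).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_score = ranked[0]
    if top_score >= threshold:
        return {"action": "trigger", "intent": top_intent}
    # Below the threshold: ask the user to choose between the two best guesses.
    options = [name for name, _ in ranked[:2]]
    return {"action": "clarify",
            "question": "Did you mean " + " or ".join(options) + "?"}
```

For example, `decide({"request_refund": 0.65, "cancel_subscription": 0.30})` returns a clarification question naming both intents rather than committing to the refund flow.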
Worked example — rule-based vs LLM zero-shot
Consider a small e-commerce chatbot with three intents: check_order_status, request_refund, and talk_to_human. Here is how each approach handles representative test utterances:
| User message | Rule-based classifier | LLM zero-shot |
|---|---|---|
| "where's my order?" | check_order_status (0.94) ✅ | check_order_status ✅ |
| "wheres my stuff" (typo, casual) | unmatched — fallback ⚠️ | check_order_status ✅ |
| "this product is broken, I need a refund" | request_refund (0.91) ✅ | request_refund ✅ |
| "I want to cancel AND get my money back" | request_refund (0.65) — misses cancel ⚠️ | both intents flagged ✅ |
| "Quero saber sobre meu pedido" (PT-BR) | unmatched ❌ | check_order_status ✅ |
| "Operating hours today?" | unmatched — fallback ⚠️ | out-of-scope flagged ✅ |
The classifier wins on cost and predictability (≈1 ms per call, no LLM token cost) but fails on typos, multi-intent utterances, and non-English messages. The LLM handles all three but costs roughly 50-200× more per call. Production systems often combine the two: a fast classifier handles the high-confidence majority; low-confidence cases route to the LLM.
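The combined setup amounts to a routing rule. The sketch below assumes two hypothetical callables: `fast_classifier` returns an `(intent, confidence)` pair, and `llm_classifier` returns an intent name. The 0.8 cut-off is an arbitrary illustrative value.

```python
def route(message, fast_classifier, llm_classifier, threshold=0.8):
    """Return (intent, source), using the LLM only for low-confidence messages."""
    intent, confidence = fast_classifier(message)
    if intent is not None and confidence >= threshold:
        return intent, "classifier"
    # Unmatched or low-confidence: pay for the slower, more flexible path.
    return llm_classifier(message), "llm"

# Stub classifiers that replay two rows of the table above.
def stub_fast(message):
    known = {
        "where's my order?": ("check_order_status", 0.94),
        "I want to cancel AND get my money back": ("request_refund", 0.65),
    }
    return known.get(message, (None, 0.0))

def stub_llm(message):
    return "check_order_status" if "pedido" in message else "request_refund"
```

Returning the source alongside the intent makes it possible to track what share of traffic reaches the LLM, which is the number that drives cost.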
Common failure modes
- Intent overlap. Two intents share too much vocabulary — check_order_status and check_shipping_eta both trigger on "where's my order." Fix by merging or by adding entity extraction to disambiguate.
- Out-of-scope underdetection. Without an explicit out-of-scope intent, every message gets classified into the closest defined intent — even when none fits. Always train an "other" catchall.
- Training data drift. Real user phrasing evolves. A bot trained six months ago misses newer slang or product references. Refresh quarterly.
- Multilingual gaps. Training in one language doesn't transfer. Train per-language separately or use an LLM zero-shot fallback for unsupported locales.
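Intent overlap, the first failure mode above, can be screened for before training. One simple heuristic (an assumption here, not a feature of any named platform) is the Jaccard similarity between the vocabularies of each pair of intents. The function names and the 0.4 cut-off are hypothetical.

```python
import re
from itertools import combinations

def vocabulary(examples):
    # Set of lowercase word tokens across all training examples of one intent.
    return {tok for ex in examples for tok in re.findall(r"[a-z']+", ex.lower())}

def overlap_report(training, min_jaccard=0.4):
    """List intent pairs whose training vocabularies overlap heavily."""
    vocab = {intent: vocabulary(exs) for intent, exs in training.items()}
    flagged = []
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        score = len(vocab[a] & vocab[b]) / len(union) if union else 0.0
        if score >= min_jaccard:
            flagged.append((a, b, round(score, 2)))
    return flagged
```

Pairs that show up in the report are candidates for merging or for entity-based disambiguation. Shared function words inflate the score, so a stop-word filter would likely be needed on real data.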
Related terms
- Natural Language Understanding — the broader category containing intent recognition.
- Entity extraction — the complementary task of pulling structured data from utterances.
- Natural Language Processing — the parent field.
FAQ
Is intent recognition the same as classification?
Yes — "intent classification", "intent detection", and "intent recognition" are used interchangeably. All describe mapping a user message to a predefined intent category.
How many intents should I have in a chatbot?
5-15 per domain works well. Above 30-50, intent overlap and misclassification grow rapidly. Larger systems typically decompose into multiple bots or use hierarchical intents.
Can LLMs replace intent classification entirely?
For open-ended deployments, often yes — the LLM understands what the user wants without an explicit intent label. For high-volume or cost-sensitive scenarios, explicit classification is still cheaper and more auditable.
How do I measure intent classification quality?
Run a held-out test set (200+ representative utterances if available) through the classifier and compute three metrics: intent accuracy (% mapped to correct intent), per-intent F1 (the harmonic mean of precision and recall, useful for spotting underperforming intents), and out-of-scope detection (% of off-topic messages flagged correctly). For multilingual deployments, run separate test sets per language — per-language F1 often varies more than overall accuracy suggests.
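A minimal sketch of these three metrics, computed from parallel lists of gold and predicted labels. F1 here is the harmonic mean of precision and recall, 2PR / (P + R). The `out_of_scope` label name and the `evaluate` function are assumptions for illustration.

```python
from collections import Counter

def evaluate(gold, predicted, oos_label="out_of_scope"):
    """Return (accuracy, per-intent F1 dict, out-of-scope detection rate).

    gold and predicted are non-empty, equal-length lists of intent labels.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    f1 = {}
    for label in set(gold) | set(predicted):
        predicted_n = tp[label] + fp[label]
        gold_n = tp[label] + fn[label]
        precision = tp[label] / predicted_n if predicted_n else 0.0
        recall = tp[label] / gold_n if gold_n else 0.0
        total = precision + recall
        f1[label] = 2 * precision * recall / total if total else 0.0
    accuracy = sum(tp.values()) / len(gold)
    # Out-of-scope detection: share of off-topic messages flagged as such.
    oos_total = sum(1 for g in gold if g == oos_label)
    oos_rate = tp[oos_label] / oos_total if oos_total else None
    return accuracy, f1, oos_rate
```

For a multilingual deployment, the same function can be called once per language-specific test set and the F1 dictionaries compared side by side.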
What's the threshold for retraining a classifier?
When per-intent F1 drops below 0.80 on a fresh test set, or when fallback rate exceeds 15% in production, retrain. Both signals indicate either training data drift (user phrasing evolved) or intent overlap (new intents added without rebalancing). Mature deployments retrain quarterly even without these triggers, just to catch slow drift.
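The two triggers can be expressed as a simple check. The thresholds come from the answer above; the function name and the shape of its inputs are assumptions.

```python
def should_retrain(per_intent_f1, fallback_rate, f1_floor=0.80, fallback_ceiling=0.15):
    """Return (retrain?, reasons) from per-intent F1 scores and the fallback rate.

    per_intent_f1: dict of intent name to F1 on a fresh test set.
    fallback_rate: fraction of production messages that hit the fallback.
    """
    reasons = []
    weak = sorted(name for name, score in per_intent_f1.items() if score < f1_floor)
    if weak:
        reasons.append("low F1: " + ", ".join(weak))
    if fallback_rate > fallback_ceiling:
        reasons.append(f"fallback rate {fallback_rate:.0%}")
    return bool(reasons), reasons
```

Returning the reasons along with the decision helps tell drift (a rising fallback rate) apart from overlap (specific intents with low F1).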
Sources
- Google Cloud. Dialogflow intents. cloud.google.com/dialogflow (verified 26 May 2026).
- Rasa documentation. NLU intent classification. rasa.com/docs/rasa/nlu-only (verified 26 May 2026).
- Industry practitioner blogs (Botpress, Microsoft Bot Framework).