Intent Recognition · NLU task
Intent recognition (also called intent classification or intent detection) is the NLU task of mapping a user's message to a predefined category of what they want. For example, the message "I need to change my address" maps to the intent update_address. Intent recognition lets chatbots route messages to the appropriate flow or response, and is a core building block of NLU-driven conversational AI.
By Chatbotscape Editorial · Published 26 May 2026 · Updated 26 May 2026

Intent Recognition — Definition, How It Works, and Chatbot Examples (2026)

Quick answer
Intent recognition is how a chatbot figures out what a user wants. The message is classified into a predefined intent such as "book_appointment" or "cancel_order", and the chatbot uses that label to pick the right reply.

What it is

Intent recognition is the most fundamental NLU task in conversational AI. The bot designer defines a set of intents (typically 5-50 in total for a production chatbot, or roughly 5-15 per domain; larger sets become hard to manage), provides training examples per intent, and the NLU engine learns to map user messages to the closest match.

Example intent definitions:

  • book_appointment — training: "I want to schedule a meeting", "book me for tomorrow", "I'd like an appointment", "can we set up a time?"
  • cancel_subscription — training: "I want to cancel", "please end my subscription", "remove my account", "I'm done — stop billing."
  • check_order_status — training: "where's my order?", "order status update", "when will my package arrive?"

At runtime, a user message is matched to the closest intent (with a confidence score) and the bot triggers the appropriate flow.

How intent recognition works technically

Two main approaches:

1. Traditional ML classifier

Vectorize the message (TF-IDF, word embeddings) → train a classifier (logistic regression, SVM, or neural network) → output probabilities per intent. The platform (Dialogflow, Rasa, Botpress NLU) handles all this; the operator only provides training examples.

Strengths: cheap to run, predictable, easy to audit. Weaknesses: requires labeled training data, and accuracy falls off a cliff for utterances outside the training distribution.
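As a rough illustration of the vectorize-then-classify pipeline, here is a minimal sketch using only the Python standard library. It is a toy under stated assumptions: instead of a trained logistic regression or SVM it uses a simpler nearest-centroid rule over TF-IDF vectors, the training phrases are adapted from the example intents above, and every function name is hypothetical (this is not how Dialogflow, Rasa, or Botpress implement it).

```python
import math
import re
from collections import Counter

# Hypothetical training data, adapted from the example intents above.
TRAINING = {
    "book_appointment": ["I want to schedule a meeting", "book me for tomorrow",
                         "I'd like an appointment", "can we set up a time?"],
    "cancel_subscription": ["I want to cancel", "please end my subscription",
                            "remove my account", "I'm done, stop billing."],
    "check_order_status": ["where's my order?", "order status update",
                           "when will my package arrive?"],
}


def tokenize(text):
    """Lowercase and split into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def normalize(vec):
    """Scale a sparse vector to unit L2 length; empty if it has no weight."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def vectorize(tokens, idf):
    """TF-IDF vector over known vocabulary only; unseen tokens are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return normalize({t: c * idf[t] for t, c in counts.items()})


def train(training):
    """Return smoothed IDF weights and one unit-length centroid per intent."""
    docs = [(intent, tokenize(t)) for intent, texts in training.items() for t in texts]
    df = Counter(tok for _, toks in docs for tok in set(toks))
    idf = {tok: math.log((1 + len(docs)) / (1 + n)) + 1 for tok, n in df.items()}
    centroids = {}
    for intent in training:
        total = Counter()
        for doc_intent, toks in docs:
            if doc_intent == intent:
                total.update(vectorize(toks, idf))
        centroids[intent] = normalize(total)
    return idf, centroids


def classify(message, idf, centroids):
    """Return (intent, cosine similarity), or (None, 0.0) if nothing overlaps."""
    vec = vectorize(tokenize(message), idf)
    scores = {intent: sum(w * c.get(t, 0.0) for t, w in vec.items())
              for intent, c in centroids.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)


idf, centroids = train(TRAINING)
print(classify("where is my order", idf, centroids))
```

The cosine score here is a similarity, not a calibrated probability. A message made entirely of unseen words returns no match, which is the "outside the training distribution" weakness in miniature.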

2. LLM-based zero-shot or few-shot classification

Prompt an LLM with the list of intents and descriptions, and ask "which intent matches this message?" No explicit training required.

Strengths: works with zero examples, handles phrasing variation gracefully, multilingual out-of-the-box. Weaknesses: more expensive per call (LLM tokens), less deterministic.
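The zero-shot approach can be sketched as a prompt builder plus a strict parser for the model's reply. This is an assumption-heavy illustration: the intent descriptions, the out_of_scope label, and the `complete` callable (any function that takes a prompt string and returns text) are all hypothetical, and no specific vendor API is implied.

```python
# Hypothetical intent descriptions; names and wording are illustrative only.
INTENTS = {
    "check_order_status": "The user asks where an order is or when it will arrive.",
    "request_refund": "The user wants money back for a purchase.",
    "talk_to_human": "The user asks to speak with a human agent.",
}
OUT_OF_SCOPE = "out_of_scope"


def build_prompt(message, intents):
    """List every intent with its description and ask for a single label."""
    lines = [f"- {name}: {desc}" for name, desc in intents.items()]
    return (
        "Classify the user message into exactly one intent.\n"
        "Intents:\n" + "\n".join(lines) + "\n"
        f"- {OUT_OF_SCOPE}: none of the above fits.\n"
        "Reply with the intent name only.\n\n"
        f"User message: {message!r}"
    )


def parse_reply(reply, intents):
    """Map free-form model output to a known label, else out_of_scope."""
    label = reply.strip().strip("`\"'. ").lower()
    return label if label in intents else OUT_OF_SCOPE


def classify_with_llm(message, intents, complete):
    """`complete` is an assumed callable: prompt text in, model text out."""
    return parse_reply(complete(build_prompt(message, intents)), intents)
```

Parsing strictly against the known label set is one way to contain the "less deterministic" weakness: anything the model says that is not an exact intent name is treated as out of scope rather than guessed at.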

Modern production systems often combine both — cheap classifier for high-confidence matches, LLM fallback for ambiguous cases.
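A minimal sketch of that combination, assuming two hypothetical callables: `fast_classify` returns an (intent, confidence) pair and `llm_classify` returns an intent label. The 0.7 threshold is an arbitrary placeholder, not a recommended value.

```python
def route(message, fast_classify, llm_classify, threshold=0.7):
    """Return (intent, source): use the cheap classifier when it is confident,
    otherwise escalate the message to the LLM."""
    intent, confidence = fast_classify(message)
    if intent is not None and confidence >= threshold:
        return intent, "classifier"
    return llm_classify(message), "llm"
```

Returning the source alongside the intent makes it easy to log how often the LLM path is taken, which is the number that drives cost.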

Intent recognition in chatbot platforms

  • Dialogflow: intent-first architecture. Operators define intents with training phrases, and Dialogflow trains a classifier.
  • Botpress: supports both NLU intent classification and LLM-driven intent reasoning.
  • Manychat: flow-driven. Button taps bypass intent recognition, while the LLM-based "AI Replies" feature handles intent implicitly.
  • Rasa: open-source, with a dedicated intent classification component.

Best practices

  • Keep intent set small. 5-15 intents per domain. Too many causes overlap and misclassification.
  • Provide 10-30 training examples per intent. Cover phrasing variation (formal, casual, typo-laden, multilingual if applicable).
  • Include negative examples. Train an "out-of-scope" catchall intent for when nothing matches.
  • Test against real user messages. Synthetic training data doesn't reflect actual user phrasing. Mine real chat logs for new training examples weekly.
  • Set a confidence threshold. Below the threshold, fall back to a clarification question rather than guessing.
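The confidence-threshold practice can be sketched as follows. The function name, the 0.6 default, and the shape of the returned dictionary are assumptions for illustration; `scores` is assumed to be a non-empty mapping of intent name to confidence.

```python
def decide(scores, threshold=0.6):
    """Trigger the top intent if it clears the threshold; otherwise ask the
    user to choose between the two most likely intents."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_score = ranked[0]
    if top_score >= threshold:
        return {"action": "trigger", "intent": top_intent}
    return {"action": "clarify", "options": [name for name, _ in ranked[:2]]}
```

Offering the top two candidates as options is one possible design; it turns a likely misclassification into a single extra tap for the user.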

Worked example — rule-based vs LLM zero-shot

Consider a small e-commerce chatbot with three intents: check_order_status, request_refund, and talk_to_human. Here is how each approach handles representative test utterances:

User message | Rule-based classifier | LLM zero-shot
"where's my order?" | check_order_status (0.94) ✅ | check_order_status
"wheres my stuff" (typo, casual) | unmatched, fallback ⚠️ | check_order_status
"this product is broken, I need a refund" | request_refund (0.91) ✅ | request_refund
"I want to cancel AND get my money back" | request_refund (0.65), misses cancel ⚠️ | both intents flagged ✅
"Quero saber sobre meu pedido" (PT-BR) | unmatched ❌ | check_order_status
"Operating hours today?" | unmatched, fallback ⚠️ | out-of-scope flagged ✅

The classifier wins on cost and predictability (≈1 ms per call, no LLM token cost) but fails on typos, multi-intent utterances, and non-English input. The LLM handles all three but costs roughly 50-200× more per call. Production systems often combine the two: a fast classifier handles the high-confidence majority, and low-confidence cases route to the LLM.
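To make the cost trade-off concrete, here is a back-of-the-envelope sketch. It assumes the classifier runs on every message, the LLM runs only on the fraction that falls back, and cost is measured in units of one classifier call. The 10% fallback rate is a hypothetical figure, not a measurement.

```python
def cascade_cost(llm_cost_ratio, fallback_rate):
    """Average cost per message, in units of one classifier call."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return 1.0 + fallback_rate * llm_cost_ratio


for ratio in (50, 200):
    print(f"LLM at {ratio}x: cascade averages {cascade_cost(ratio, 0.10):.1f}x")
```

Under these assumptions a cascade with a 10% fallback rate averages about 6× to 21× the classifier-only cost, compared with 50× to 200× for sending every message to the LLM.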

Common failure modes

  1. Intent overlap. Two intents share too much vocabulary — check_order_status and check_shipping_eta both trigger on "where's my order." Fix by merging or by adding entity extraction to disambiguate.
  2. Out-of-scope underdetection. Without an explicit out-of-scope intent, every message gets classified into the closest defined intent — even when none fits. Always train an "other" catchall.
  3. Training data drift. Real user phrasing evolves. A bot trained six months ago misses newer slang or product references. Refresh quarterly.
  4. Multilingual gaps. Training in one language does not reliably transfer to others. Train per-language separately or use an LLM zero-shot fallback for unsupported locales.
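Intent overlap, the first failure mode above, can be screened for before training. The sketch below is one crude heuristic, not a standard tool: it flags intent pairs whose training vocabularies have a high Jaccard similarity. The function names and the 0.3 cutoff are assumptions.

```python
import re
from itertools import combinations


def vocabulary(examples):
    """Set of lowercase word tokens across all training examples."""
    return {tok for text in examples for tok in re.findall(r"[a-z']+", text.lower())}


def overlapping_intents(training, min_jaccard=0.3):
    """Return (intent_a, intent_b, jaccard) for pairs sharing much vocabulary."""
    vocab = {intent: vocabulary(texts) for intent, texts in training.items()}
    flagged = []
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        score = len(vocab[a] & vocab[b]) / len(union) if union else 0.0
        if score >= min_jaccard:
            flagged.append((a, b, score))
    return flagged


# Hypothetical training phrases for illustration.
sample = {
    "check_order_status": ["where's my order", "order status update"],
    "check_shipping_eta": ["where's my order", "when will my order ship"],
    "request_refund": ["I need a refund"],
}
print(overlapping_intents(sample))
```

Flagged pairs are candidates for merging or for entity-based disambiguation. Shared function words inflate the score, so a real check would likely filter stop words first.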

FAQ

Is intent recognition the same as classification?

Yes — "intent classification", "intent detection", and "intent recognition" are used interchangeably. All describe mapping a user message to a predefined intent category.

How many intents should I have in a chatbot?

5-15 per domain works well. Above 30-50, intent overlap and misclassification grow rapidly. Larger systems typically decompose into multiple bots or use hierarchical intents.

Can LLMs replace intent classification entirely?

For open-ended deployments, often yes — the LLM understands what the user wants without an explicit intent label. For high-volume or cost-sensitive scenarios, explicit classification is still cheaper and more auditable.

How do I measure intent classification quality?

Run a held-out test set (200+ representative utterances if available) through the classifier and compute three metrics: intent accuracy (% mapped to the correct intent), per-intent F1 (the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall), useful for spotting underperforming intents), and out-of-scope detection (% of off-topic messages flagged correctly). For multilingual deployments, run separate test sets per language, because per-language F1 often varies more than overall accuracy suggests.
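The three metrics can be computed from parallel lists of gold and predicted labels. The sketch below is a minimal standard-library version; the function name and the "out_of_scope" label are assumptions.

```python
def evaluate(gold, predicted, oos_label="out_of_scope"):
    """Return (accuracy, per-intent F1 dict, out-of-scope detection rate)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    f1 = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1[label] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)

    oos_total = sum(g == oos_label for g in gold)
    oos_hits = sum(g == oos_label and p == oos_label for g, p in pairs)
    oos_rate = oos_hits / oos_total if oos_total else None
    return accuracy, f1, oos_rate
```

The out-of-scope rate is returned as None when the test set contains no off-topic messages, which signals that the test set cannot measure that metric at all.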

What's the threshold for retraining a classifier?

When per-intent F1 drops below 0.80 on a fresh test set, or when fallback rate exceeds 15% in production, retrain. Both signals indicate either training data drift (user phrasing evolved) or intent overlap (new intents added without rebalancing). Mature deployments retrain quarterly even without these triggers, just to catch slow drift.
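These two triggers are simple enough to automate. The sketch below uses the thresholds stated above (0.80 and 15%) as defaults; the function name and the return shape are hypothetical.

```python
def should_retrain(per_intent_f1, fallback_rate, min_f1=0.80, max_fallback_rate=0.15):
    """Return (retrain?, reasons) from per-intent F1 scores and the
    production fallback rate (a fraction between 0 and 1)."""
    reasons = []
    weak = sorted(intent for intent, f1 in per_intent_f1.items() if f1 < min_f1)
    if weak:
        reasons.append("low F1: " + ", ".join(weak))
    if fallback_rate > max_fallback_rate:
        reasons.append(f"fallback rate {fallback_rate:.0%}")
    return bool(reasons), reasons
```

Returning the reasons along with the decision shows whether the problem is concentrated in a few intents (likely overlap) or spread across the bot (likely drift).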
