Intent Confidence · NLU metric
Intent confidence is the score an NLU engine attaches to its best guess at what a user meant — a number, usually between 0 and 1, that says how sure the classifier is that a message belongs to a given intent. It is not the answer; it is the bot's certainty about the answer. That distinction is the whole point: a chatbot that ignores the confidence score and acts on every guess will confidently give wrong answers, and one that demands near-certainty before acting will fall back on perfectly reasonable questions. The score exists so the bot can decide whether to act, ask, or hand off.
By Chatbotscape Editorial · Methodology · Published 20 June 2026 · Updated 20 June 2026

Intent Confidence — What the Score Means and How to Act on It (2026)

Quick answer: Intent confidence is how sure the bot is that it understood you, expressed as a number — typically 0 to 1, where 0.9 means "very sure" and 0.4 means "barely a guess." Every intent-recognition engine produces one alongside its match. What matters is not the score itself but what the bot does with it: act on a high score, ask a clarifying question on a middling one, and fall back or hand off on a low one. The single most common cause of a chatbot answering the wrong question is a bot that was built to act on its top guess regardless of how unsure it was.

What it is

When a user types "where's my stuff," an NLU classifier does not return a single clean answer. It returns a ranked list of candidate intents, each with a score: maybe check_order_status at 0.88, track_shipment at 0.71, and request_refund at 0.22. Intent confidence is that top number — the engine's own estimate of how well the message matches the intent it picked. A natural language understanding system uses it as a gate: above some line the bot trusts the match and runs the flow, below it the bot does something safer than guessing.
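To make the gate concrete, here is a minimal sketch in Python using the example scores above. The `Candidate` type, the `gate` function, the `safe_path` label, and the 0.7 threshold are all hypothetical names and values chosen for illustration; a real engine returns its own response shape.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    intent: str
    score: float


def top_candidate(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate; its score is the intent confidence."""
    return max(candidates, key=lambda c: c.score)


def gate(candidates: list[Candidate], threshold: float = 0.7) -> str:
    """Run the matched flow at or above the threshold, otherwise take a safer path.

    The 0.7 default is an assumed value, not a recommendation.
    """
    best = top_candidate(candidates)
    if best.score >= threshold:
        return f"run:{best.intent}"
    return "safe_path"


ranked = [
    Candidate("check_order_status", 0.88),
    Candidate("track_shipment", 0.71),
    Candidate("request_refund", 0.22),
]
print(gate(ranked))  # run:check_order_status
```

The point of the sketch is only that the score, not the intent name, decides which branch runs.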

The reason the score exists at all is that misunderstanding is not free. A bot with no confidence gate treats its best guess as gospel, so a message it barely understood gets the same confident treatment as one it nailed. Confidence is the mechanism that lets a bot say, in effect, "I think you want order status, but that match only scored 0.55, so let me check before I send you down that path." That single behavior is the difference between a bot that feels careful and one that feels reckless.

The number is a relative signal, not a probability

The most important and least understood fact about intent confidence: the score is not a calibrated probability. A confidence of 0.82 does not mean the classifier is right 82% of the time when it reports 0.82. It means only that this match scored higher than the alternatives, on whatever internal scale the model uses. Treating the number as a calibrated probability is the classic mistake, and it leads operators to set thresholds by gut feeling and then wonder why the bot still misfires.

What the score reliably gives you is ordering — a 0.9 match is genuinely more trustworthy than a 0.5 match on the same bot. What it does not give you is a portable meaning: 0.7 on one platform's classifier is not the same trust level as 0.7 on another's, and the same engine's scores can shift after retraining. So the right way to read confidence is comparatively and empirically: pull real transcripts, look at which scores actually corresponded to correct matches on your bot, and set your bands from that evidence rather than from the number's face value.
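One way to read scores empirically is to bucket reviewed transcript turns by score and look at how often each bucket was actually right. The sketch below assumes you have hand-labeled turns as `(score, was_correct)` pairs; the bucket edges and the sample data are invented for illustration.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries; pick edges that suit your own score distribution.
EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]


def accuracy_by_bucket(turns, edges=EDGES):
    """Map each (low, high) score bucket to the share of turns that matched correctly.

    turns: iterable of (score, was_correct) pairs from reviewed transcripts.
    Buckets with no turns are left out of the report.
    """
    totals = [0] * (len(edges) + 1)
    correct = [0] * (len(edges) + 1)
    for score, was_correct in turns:
        i = bisect_right(edges, score)  # index of the bucket this score falls into
        totals[i] += 1
        correct[i] += int(was_correct)

    report = {}
    for i, total in enumerate(totals):
        if total == 0:
            continue
        low = 0.0 if i == 0 else edges[i - 1]
        high = 1.0 if i == len(edges) else edges[i]
        report[(low, high)] = correct[i] / total
    return report


# Invented sample: in practice this comes from transcripts you reviewed by hand.
sample_turns = [
    (0.93, True), (0.91, True),
    (0.74, True), (0.72, False),
    (0.45, True), (0.41, False), (0.38, False),
]
for (low, high), accuracy in accuracy_by_bucket(sample_turns).items():
    print(f"{low:.1f} to {high:.1f}: {accuracy:.0%} correct")
```

With enough real turns per bucket, the point where observed accuracy drops off is a better guide to where the bands belong than the face value of the score.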

The three-band decision

Confidence is only useful if it changes the bot's behavior, and the durable pattern is three bands rather than a single on/off threshold. The working figures below are editorial guides drawn from deployment practice, not a published benchmark — the bands matter more than the exact cutoffs, which you calibrate per bot.

| Confidence band | What it means | What the bot should do |
| --- | --- | --- |
| High (≈ 0.85+) | Strong, unambiguous match | Act on the intent: run the flow, no friction |
| Medium (≈ 0.5–0.85) | Plausible but not certain | Confirm: "Did you mean order tracking?" with tappable options |
| Low (≈ below 0.5) | Weak or no real match | Fall back gracefully, then offer a human handoff |

The medium band is where good bots are separated from bad ones. A two-band bot (act or fail) has no choice but to either guess on a 0.6 — and risk a wrong answer — or fall back on it and frustrate a user it half-understood. A three-band bot turns that 0.6 into a one-tap confirmation, which costs the user a second and saves the conversation. The design decision behind those bands is its own discipline, covered in the companion guide on designing a confidence policy.
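The three bands reduce to a small decision function. This is a sketch under the editorial cutoffs above (0.85 and 0.5), which are starting points to calibrate per bot; the `Action` enum and `decide` function are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    ACT = "act"            # run the flow
    CONFIRM = "confirm"    # ask "Did you mean ...?"
    FALLBACK = "fallback"  # fall back, then offer a human handoff


def decide(score: float, high: float = 0.85, low: float = 0.5) -> Action:
    """Map a confidence score to one of three behaviors.

    The default cutoffs mirror the editorial guide figures and are assumptions,
    not benchmarks.
    """
    if score >= high:
        return Action.ACT
    if score >= low:
        return Action.CONFIRM
    return Action.FALLBACK


print(decide(0.9))  # Action.ACT
print(decide(0.6))  # Action.CONFIRM
print(decide(0.3))  # Action.FALLBACK
```

Keeping the cutoffs as parameters rather than constants makes it easier to adjust them after a transcript review.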

The threshold trade-off

The line between "act" and "don't act" is a single knob with two failure modes pulling in opposite directions, and you cannot escape the trade — you can only choose where to sit on it.

Set the threshold too high and the bot demands near-certainty it rarely has, so reasonable messages score just under the line and trigger the fallback intent. The visible symptom is a climbing fallback rate: the bot says "sorry, I didn't understand" to people it understood perfectly well.

Set the threshold too low and the bot acts on weak guesses, so messages it barely matched get force-routed into the nearest intent. The symptom here is invisible in the fallback metric — the user gets a confident wrong answer, not an apology, and nothing logs a failure. This is the more dangerous setting precisely because it hides. The only way to catch it is to read transcripts for wrong-answer turns, which is why the reduce-fallback-rate playbook insists on pairing any threshold change with a wrong-answer check rather than watching fallback rate alone.

The honest framing: lowering the threshold trades visible failures for invisible ones. Most of the time visible-and-recoverable beats invisible-and-wrong, which is why a slightly cautious bot with a clean confirmation step usually outperforms an eager one that guesses.
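The trade can be made visible by replaying labeled turns at several thresholds and counting both failure modes side by side. The sketch below assumes a simple two-band bot (act at or above the threshold, fall back below it); the `sweep` function and the sample turns are invented for illustration.

```python
def sweep(turns, thresholds):
    """For each threshold, return (threshold, fallback_rate, wrong_answer_rate).

    turns: list of (score, was_correct) pairs from reviewed transcripts.
    A turn below the threshold falls back; a turn at or above it is acted on,
    and counts as a wrong answer if the match was incorrect.
    """
    n = len(turns)
    rows = []
    for t in thresholds:
        fallbacks = sum(1 for score, _ in turns if score < t)
        wrong = sum(1 for score, ok in turns if score >= t and not ok)
        rows.append((t, fallbacks / n, wrong / n))
    return rows


# Invented sample of eight reviewed turns.
labeled_turns = [
    (0.95, True), (0.80, True), (0.65, True), (0.60, False),
    (0.45, False), (0.40, True), (0.30, False), (0.20, False),
]
for threshold, fallback_rate, wrong_rate in sweep(labeled_turns, [0.3, 0.5, 0.7]):
    print(f"threshold {threshold}: fallback {fallback_rate:.1%}, wrong answers {wrong_rate:.1%}")
```

On this invented sample, moving the threshold from 0.3 to 0.7 raises the fallback rate from 12.5% to 75% while wrong answers drop from 37.5% to zero. Real data will be less tidy, but the two columns move in opposite directions in the same way, and only the first one shows up in a fallback dashboard.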

Classifier confidence versus LLM "confidence"

Where confidence comes from matters, because the two main engine types produce it very differently.

A traditional ML classifier — the kind behind intent-first builders — emits a genuine score per intent as part of its output. You can read it, threshold it, and log it directly. It is imperfectly calibrated, as noted above, but it exists as a first-class number you control.

A large language model does not natively hand you a clean intent-confidence score, and worse, LLMs are notoriously overconfident — asked "how sure are you," a model will often declare high certainty for an answer it invented. This is why you cannot simply ask a generative bot for its confidence and trust the reply. The practical workarounds are structural: have the model return a calibrated label or an explicit "I'm not sure" option as part of a constrained output, or — for retrieval-augmented bots — use the retrieval score (did the knowledge base actually return relevant material?) as the real confidence signal instead of the model's self-assessment. A RAG bot that answers when retrieval found nothing is the generative equivalent of a classifier acting on a 0.2.
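Both workarounds can be sketched without committing to any particular model or retriever. In the example below, `RetrievedChunk`, the 0.6 minimum score, and the label set are hypothetical: retrievers report relevance on different scales, so the cutoff has to come from your own data.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    score: float  # relevance score as reported by your retriever (scale varies)


NOT_SURE = "not_sure"
# Hypothetical label set the model is instructed to choose from.
ALLOWED_LABELS = {"check_order_status", "request_refund", NOT_SURE}


def should_answer(chunks: list[RetrievedChunk], min_score: float = 0.6) -> bool:
    """Answer only if retrieval returned at least one sufficiently relevant chunk.

    The 0.6 default is an assumed value for illustration.
    """
    return any(chunk.score >= min_score for chunk in chunks)


def parse_label(raw_output: str) -> str:
    """Map the model's constrained output to an allowed label.

    Anything outside the allowed set is treated as "not sure" rather than guessed at.
    """
    label = raw_output.strip().lower()
    return label if label in ALLOWED_LABELS else NOT_SURE


print(should_answer([]))                       # False
print(parse_label(" Request_Refund "))         # request_refund
print(parse_label("probably a refund thing"))  # not_sure
```

The design choice in both functions is the same: when the signal is missing or malformed, the code defaults to the cautious branch instead of letting the model's self-assessment stand in for evidence.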

How platforms expose it

Whether intent confidence is a setting you tune or a black box you cannot see is itself an evaluation criterion. Developer-grade builders such as Botpress and Voiceflow surface per-intent scores and an adjustable confidence threshold directly, so the three-band design is fully in your hands. Flow-first marketing builders like Manychat and SendPulse lean on keyword and button logic on most channels, where the "confidence" question becomes how strictly a keyword has to match before the default-reply block fires. Support-desk platforms such as Intercom and Tidio often abstract the raw score away and expose the behavior instead — when does it answer, when does it ask, when does it route to an agent.

The question to put to any platform is not "how accurate is your NLU" but "can I see the confidence score, can I set the threshold, and can I make the bot confirm in the middle band instead of guessing." A vendor that lets you read and act on confidence is handing you the controls; one that hides it is asking you to trust a single invisible line you cannot move. The cost of that hidden line shows up later as wrong answers nobody logged, the same way a buried fallback rate hides the failures you most need to see.

FAQ

What is intent confidence in a chatbot?

It is the score an NLU engine attaches to its best guess at what a user meant — usually a number from 0 to 1 that expresses how sure the classifier is about the intent it picked. The bot uses that score to decide whether to act on the match, ask a clarifying question, or fall back. A high score means a strong, trustworthy match; a low score means the bot barely understood and should not guess.

Is the confidence score a probability?

No, and treating it as one is the most common mistake. A confidence of 0.8 does not mean the bot is right 80% of the time at that score. The number is a relative signal — a 0.9 is more trustworthy than a 0.5 on the same bot — but it is not a calibrated probability and is not comparable across platforms. Set your thresholds from real transcripts on your own bot, not from the number's face value.

What confidence threshold should I set?

There is no universal answer because the score is not portable, but the durable pattern is three bands rather than one line: act on high confidence (roughly 0.85+), confirm with a clarifying question in the middle (roughly 0.5-0.85), and fall back or hand off below about 0.5. Calibrate the exact cutoffs by reading transcripts and checking which scores actually matched correctly on your deployment.

Why does my bot give confident wrong answers?

Almost always because the confidence threshold is too low, so the bot acts on weak guesses instead of asking. The damage is invisible in your fallback rate — the user gets a wrong answer, not an apology, so nothing logs a failure. Catch it by reading transcripts for wrong-answer turns, and raise the threshold or add a confirmation step in the middle band.

Can I get a confidence score out of an LLM-based chatbot?

Not reliably by asking it — large language models are overconfident and will often claim certainty for answers they invented. Instead, constrain the model's output to include a calibrated label or an explicit "not sure" option, or, on a retrieval-augmented bot, use the retrieval score (whether the knowledge base returned relevant material) as the real confidence signal rather than the model's self-assessment.
