Two PRs both receive confidence = 0.31 and are flagged for human-in-the-loop (HITL) review, but for completely different reasons. Knowing why the agent is uncertain tells the reviewer what to do next.
DATA_UNCERTAINTY · Aleatoric
A field removal is flagged: the consumer reads it in 3 of 8 request paths but it appears optional in 2 of those. The evidence is genuinely ambiguous — the contract diff alone cannot resolve it.
→ Read the diff more carefully
SCOPE_UNCERTAINTY · Epistemic
A Protobuf field is renamed but there is no prior schema version in the consumer registry. The agent lacks domain knowledge — the gap is structural, not evidence-related.
→ Escalate to a Protobuf specialist
⚠ Without classification
PR A · conf=0.31 → HITL REVIEW
PR B · conf=0.31 → HITL REVIEW
Identical tickets. No guidance. Reviewer guesses. Avg: 6.1 min
↓ with Uncertainty Source Classification ↓
PR A → DATA_UNCERTAINTY
Review the diff · examine 3 of 8 consumer paths
⏱ Resolved in 4.2 min avg
PR B → SCOPE_UNCERTAINTY
Escalate to Protobuf specialist · file coverage gap
⏱ Resolved in 4.2 min avg
A secondary LLM classification prompt fires concurrently for every low-confidence verdict (score < 0.35). Its cost is O(n), one small LLM call per flagged consumer, and it never blocks the HITL gate.
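The concurrent, non-blocking behaviour described above could be sketched roughly as follows. This is a minimal illustration and not the project's implementation: `Verdict`, `classify_uncertainty`, `classify_flagged`, and `review_pipeline` are hypothetical names, and the classifier is a toy stand-in for the secondary LLM prompt. Only the 0.35 cutoff comes from the text.

```python
import asyncio
from dataclasses import dataclass

LOW_CONFIDENCE_CUTOFF = 0.35  # from the text: classify verdicts scoring below this


@dataclass
class Verdict:
    consumer: str      # hypothetical identifier for the affected consumer
    confidence: float


async def classify_uncertainty(verdict: Verdict) -> str:
    """Hypothetical stand-in for the secondary LLM classification prompt."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    # Toy rule for illustration only; a real classifier would prompt an LLM.
    if verdict.consumer.endswith(".proto"):
        return "SCOPE_UNCERTAINTY"
    return "DATA_UNCERTAINTY"


async def classify_flagged(verdicts: list[Verdict]) -> dict[str, str]:
    """One small call per flagged consumer (O(n)), all issued concurrently."""
    flagged = [v for v in verdicts if v.confidence < LOW_CONFIDENCE_CUTOFF]
    labels = await asyncio.gather(*(classify_uncertainty(v) for v in flagged))
    return {v.consumer: label for v, label in zip(flagged, labels)}


async def review_pipeline(verdicts: list[Verdict]):
    # Start classification in the background; the gate does not wait for it.
    task = asyncio.create_task(classify_flagged(verdicts))
    # The HITL gate decision is made immediately from the confidence score.
    tickets = [v.consumer for v in verdicts if v.confidence < LOW_CONFIDENCE_CUTOFF]
    # Labels are attached to the tickets once the classifier returns.
    labels = await task
    return tickets, labels
```

In this sketch the gate's ticket list depends only on the confidence score, so a slow or failed classification would delay the label and not the review itself.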
DATA_UNCERTAINTY needs a higher confidence threshold (ambiguous evidence is harder to auto-decide). SCOPE_UNCERTAINTY is detectable at a lower threshold (knowledge gaps produce more extreme low-confidence signals). Type-stratified thresholds cut unnecessary HITL escalation by 23%.
DATA_UNCERTAINTY · threshold 0.38 · HITL rate 34.2%
SCOPE_UNCERTAINTY · threshold 0.29 · HITL rate 18.7%
Global baseline [4] · threshold 0.32 · HITL rate 28.4%
23% reduction in unnecessary HITL escalations
34 → 26 escalations · FNR preserved at 0.097 ≤ 0.10
SCOPE cases escape unnecessary escalation (18.7% vs 28.4%) while DATA cases receive the higher scrutiny they need (34.2% — higher than global but with lower FNR risk).
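A type-stratified gate can be sketched in a few lines. The threshold values are the ones reported above. The rule that a verdict is escalated when its confidence falls below the threshold for its type, with unclassified verdicts falling back to the global baseline, is an assumption made for illustration, as is the name `needs_hitl`.

```python
from typing import Optional

# Thresholds reported in the text.
TYPE_THRESHOLDS = {
    "DATA_UNCERTAINTY": 0.38,
    "SCOPE_UNCERTAINTY": 0.29,
}
GLOBAL_THRESHOLD = 0.32


def needs_hitl(confidence: float, uncertainty_type: Optional[str] = None) -> bool:
    """Assumed rule: escalate when confidence is below the type's threshold.

    Unknown or missing types fall back to the global baseline.
    """
    threshold = TYPE_THRESHOLDS.get(uncertainty_type, GLOBAL_THRESHOLD)
    return confidence < threshold
```

Under this assumed rule, a DATA verdict at 0.35 is escalated where the global threshold would have let it through, and a SCOPE verdict at 0.30 is not escalated where the global threshold would have flagged it.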
Knowing the uncertainty type transforms the reviewer's experience. Targeted action context cuts median resolution time from 6.1 min to 4.2 min, a 31% improvement (p < 0.05, Mann–Whitney U).
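The headline percentages follow from the reported before and after figures. A quick check, using a hypothetical helper `relative_reduction`:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# Resolution time: 6.1 min before, 4.2 min after (figures from the text).
time_saving = relative_reduction(6.1, 4.2)

# Escalations: 34 with the global threshold, 26 with stratified thresholds.
escalation_cut = relative_reduction(34, 26)

print(f"resolution time: {time_saving:.1%}")   # 31.1%
print(f"escalations:     {escalation_cut:.1%}")  # 23.5%
```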
⚠ Before: single-threshold HITL
PR #334 · conf=0.31 · REVIEW REQUIRED
PR #107 · conf=0.28 · REVIEW REQUIRED
Reviewer receives: low confidence flag + diff link. No guidance on where to look or who to involve. Investigates from scratch.
⏱ 6.1 min avg
✓ After: uncertainty source classification
PR #334 → DATA_UNCERTAINTY → examine field removal at paths 3, 6, 7
PR #107 → SCOPE_UNCERTAINTY → escalate to Protobuf specialist
Reviewer receives: classification label + one-sentence explanation + targeted action. Knows immediately what to look at and who to involve.
⏱ 4.2 min avg (31% faster)
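The three pieces the reviewer receives (label, one-sentence explanation, targeted action) map naturally onto a small record. The sketch below is illustrative only: `ReviewTicket` and its field names are hypothetical, and the rendered format simply mirrors the ticket lines shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewTicket:
    pr: str            # e.g. "PR #107"
    confidence: float
    label: str         # DATA_UNCERTAINTY or SCOPE_UNCERTAINTY
    explanation: str   # one-sentence reason for the label
    action: str        # what the reviewer should do next

    def render(self) -> str:
        """One-line summary in the style of the ticket lines above."""
        return f"{self.pr} · conf={self.confidence:.2f} → {self.label} → {self.action}"


ticket = ReviewTicket(
    pr="PR #107",
    confidence=0.28,
    label="SCOPE_UNCERTAINTY",
    explanation="No prior schema version in the consumer registry.",
    action="escalate to Protobuf specialist",
)
print(ticket.render())
```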
Classification reliability
LLM vs Human κ (avg)—
Human-human κ—
Classified verdicts—
UNCLASSIFIED rate—
HITL gate efficiency
HITL reduction—
Escalations: global 34
Escalations: stratified 26
FNR (target ≤ 0.10)—
Type-stratified thresholds
DATA_UNCERTAINTY α = 0.38 (HITL 34.2%)
SCOPE_UNCERTAINTY α = 0.29 (HITL 18.7%)
Global baseline α = 0.32 (HITL 28.4%)
Reviewer time saved—
K11tech Agentic AI QA series
1. Single-repo QA pipeline: LangGraph · 14 agents
2. Detect–Fix–Learn loop: Consensus gate · Auto-remediation
3. Cross-repo impact analysis: Contract registry · Graph traversal
4. Uncertainty quantification: Conformal prediction · HITL gate
5. Uncertainty source classification: This paper · DATA vs SCOPE · κ = 0.81
Open source · Apache License Version 2.0 · builds on the Microservice QA System