Beyond a single threshold: uncertainty source classification
and type-stratified conformal prediction for agentic CI/CD

K11tech Agentic AI QA System  ·  K11 Software Solutions LLC, Texas, United States  ·  Paper 5

DATA_UNCERTAINTY  ·  SCOPE_UNCERTAINTY  ·  κ = 0.81 inter-rater  ·  23% fewer HITL escalations

Two PRs both receive confidence = 0.31 and are flagged for HITL review, but for completely different reasons. Knowing why the agent is uncertain tells the reviewer what to do next.

DATA_UNCERTAINTY · Aleatoric

A field removal is flagged: the consumer reads it in 3 of 8 request paths but it appears optional in 2 of those. The evidence is genuinely ambiguous — the contract diff alone cannot resolve it.

→ Read the diff more carefully

SCOPE_UNCERTAINTY · Epistemic

A Protobuf field is renamed but there is no prior schema version in the consumer registry. The agent lacks domain knowledge — the gap is structural, not evidence-related.

→ Escalate to a Protobuf specialist

⚠ Without classification

PR A · conf=0.31 → HITL REVIEW
PR B · conf=0.31 → HITL REVIEW
Identical tickets. No guidance. Reviewer guesses.
⏱ 6.1 min avg

↓  with Uncertainty Source Classification  ↓
PR A → DATA_UNCERTAINTY
Review the diff · examine 3 of 8 consumer paths
⏱ Resolved in 4.2 min avg
PR B → SCOPE_UNCERTAINTY
Escalate to Protobuf specialist · file coverage gap
⏱ Resolved in 4.2 min avg

A secondary LLM classification prompt fires concurrently for every low-confidence verdict (score < 0.35). Its cost is O(n) in the number of flagged consumers (one small LLM call each), and it never blocks the HITL gate.

k11tech-qa classify --run 2026-06-14 · 4 flagged consumers
$ k11tech-qa classify --uncertainty-source --concurrent
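
The concurrent classification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the K11tech implementation: `Verdict`, `classify_source`, `classify_flagged`, and `run_gate` are hypothetical names, and the LLM call is replaced by a placeholder coroutine that always returns `UNCLASSIFIED`. Only the 0.35 cutoff comes from the text.

```python
import asyncio
from dataclasses import dataclass

CLASSIFY_BELOW = 0.35  # from the text: classification fires for score < 0.35


@dataclass
class Verdict:
    consumer: str      # hypothetical field: name of the flagged consumer
    confidence: float  # agent confidence score in [0, 1]


async def classify_source(verdict: Verdict) -> str:
    """Placeholder for the secondary LLM call (assumption, not the real prompt)."""
    await asyncio.sleep(0)  # stands in for network latency
    return "UNCLASSIFIED"


async def classify_flagged(verdicts: list[Verdict]) -> dict[str, str]:
    """One classification call per flagged consumer, all awaited concurrently."""
    flagged = [v for v in verdicts if v.confidence < CLASSIFY_BELOW]
    labels = await asyncio.gather(*(classify_source(v) for v in flagged))
    return {v.consumer: label for v, label in zip(flagged, labels)}


async def run_gate(verdicts: list[Verdict]) -> tuple[list[str], dict[str, str]]:
    """Build the HITL queue without waiting for the classifier to finish."""
    task = asyncio.create_task(classify_flagged(verdicts))
    # The queue depends only on confidence, so it is ready before any label arrives.
    hitl_queue = [v.consumer for v in verdicts if v.confidence < CLASSIFY_BELOW]
    labels = await task  # labels are attached to the tickets afterwards
    return hitl_queue, labels
```

Because the queue is derived from confidence alone, a slow or failed classification call would at worst leave a ticket unlabeled; it would not delay the review itself.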

DATA_UNCERTAINTY needs a higher confidence threshold (ambiguous evidence is harder to auto-decide). SCOPE_UNCERTAINTY is detectable at a lower threshold (knowledge gaps produce more extreme low-confidence signals). Type-stratified thresholds cut unnecessary HITL escalation by 23%.

DATA_UNCERTAINTY  ·  threshold 0.38  ·  HITL rate 34.2%
SCOPE_UNCERTAINTY  ·  threshold 0.29  ·  HITL rate 18.7%
Global baseline [4]  ·  threshold 0.32  ·  HITL rate 28.4%

23% reduction in unnecessary HITL escalations
34 → 26 escalations  ·  FNR preserved at 0.097 ≤ 0.10

SCOPE cases escape unnecessary escalation (18.7% vs 28.4%), while DATA cases receive the higher scrutiny they need (34.2%, above the global rate but with lower FNR risk).
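
A minimal sketch of how type-stratified thresholds could drive the HITL decision, using the three threshold values reported above. The function names, the strict less-than comparison, and the fallback to the global threshold for unlabeled verdicts are assumptions made for illustration; the source does not specify them.

```python
from typing import Optional

# Threshold values as reported in this section.
TYPE_THRESHOLDS = {
    "DATA_UNCERTAINTY": 0.38,
    "SCOPE_UNCERTAINTY": 0.29,
}
GLOBAL_THRESHOLD = 0.32


def needs_hitl(confidence: float, source: Optional[str] = None) -> bool:
    """Escalate when confidence falls below the threshold for its type.

    Assumption: verdicts with no recognized label use the global threshold.
    """
    threshold = TYPE_THRESHOLDS.get(source, GLOBAL_THRESHOLD)
    return confidence < threshold


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


# The two PRs from the reviewer example are escalated under their own thresholds.
pr_334 = needs_hitl(0.31, "DATA_UNCERTAINTY")    # 0.31 < 0.38
pr_107 = needs_hitl(0.28, "SCOPE_UNCERTAINTY")   # 0.28 < 0.29

# A hypothetical SCOPE verdict at 0.30 is escalated by the global threshold
# but not by the stratified one.
scope_stratified = needs_hitl(0.30, "SCOPE_UNCERTAINTY")
scope_global = needs_hitl(0.30)

# 34 -> 26 escalations corresponds to the reported 23% reduction (23.5% unrounded).
reduction_pct = round(relative_reduction(34, 26) * 100, 1)
```

The last comparison is where the saving comes from: verdicts labeled SCOPE_UNCERTAINTY with confidence between 0.29 and 0.32 no longer reach a reviewer, while DATA_UNCERTAINTY verdicts between 0.32 and 0.38 now do.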

Knowing the uncertainty type transforms the reviewer's experience. Targeted action context cuts median resolution time from 6.1 min to 4.2 min, a 31% improvement (p < 0.05, Mann–Whitney U).
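
For readers who want to reproduce this kind of comparison on their own ticket data, the sketch below computes the Mann–Whitney U statistic and a two-sided p-value in pure Python. It uses the normal approximation without a tie correction, which is an assumption; the source does not say which variant was used. The function names are illustrative, and no data from the study is included.

```python
import math


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (U_x, U_y); U_x counts pairs with x_i > y_j, ties count as 0.5."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return u_x, len(x) * len(y) - u_x


def two_sided_p_normal(x: list[float], y: list[float]) -> float:
    """Two-sided p-value via the normal approximation (no tie correction)."""
    n, m = len(x), len(y)
    u_x, _ = mann_whitney_u(x, y)
    mean = n * m / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def relative_improvement(before: float, after: float) -> float:
    """Fractional improvement, e.g. a drop in resolution time."""
    return (before - after) / before


# The reported figures: 6.1 min -> 4.2 min.
improvement_pct = round(relative_improvement(6.1, 4.2) * 100, 1)
```

Calling `two_sided_p_normal(before_times, after_times)` with two lists of per-ticket resolution times gives an approximate p-value; for small samples or many ties an exact or tie-corrected test would be more appropriate.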

⚠ Before: single-threshold HITL

PR #334 · conf=0.31 · REVIEW REQUIRED
PR #107 · conf=0.28 · REVIEW REQUIRED
Reviewer receives: low confidence flag + diff link. No guidance on where to look or who to involve. Investigates from scratch.

⏱ 6.1 min avg

✓ After: uncertainty source classification

PR #334 → DATA_UNCERTAINTY
→ examine field removal at paths 3, 6, 7
PR #107 → SCOPE_UNCERTAINTY
→ escalate to Protobuf specialist
Reviewer receives: classification label + one-sentence explanation + targeted action. Knows immediately what to look at and who to involve.
⏱ 4.2 min avg  (31% faster)
INTER-RATER RELIABILITY · Cohen's kappa (κ)
Human Rater 1 vs Human Rater 2: 0.84 (almost perfect)
LLM Classifier vs Human Rater 1: 0.81 (almost perfect)
LLM Classifier vs Human Rater 2: 0.79 (substantial)
Reference line: κ = 0.70 threshold
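
Cohen's kappa compares observed agreement between two raters with the agreement expected by chance from each rater's label frequencies: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of that calculation follows; `cohens_kappa` is an illustrative helper, not code from the K11tech system.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)
```

Passing the LLM classifier's labels and one human rater's labels for the same verdicts would yield the kind of pairwise κ values listed above.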

Classification reliability

LLM vs Human κ (avg)
Human-human κ
Classified verdicts
UNCLASSIFIED rate

HITL gate efficiency

HITL reduction: 23%
Escalations, global: 34
Escalations, stratified: 26
FNR (target ≤ 0.10): 0.097

Type-stratified thresholds

DATA_UNCERTAINTY: α = 0.38 (HITL 34.2%)
SCOPE_UNCERTAINTY: α = 0.29 (HITL 18.7%)
Global baseline: α = 0.32 (HITL 28.4%)

Reviewer time saved: 31% (6.1 min → 4.2 min)

K11tech Agentic AI QA series

1  Single-repo QA pipeline (LangGraph · 14 agents)
2  Detect–Fix–Learn loop (Consensus gate · Auto-remediation)
3  Cross-repo impact analysis (Contract registry · Graph traversal)
4  Uncertainty quantification (Conformal prediction · HITL gate)
5  Uncertainty source classification (This paper · DATA vs SCOPE · κ = 0.81)

Open source · Apache License Version 2.0 · builds on the Microservice QA System