Beyond a single threshold: uncertainty source classification
and type-stratified conformal prediction for agentic CI/CD

K11tech Agentic AI QA System · K11 Software Solutions LLC, Texas, United States · Paper 5

DATA_UNCERTAINTY · SCOPE_UNCERTAINTY · κ = 0.81 inter-rater · 23% fewer HITL escalations

Two PRs both receive confidence = 0.31 and are flagged for human-in-the-loop (HITL) review, but for completely different reasons. Knowing why the agent is uncertain tells the reviewer what to do next.

DATA_UNCERTAINTY · Aleatoric

A field removal is flagged: the consumer reads the field in 3 of 8 request paths, but it appears optional in 2 of those. The evidence is genuinely ambiguous: the contract diff alone cannot resolve it.

→ Read the diff more carefully

SCOPE_UNCERTAINTY · Epistemic

A Protobuf field is renamed but there is no prior schema version in the consumer registry. The agent lacks domain knowledge — the gap is structural, not evidence-related.

→ Escalate to a Protobuf specialist

⚠ Without classification

PR A · conf=0.31 → HITL REVIEW
PR B · conf=0.31 → HITL REVIEW

Identical tickets. No guidance. Reviewer guesses. Avg: 6.1 min

↓ with Uncertainty Source Classification ↓

PR A → DATA_UNCERTAINTY

Review the diff · examine 3 of 8 consumer paths

⏱ Resolved in 4.2 min avg

PR B → SCOPE_UNCERTAINTY

Escalate to Protobuf specialist · file coverage gap

⏱ Resolved in 4.2 min avg

A secondary LLM classification prompt fires concurrently for every low-confidence verdict (score < 0.35). Its cost is O(n), one small LLM call per flagged consumer, and it never blocks the HITL gate.
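The concurrent fan-out described above can be sketched as follows. This is a minimal illustration, not the K11tech implementation: `Verdict`, `classify_source`, `classify_flagged`, and the consumer names are hypothetical, and the LLM call is replaced by a toy rule so the example runs offline. A failed call falls back to `UNCLASSIFIED`, which is one assumed way to keep the classifier from blocking the gate.

```python
import asyncio
from dataclasses import dataclass

CLASSIFY_BELOW = 0.35  # from the text: only verdicts scoring under 0.35 are classified


@dataclass
class Verdict:
    consumer: str
    confidence: float
    evidence: str


async def classify_source(verdict: Verdict) -> str:
    """Hypothetical stand-in for the secondary LLM call (one per flagged consumer)."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    # Toy rule for this sketch only: missing schema history signals a knowledge gap.
    if "no prior schema" in verdict.evidence:
        return "SCOPE_UNCERTAINTY"
    return "DATA_UNCERTAINTY"


async def classify_flagged(verdicts: list[Verdict]) -> dict[str, str]:
    flagged = [v for v in verdicts if v.confidence < CLASSIFY_BELOW]
    # One task per flagged consumer, so total work grows linearly with n.
    results = await asyncio.gather(
        *(classify_source(v) for v in flagged), return_exceptions=True
    )
    labels = {}
    for verdict, result in zip(flagged, results):
        # Assumed fallback: a failed call yields UNCLASSIFIED instead of an error.
        failed = isinstance(result, Exception)
        labels[verdict.consumer] = "UNCLASSIFIED" if failed else result
    return labels


verdicts = [
    Verdict("billing-api", 0.31, "field read in 3 of 8 paths, optional in 2"),
    Verdict("orders-grpc", 0.31, "field renamed, no prior schema version"),
    Verdict("search-api", 0.82, "additive change only"),
]
labels = asyncio.run(classify_flagged(verdicts))
print(labels)
```

Only the two verdicts below 0.35 are classified; the high-confidence verdict never triggers a call.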

k11tech-qa classify --run 2026-06-14 · 4 flagged consumers

$ k11tech-qa classify --uncertainty-source --concurrent

DATA_UNCERTAINTY needs a higher confidence threshold (ambiguous evidence is harder to auto-decide). SCOPE_UNCERTAINTY is detectable at a lower threshold (knowledge gaps produce more extreme low-confidence signals). Type-stratified thresholds cut unnecessary HITL escalation by 23%.

DATA_UNCERTAINTY

threshold 0.38
HITL rate 34.2%

SCOPE_UNCERTAINTY

threshold 0.29
HITL rate 18.7%

Global baseline [4]

threshold 0.32
HITL rate 28.4%
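Applying the three thresholds listed above could look like the sketch below. The threshold values come from the text; the function name `needs_hitl`, the strict `<` comparison, and the fallback to the global threshold for unlabelled verdicts are assumptions made for illustration.

```python
from typing import Optional

# Threshold values quoted in the text; the routing rule itself is an assumption.
TYPE_THRESHOLDS = {"DATA_UNCERTAINTY": 0.38, "SCOPE_UNCERTAINTY": 0.29}
GLOBAL_THRESHOLD = 0.32


def needs_hitl(confidence: float, source: Optional[str] = None) -> bool:
    """Return True when the verdict should be escalated to a human reviewer."""
    # Unknown or missing labels fall back to the global baseline threshold.
    threshold = TYPE_THRESHOLDS.get(source, GLOBAL_THRESHOLD)
    return confidence < threshold


for source in ("DATA_UNCERTAINTY", "SCOPE_UNCERTAINTY", None):
    print(source or "global", needs_hitl(0.33, source))
```

Under these assumptions, a verdict at 0.33 is escalated when labelled DATA_UNCERTAINTY but passes under both the SCOPE_UNCERTAINTY and global thresholds.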

23% reduction in unnecessary HITL escalations
34 → 26 escalations · FNR preserved at 0.097 ≤ 0.10

SCOPE cases escape unnecessary escalation (HITL rate 18.7% vs 28.4% under the global threshold), while DATA cases receive the higher scrutiny they need (34.2%, above the global rate but with lower FNR risk).
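One way per-type thresholds with a bounded false-negative rate could be derived is a split-conformal quantile computed separately for each uncertainty type. The sketch below is a generic illustration under stated assumptions, not the paper's calibration procedure: a false negative is taken to be a truly breaking change whose score clears the threshold and skips review, and the function names and calibration scores are synthetic.

```python
import math
from collections import defaultdict


def conformal_threshold(breaking_scores, alpha=0.10):
    """Score q such that a new breaking case exceeds q with probability <= alpha.

    Escalating every verdict with score <= q then bounds the miss rate,
    assuming calibration and future cases are exchangeable.
    """
    n = len(breaking_scores)
    # Finite-sample rank; round() guards against floating-point noise before ceil.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        return math.inf  # too few calibration cases: escalate everything
    return sorted(breaking_scores)[k - 1]


def stratified_thresholds(calibration, alpha=0.10):
    """Calibrate one threshold per uncertainty type from (type, score) pairs."""
    by_type = defaultdict(list)
    for source, score in calibration:
        by_type[source].append(score)
    return {s: conformal_threshold(v, alpha) for s, v in by_type.items()}


# Synthetic confidence scores of known-breaking calibration cases.
calibration = (
    [("DATA_UNCERTAINTY", i / 50) for i in range(1, 20)]
    + [("SCOPE_UNCERTAINTY", i / 40) for i in range(1, 10)]
)
thresholds = stratified_thresholds(calibration)
print(thresholds)
```

Because each type is calibrated on its own score distribution, a type whose breaking cases cluster at very low scores ends up with a lower threshold, which matches the direction of the DATA and SCOPE thresholds reported here.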

Knowing the uncertainty type gives the reviewer a concrete starting point. Targeted action context cuts median resolution time from 6.1 min to 4.2 min, a 31% improvement (p < 0.05, Mann–Whitney U).
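The 31% figure follows from (6.1 − 4.2) / 6.1 ≈ 0.311. For readers unfamiliar with the test named above, here is a minimal pure-Python sketch of a two-sided Mann–Whitney U test using the normal approximation without tie correction. The timing samples are made up for illustration and are not the study's data; `mann_whitney_u` is a hypothetical helper, and a statistics library would normally be preferred.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value) via the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # tied values share the mean of ranks i+1..j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, p_value


# Illustrative resolution times in minutes (synthetic, not from the paper).
before = [6.5, 5.8, 7.0, 6.1, 5.5, 6.9, 6.3, 5.9]
after = [4.0, 4.5, 3.9, 4.2, 4.8, 4.1, 4.4, 3.7]

u_stat, p_value = mann_whitney_u(before, after)
improvement_pct = round((6.1 - 4.2) / 6.1 * 100)
print(u_stat, p_value < 0.05, improvement_pct)
```

A rank-based test is a natural fit for resolution times, which tend to be skewed and are summarized here by the median.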

⚠ Before: single-threshold HITL

PR #334 · conf=0.31 · REVIEW REQUIRED

PR #107 · conf=0.28 · REVIEW REQUIRED

Reviewer receives: low-confidence flag + diff link. No guidance on where to look or whom to involve. Investigates from scratch.

⏱ 6.1 min avg

✓ After: uncertainty source classification

PR #334 → DATA_UNCERTAINTY
→ examine field removal at paths 3, 6, 7

PR #107 → SCOPE_UNCERTAINTY
→ escalate to Protobuf specialist

Reviewer receives: classification label + one-sentence explanation + targeted action. Knows immediately what to look at and who to involve.
⏱ 4.2 min avg (31% faster)
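The enriched ticket described above (classification label, one-sentence explanation, targeted action) could be modelled as a small record. This is an illustrative sketch: the `HitlTicket` class, its field names, and the explanation wording are hypothetical, while the PR numbers, scores, labels, and actions are taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitlTicket:
    pr: str
    confidence: float
    source: str       # DATA_UNCERTAINTY or SCOPE_UNCERTAINTY
    explanation: str  # one-sentence reason (wording here is illustrative)
    action: str       # targeted next step for the reviewer

    def render(self) -> str:
        """Format the ticket the way a reviewer might see it."""
        return (
            f"{self.pr} · conf={self.confidence:.2f} → {self.source}\n"
            f"  why: {self.explanation}\n"
            f"  → {self.action}"
        )


tickets = [
    HitlTicket(
        "PR #334", 0.31, "DATA_UNCERTAINTY",
        "Evidence about the removed field is ambiguous across consumer paths.",
        "examine field removal at paths 3, 6, 7",
    ),
    HitlTicket(
        "PR #107", 0.28, "SCOPE_UNCERTAINTY",
        "No prior schema version exists for the renamed Protobuf field.",
        "escalate to Protobuf specialist",
    ),
]
for ticket in tickets:
    print(ticket.render())
```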

Classification reliability

LLM vs Human κ (avg): not shown
Human-human κ: not shown
Classified verdicts: not shown
UNCLASSIFIED rate: not shown
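The κ values in this panel measure agreement between raters beyond what chance would produce. Assuming they are Cohen's kappa (the page does not say which variant is used), the statistic can be computed as in the sketch below; the label sequences are toy data, not the study's annotations.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


D, S = "DATA_UNCERTAINTY", "SCOPE_UNCERTAINTY"
llm_labels = [D, D, D, D, D, S, S, S, S, S]
human_labels = [D, D, D, D, S, S, S, S, S, S]
kappa = cohens_kappa(llm_labels, human_labels)
print(round(kappa, 2))
```

On these toy labels the raters agree on 9 of 10 items against a chance agreement of 0.5, giving κ = 0.8.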

HITL gate efficiency

HITL reduction: 23%
Escalations (global): 34
Escalations (stratified): 26
FNR (target ≤ 0.10): 0.097

Type-stratified thresholds

DATA_UNCERTAINTY: α = 0.38 (HITL 34.2%)
SCOPE_UNCERTAINTY: α = 0.29 (HITL 18.7%)
Global baseline: α = 0.32 (HITL 28.4%)

Reviewer time saved: 31% (6.1 → 4.2 min)

K11tech Agentic AI QA series

Single-repo QA pipeline: LangGraph · 14 agents
Detect–Fix–Learn loop: Consensus gate · Auto-remediation
Cross-repo impact analysis: Contract registry · Graph traversal
Uncertainty quantification: Conformal prediction · HITL gate
Uncertainty source classification: This paper · DATA vs SCOPE · κ = 0.81

Open source · Apache License Version 2.0 · builds on the Microservice QA System

GitHub repo · doi.org/10.5281/zenodo.20685350
