Beyond binary verdicts: aleatoric uncertainty quantification
in agentic CI/CD quality pipelines

K11tech Uncertainty QA System  ·  K11 Software Solutions LLC, Texas, United States  ·  Paper 4

Conformal prediction  ·  6 uncertainty-aware features  ·  90% coverage guarantee  ·  HITL confidence gating

A CI agent returns PASS or FAIL, so a 51% risk score and a 99% risk score produce the same verdict. Binary verdicts hide the uncertainty that matters most near the decision boundary.

Figure: the same four PRs under a traditional binary gate and, after conformal calibration, under the uncertainty-aware gate.

TRADITIONAL BINARY GATE
  PR #482  risk_score = 0.73
  PR #107  risk_score = 0.51
  PR #334  risk_score = 0.49
  PR #892  risk_score = 0.21
  ⚠ 0.51 vs 0.49: treated as opposites

UNCERTAINTY-AWARE GATE
  PR #482  0.73 ± 0.08  FAIL (confident)
  PR #107  0.51 ± 0.19  ⚠ REVIEW
  PR #334  0.49 ± 0.21  ⚠ REVIEW
  PR #892  0.21 ± 0.06  PASS (confident)
  ✓ PR #107 and #334 are correctly flagged as uncertain, not given opposite verdicts
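The contrast in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the K11tech implementation: the 0.5 decision boundary is an assumption inferred from the 0.51 vs 0.49 example, and the function names `binary_gate` and `interval_gate` are hypothetical.

```python
# Assumed decision boundary; the source implies 0.5 but does not state it.
THRESHOLD = 0.5


def binary_gate(risk: float, threshold: float = THRESHOLD) -> str:
    """Traditional gate: only the point estimate matters."""
    return "FAIL" if risk >= threshold else "PASS"


def interval_gate(risk: float, half_width: float,
                  threshold: float = THRESHOLD) -> str:
    """Uncertainty-aware gate: decide only when the whole interval
    [risk - half_width, risk + half_width] lies on one side of the boundary."""
    lower, upper = risk - half_width, risk + half_width
    if lower > threshold:
        return "FAIL"
    if upper < threshold:
        return "PASS"
    return "REVIEW"  # interval straddles the boundary


# The four PRs from the figure: (PR number, risk score, half-width).
prs = [(482, 0.73, 0.08), (107, 0.51, 0.19), (334, 0.49, 0.21), (892, 0.21, 0.06)]

for pr, risk, hw in prs:
    print(f"PR #{pr}: binary={binary_gate(risk)} interval={interval_gate(risk, hw)}")
```

Under these assumptions, PR #107 and PR #334 receive opposite verdicts from the binary gate but the same REVIEW outcome from the interval gate, which matches the behaviour the figure describes.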

The uncertainty quantification pipeline extracts 6 features, estimates aleatoric uncertainty with conformal prediction, and routes the decision through a tiered gate.
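One common way to obtain such a bound is split conformal prediction. The sketch below assumes that variant, with absolute residuals on a held-out calibration set as the nonconformity score; the source does not say which variant or score the system uses, and the function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha=0.10):
    """Finite-sample quantile of calibration nonconformity scores.

    With n calibration scores, the k-th smallest score with
    k = ceil((n + 1) * (1 - alpha)) gives at least (1 - alpha) marginal
    coverage under exchangeability. If k > n the calibration set is too
    small for the requested level, so the bound is infinite.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def prediction_interval(point, q):
    """Symmetric interval around a risk score, clipped to [0, 1]."""
    return max(0.0, point - q), min(1.0, point + q)


def empirical_coverage(residuals, q):
    """Fraction of held-out absolute residuals that fall within q."""
    return sum(r <= q for r in residuals) / len(residuals)
```

With alpha = 0.10 this targets the 90% coverage level quoted above. The guarantee is marginal (averaged over PRs), not conditional on any individual PR.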

k11tech-uncertainty-qa — PR #107 · payment-service
$ k11tech-qa run --pr 107 --uncertainty-mode

A reliability diagram shows calibration quality. Bars should align with the diagonal, meaning that model confidence matches actual accuracy. Conformal calibration corrects systematic overconfidence near the decision boundary.

Figure: reliability diagram. Axes: mean predicted confidence vs. actual accuracy. Series: pre-calibration (overconfident), post conformal calibration, perfect calibration (diagonal). ECE before calibration = 0.087 (high); ECE after calibration = 0.019.
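For readers unfamiliar with ECE (expected calibration error), the following sketch shows the standard equal-width binned estimate. The choice of 10 bins is an assumption; the source does not state how the reported values were computed.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the weighted average, over confidence bins, of
    |accuracy in bin - mean confidence in bin|.

    confidences: predicted confidences in [0, 1]
    correct:     booleans (or 0/1), True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence and is right 9 times out of 10 scores an ECE near zero; the same confidence with 5 correct out of 10 scores 0.4, which is the overconfidence pattern the diagram makes visible.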

The uncertainty score (0–1) determines routing. Low uncertainty → auto-decision. High uncertainty → human review. The four example PRs land in the following zones.

UNCERTAINTY SCORE
  AUTO           [0 – 0.35]
  LOG + MONITOR  [0.35 – 0.65]
  HITL REVIEW    [0.65 – 1.0]

  PR #892         score = 0.11 ± 0.06   → AUTO PASS       confidence: high
  PR #334         score = 0.49 ± 0.21   → LOG + MONITOR   uncertainty: medium
  PR #107 & #482  scores: 0.67, 0.88    → HITL REVIEW     escalated to reviewer

6 Uncertainty-Aware Features driving the gate decision
  F1 LLM confidence σ
  F2 inter-agent agreement
  F3 coverage delta
  F4 historical failure rate
  F5 semantic drift
  F6 complexity change

  → combined via conformal nonconformity score
  → calibrated uncertainty bound [lower, upper]
  → coverage guarantee: 90% of true labels fall within the prediction set
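The three zones map directly onto a small routing function. The sketch below uses the 0.35 and 0.65 cut points from the figure; because the published ranges share their endpoints, treating each boundary value as belonging to the higher (more cautious) tier is an assumption, and the `route` name is hypothetical.

```python
AUTO_MAX = 0.35     # upper edge of the AUTO zone (from the figure)
MONITOR_MAX = 0.65  # upper edge of the LOG + MONITOR zone (from the figure)


def route(uncertainty: float) -> str:
    """Map an uncertainty score in [0, 1] to a gate tier.

    Assumption: a score exactly on a boundary goes to the higher tier,
    since the source ranges overlap at 0.35 and 0.65.
    """
    if not 0.0 <= uncertainty <= 1.0:
        raise ValueError(f"uncertainty must be in [0, 1], got {uncertainty}")
    if uncertainty < AUTO_MAX:
        return "AUTO"
    if uncertainty < MONITOR_MAX:
        return "LOG + MONITOR"
    return "HITL REVIEW"


# Scores for the four example PRs as listed in the figure.
for pr, score in [(892, 0.11), (334, 0.49), (107, 0.67), (482, 0.88)]:
    print(f"PR #{pr}: {route(score)}")
```

Sending boundary values to the more cautious tier keeps the gate conservative: an ambiguous score costs a log entry or a review rather than an unreviewed automatic decision.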

Calibration quality

ECE (pre-calibration)
ECE (post-calibration)
Coverage @ 90% target
Brier score reduction

Gate performance

Auto-decision rate
HITL escalation rate
False HITL escalations
Reviewer agreement

Uncertainty features

Features used
Top feature: LLM conf. σ (weight 0.31)
Conformal α: 0.10 (90% cov.)
Avg uncertainty bound: ± 0.142

K11tech Agentic AI QA series

1. Single-repo QA pipeline (LangGraph · 14 agents)
2. Detect–Fix–Learn loop (Consensus gate · Auto-remediation)
3. Cross-repo impact analysis (Contract registry · Graph traversal)
4. Uncertainty quantification (This paper · Conformal prediction · HITL gate)
5. Uncertainty source classification (DATA vs SCOPE · Type-stratified thresholds)

Open source · Apache License Version 2.0 · builds on the Microservice QA System