Interactive demo — Consensus Risk Scoring, Automated Remediation & Adaptive HITL Threshold Learning
The K11tech Agentic QA System (Paper 1) achieved 91.2% defect detection, but it relied on a single LLM for risk scoring, used a fixed escalation threshold, and left every fix to developers. This paper closes the loop with three interlocking innovations.
Instead of trusting a single model, the system has three LLMs score each pull request (PR) independently. When they disagree, the disagreement itself is treated as the uncertainty signal and forces human-in-the-loop (HITL) escalation, regardless of the individual scores.
71% of consensus-forced HITL activations were independently assessed as genuinely ambiguous PRs.
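The consensus gate described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation. The source does not say how disagreement is measured or what cut-offs are used, so the max-minus-min spread, the DISAGREEMENT_LIMIT value, the ESCALATION_THRESHOLD value, and the function and field names are all assumptions made for the example. Risk scores are assumed to lie in [0, 1], with higher meaning riskier.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence

# Hypothetical values: the source gives neither the disagreement metric
# nor these cut-offs. They are placeholders for illustration only.
DISAGREEMENT_LIMIT = 0.20
ESCALATION_THRESHOLD = 0.80


@dataclass(frozen=True)
class ConsensusResult:
    mean_score: float
    spread: float
    escalate: bool
    reason: str


def consensus_gate(
    scores: Sequence[float],
    threshold: float = ESCALATION_THRESHOLD,
    disagreement_limit: float = DISAGREEMENT_LIMIT,
) -> ConsensusResult:
    """Combine three independent risk scores into one escalation decision."""
    if len(scores) != 3:
        raise ValueError("expected exactly three model scores")

    # Assumed disagreement measure: gap between the highest and lowest score.
    spread = max(scores) - min(scores)
    avg = mean(scores)

    # Disagreement forces escalation regardless of the individual scores.
    if spread > disagreement_limit:
        return ConsensusResult(avg, spread, True, "disagreement")
    # Otherwise fall back to the ordinary threshold check on the mean score.
    if avg >= threshold:
        return ConsensusResult(avg, spread, True, "high_risk")
    return ConsensusResult(avg, spread, False, "auto")
```

With these placeholder values, consensus_gate([0.10, 0.15, 0.60]) escalates because of the spread even though the mean score is well below the threshold, which is the behaviour the text describes.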
After consensus passes the HITL gate, all 8 test agents execute in parallel via the LangGraph Send API.
For each safe-class defect, the agent fetches the source file, generates a unified-diff patch with a self-assessed confidence score, and opens a remediation PR. Patches below 0.80 confidence are skipped.
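The patch step can be illustrated with a small sketch built on Python's standard difflib module. It covers only the unified-diff generation and the 0.80 confidence gate. Fetching the source file, producing the fix and its confidence score with an LLM, and opening the remediation PR are left out. The Patch type and the build_patch function are hypothetical names introduced for this example.

```python
import difflib
from dataclasses import dataclass
from typing import Optional

# Confidence cut-off stated in the text: patches below it are skipped.
MIN_PATCH_CONFIDENCE = 0.80


@dataclass(frozen=True)
class Patch:
    path: str
    diff: str
    confidence: float


def build_patch(
    path: str, original: str, fixed: str, confidence: float
) -> Optional[Patch]:
    """Return a unified-diff patch, or None if it should be skipped."""
    # Skip low-confidence fixes before doing any further work.
    if confidence < MIN_PATCH_CONFIDENCE:
        return None

    diff = "".join(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            fixed.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    # An empty diff means the proposed fix changed nothing.
    if not diff:
        return None
    return Patch(path=path, diff=diff, confidence=confidence)
```

In this sketch a patch with confidence exactly 0.80 is kept, since the text only says that patches below 0.80 are skipped.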
After each pipeline run, the reviewer's decision updates the HITL escalation threshold via an exponential moving average (EMA, alpha = 0.05), and the result is clamped to the safety bounds [0.70, 0.95]. In this way the system learns the team's actual risk tolerance automatically.
The threshold converges to a mean absolute error (MAE) of 0.018 within 80 decisions. The escape rate remains 0% under all scenarios, including distribution shift.
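The threshold update can be sketched as a clamped EMA step. Alpha = 0.05 and the bounds [0.70, 0.95] come from the text. The source does not say how a reviewer's decision is converted into a numeric value, so the sketch simply takes that value as a target argument. The target parameter and the update_threshold name are assumptions made for illustration.

```python
ALPHA = 0.05                              # EMA smoothing factor from the text
THRESHOLD_MIN, THRESHOLD_MAX = 0.70, 0.95  # safety bounds from the text


def update_threshold(current: float, target: float, alpha: float = ALPHA) -> float:
    """One EMA step toward `target`, clamped to the safety bounds.

    `target` is a hypothetical numeric signal derived from the reviewer's
    decision; how it is derived is not specified in the source.
    """
    blended = (1 - alpha) * current + alpha * target
    return min(THRESHOLD_MAX, max(THRESHOLD_MIN, blended))


if __name__ == "__main__":
    # Toy run: 80 decisions that all point at the same target value.
    threshold = 0.80
    for _ in range(80):
        threshold = update_threshold(threshold, target=0.90)
    print(f"threshold after 80 decisions: {threshold:.4f}")
```

A small alpha makes each individual decision move the threshold only slightly, so a single unusual review cannot swing the gate, and the clamp keeps the learned value inside the safety bounds however the feedback drifts. The toy run uses a constant target and is not a reproduction of the reported MAE.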