K11 Tech Lab Agentic AI QA System — Paper 2

Beyond Static Gates: Detect – Fix – Learn

Interactive demo — Consensus Risk Scoring, Automated Remediation & Adaptive HITL Threshold Learning

The Detect – Fix – Learn Loop

The K11tech Agentic QA System (Paper 1) achieved 91.2% defect detection, but it relied on a single LLM for risk scoring, used a fixed escalation threshold, and left every fix to developers. This paper closes the loop with three interlocking innovations.

🔍 DETECT: Consensus Gate
🔧 FIX: Auto-Remediation
🧠 LEARN: Adaptive Threshold

Full Pipeline Flow

GitHub PR → Phase 1: Analysis → 🆔 Consensus Gate (NEW) → HITL Gate → Phase 2: 8 Agents → 🔧 Phase 2.5: Remediation (NEW) → Phase 3: Report → 🧠 Threshold Learner (NEW) ↺ feedback

F1 score: 0.927 (vs 0.913 baseline)
Reduction in unnecessary HITL: 34%
Auto-patch acceptance rate: 89.3%

① Multi-LLM Consensus Gate — Epistemic Uncertainty as a Safety Signal

Instead of trusting a single model, three LLMs independently score each PR. When they disagree, that disagreement itself is the uncertainty signal — forcing HITL escalation regardless of the individual scores.

GPT-4o: 0.72 (HIGH risk)
Claude Sonnet 4.6: 0.75 (HIGH risk)
Gemini 1.5 Pro: 0.74 (HIGH risk)

Std deviation: 0.015 (threshold: < 0.20)
Agreement mode: majority (2 of 3)

✓ CONSENSUS REACHED: pipeline continues. Averaged score: 0.74 (HIGH)

When the models disagree, the gate reports instead:
⚠ MODELS DISAGREE: epistemic uncertainty detected. Forcing HITL escalation regardless of risk score.

Agreement Mode Guide

Mode         Condition                    HITL rate   Best for
Unanimous    All 3 agree + std < 0.20     20-30%      research, high-security
Majority ★   2 of 3 agree + std < 0.20    10-15%      production CI/CD
Weighted     std < 0.20 only              5-8%        low-risk repos
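
The gate logic described above can be sketched roughly as follows. This is an illustrative reading, not the system's actual implementation: the function names, the LOW/MEDIUM/HIGH cutoffs, and the returned dictionary are assumptions. Only the 0.20 spread threshold, the three agreement modes, and the demo scores come from the text. The demo's 0.015 figure matches the sample standard deviation of the three scores, so the sketch uses that.

```python
from statistics import mean, stdev

STD_THRESHOLD = 0.20  # from the demo: std deviation must stay below 0.20


def risk_label(score: float) -> str:
    # Hypothetical cutoffs; the source only shows 0.72-0.75 labelled HIGH.
    if score >= 0.70:
        return "HIGH"
    if score >= 0.40:
        return "MEDIUM"
    return "LOW"


def consensus_gate(scores: dict[str, float], mode: str = "majority") -> dict:
    """Decide whether independent model scores agree well enough to continue."""
    values = list(scores.values())
    labels = [risk_label(v) for v in values]
    spread = stdev(values)  # sample standard deviation across models
    top_votes = max(labels.count(label) for label in set(labels))

    if mode == "unanimous":
        labels_agree = top_votes == len(labels)
    elif mode == "majority":
        labels_agree = top_votes * 2 > len(labels)  # e.g. 2 of 3
    elif mode == "weighted":
        labels_agree = True  # this mode checks the spread only
    else:
        raise ValueError(f"unknown agreement mode: {mode}")

    consensus = labels_agree and spread < STD_THRESHOLD
    return {
        "consensus": consensus,
        "force_hitl": not consensus,  # disagreement escalates regardless of score
        "std": round(spread, 3),
        "score": round(mean(values), 2),
    }


demo = consensus_gate(
    {"GPT-4o": 0.72, "Claude Sonnet 4.6": 0.75, "Gemini 1.5 Pro": 0.74}
)
print(demo)
```

In this sketch a wide spread forces escalation even when two of the three labels match, which mirrors the idea that disagreement itself is the safety signal.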

71% of consensus-forced HITL activations were independently assessed as genuinely ambiguous PRs.

Phase 2 — 8 Parallel Specialist Agents

Once a PR clears the consensus and HITL gates, all 8 test agents execute in parallel via the LangGraph Send API.

🌐 API Agent
🎭 Playwright
Performance
🔒 Security
💾 Data
🌐 Browser
A11y
🔁 Regression
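
The fan-out pattern behind Phase 2 can be illustrated with a small stand-in. The sketch below uses Python's asyncio rather than the LangGraph Send API, so it shows only the shape of the parallel dispatch, not the system's actual graph. The agent identifiers, the `run_agent` placeholder, and the `"PR-demo"` value are hypothetical.

```python
import asyncio

# Hypothetical identifiers for the eight specialist agents listed above.
AGENTS = [
    "api", "playwright", "performance", "security",
    "data", "browser", "a11y", "regression",
]


async def run_agent(name: str, pr_id: str) -> dict:
    # Placeholder for real agent work (test generation, execution, reporting).
    await asyncio.sleep(0)
    return {"agent": name, "pr": pr_id, "status": "completed"}


async def run_phase_two(pr_id: str) -> list[dict]:
    # Dispatch every agent concurrently and collect results in agent order.
    return await asyncio.gather(*(run_agent(name, pr_id) for name in AGENTS))


results = asyncio.run(run_phase_two("PR-demo"))
print([r["agent"] for r in results])
```

`asyncio.gather` returns results in the order the coroutines were passed, so Phase 3 reporting could rely on a stable agent order under this design.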
② AutoRemediationAgent — Phase 2.5

For each safe-class defect, the agent fetches the source file, generates a unified-diff patch with a self-assessed confidence score, and opens a remediation PR. Patches below 0.80 confidence are skipped.

accessibility_violation (missing aria-label): Generating patch...
missing_docstring (UserAuthService.validate()): Queued
unused_import (import os): Queued
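
The filtering step described above could look roughly like the sketch below. It assumes the three defect classes shown in the demo make up the safe set, and the `Patch` type, the helper names, and the example file path are hypothetical. Only the 0.80 confidence cutoff and the unified-diff format come from the text. The diff is produced with the standard-library `difflib` module.

```python
import difflib
from dataclasses import dataclass

# Assumption: the safe classes are the three shown in the demo panel.
SAFE_CLASSES = {"accessibility_violation", "missing_docstring", "unused_import"}
MIN_CONFIDENCE = 0.80  # from the text: patches below 0.80 are skipped


@dataclass
class Patch:
    defect_class: str
    path: str
    diff: str
    confidence: float  # self-assessed by the generating model


def make_unified_diff(path: str, before: str, after: str) -> str:
    """Build a unified diff between two versions of one file."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


def select_patches(patches: list[Patch]) -> list[Patch]:
    # Keep only safe-class patches that meet the confidence cutoff.
    return [
        p for p in patches
        if p.defect_class in SAFE_CLASSES and p.confidence >= MIN_CONFIDENCE
    ]


before = "import os\nimport sys\n\nprint(sys.argv)\n"
after = "import sys\n\nprint(sys.argv)\n"
diff = make_unified_diff("app.py", before, after)  # hypothetical file name

candidates = [
    Patch("unused_import", "app.py", diff, 0.93),
    Patch("missing_docstring", "app.py", "", 0.62),  # skipped: low confidence
    Patch("sql_injection", "app.py", "", 0.99),      # skipped: not a safe class
]
accepted = select_patches(candidates)
print(diff)
```

Keeping the class allow-list separate from the confidence check means a highly confident patch for a risky defect class is still left to developers.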
③ Adaptive HITL Threshold Learner

After each pipeline run, the reviewer's decision updates the threshold via an exponential moving average (EMA, alpha = 0.05), bounded within safety constraints [0.70, 0.95]. The system learns the team's true risk tolerance automatically.

Current threshold (theta): 0.85
Safety floor: 0.70
Safety ceiling: 0.95

The threshold converges to an MAE of 0.018 from the reviewers' true preferred threshold within 80 decisions. The escape rate remains 0% under all scenarios, including distribution shift.
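
A bounded EMA update of this kind might be sketched as follows. The source gives alpha = 0.05 and the [0.70, 0.95] bounds but does not say how a reviewer decision is turned into an update signal, so the `target` argument here is an assumption: it stands for whatever threshold value the latest decision implies. The 0.75 preference in the usage lines is likewise a made-up value for illustration.

```python
ALPHA = 0.05                 # EMA smoothing factor from the text
FLOOR, CEILING = 0.70, 0.95  # safety bounds from the text


def update_threshold(theta: float, target: float, alpha: float = ALPHA) -> float:
    """One EMA step toward `target`, clamped to the safety bounds.

    `target` is a hypothetical per-decision signal; the source does not
    specify how approvals and rejections map onto it.
    """
    ema = (1 - alpha) * theta + alpha * target
    return min(CEILING, max(FLOOR, ema))


# Illustration: start at the demo's 0.85 and feed 80 decisions that all
# imply a preferred threshold of 0.75 (an invented value).
theta = 0.85
for _ in range(80):
    theta = update_threshold(theta, target=0.75)
print(round(theta, 4))
```

The clamp runs after every step, so no sequence of decisions can push the threshold outside the safety floor or ceiling, which is the property the bounded design relies on.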

Result: A Self-Improving QA Pipeline

F1: 0.927 (vs 0.913)
Fewer HITL alerts: 34%
Escape rate increase: 0%

Open-source — github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system   |   doi:10.5281/zenodo.20551586