Beyond Static Gates: Detect-Fix-Learn Demo

The Detect – Fix – Learn Loop

The K11tech Agentic QA System (Paper 1) achieved 91.2% defect detection, but it relied on a single LLM for risk scoring, used a fixed escalation threshold, and left every fix to developers. This paper closes the loop with three interlocking innovations.

🔍 DETECT: Consensus Gate ➡ 🔧 FIX: Auto-Remediation ➡ 🧠 LEARN: Adaptive Threshold ↺

Full Pipeline Flow

GitHub PR ➡ Phase 1: Analysis ➡ 🆔 Consensus Gate (NEW) ➡ HITL Gate

↓

Phase 2: 8 Agents ➡ 🔧 Phase 2.5: Remediation (NEW) ➡ Phase 3: Report

↓

🧠 Threshold Learner (NEW) ↺ feedback

F1 Score: 0.927 (vs 0.913 baseline)

Reduction in unnecessary HITL: 34%

Auto-patch acceptance rate: 89.3%

① Multi-LLM Consensus Gate — Epistemic Uncertainty as a Safety Signal

Instead of trusting a single model, three LLMs independently score each PR. When they disagree, that disagreement itself is the uncertainty signal — forcing HITL escalation regardless of the individual scores.

GPT-4o: 0.72 (HIGH risk)

Claude Sonnet 4.6: 0.75 (HIGH risk)

Gemini 1.5 Pro: 0.74 (HIGH risk)

Std deviation: 0.015 | Threshold: < 0.20

Agreement mode: majority (2 of 3)

If the models agree: ✓ CONSENSUS REACHED. Pipeline continues. Averaged score: 0.74 (HIGH)

If the models disagree: ⚠ MODELS DISAGREE. Epistemic uncertainty detected. Forcing HITL escalation regardless of risk score.
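The check shown in the panel above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function name `consensus_check` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, chosen because it reproduces the displayed value of 0.015 for these three scores.

```python
from statistics import mean, stdev

STD_THRESHOLD = 0.20  # disagreement limit shown in the demo panel


def consensus_check(scores, std_threshold=STD_THRESHOLD):
    """Return (consensus_reached, mean_score, spread) for a list of risk scores.

    Assumption: spread is the sample standard deviation (n - 1 denominator).
    """
    spread = stdev(scores)
    return spread < std_threshold, mean(scores), spread


# Scores from the example above
scores = {"GPT-4o": 0.72, "Claude Sonnet 4.6": 0.75, "Gemini 1.5 Pro": 0.74}
ok, avg, spread = consensus_check(list(scores.values()))
print(f"consensus={ok} mean={avg:.2f} std={spread:.3f}")
# prints: consensus=True mean=0.74 std=0.015
```

With widely scattered scores such as 0.2, 0.9, and 0.5, the spread exceeds 0.20 and the same function reports no consensus, which is the case that forces HITL escalation.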

Agreement Mode Guide

Mode | Condition | HITL rate | Best for

Unanimous | All 3 agree + std < 0.20 | 20-30% | research, high-security

Majority ★ | 2 of 3 agree + std < 0.20 | 10-15% | production CI/CD

Weighted | std < 0.20 only | 5-8% | low-risk repos

71% of consensus-forced HITL activations were independently assessed as genuinely ambiguous PRs.
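One way to read the three agreement modes is sketched below. The source does not say what "agree" means, so this sketch assumes the models agree when their scores fall into the same risk band; the band cutoffs (0.40 and 0.70) and the names `risk_label` and `consensus_reached` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import stdev

STD_THRESHOLD = 0.20                   # from the guide above
MEDIUM_CUTOFF, HIGH_CUTOFF = 0.40, 0.70  # hypothetical risk bands


def risk_label(score):
    """Map a score to a risk band (assumed banding, not from the source)."""
    if score >= HIGH_CUTOFF:
        return "HIGH"
    return "MEDIUM" if score >= MEDIUM_CUTOFF else "LOW"


def consensus_reached(scores, mode="majority"):
    """Apply one of the three agreement modes to a list of three scores."""
    if stdev(scores) >= STD_THRESHOLD:
        return False  # every mode requires std < 0.20
    if mode == "weighted":
        return True   # spread check only
    # Count how many models share the most common risk band
    top_votes = Counter(risk_label(s) for s in scores).most_common(1)[0][1]
    required = {"unanimous": len(scores), "majority": 2}[mode]
    return top_votes >= required


split = [0.38, 0.55, 0.72]   # LOW, MEDIUM, HIGH with a spread of 0.17
for mode in ("unanimous", "majority", "weighted"):
    print(mode, consensus_reached(split, mode))
```

Under these assumptions the `split` scores pass only the weighted mode, which matches the ordering of HITL rates in the guide: the stricter the mode, the more PRs are escalated.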

Phase 2 — 8 Parallel Specialist Agents

Once a PR has been scored by the Consensus Gate and has cleared the HITL gate, all 8 test agents execute in parallel via the LangGraph Send API.

🌐 API Agent | 🎭 Playwright | ⚡ Performance | 🔒 Security | 💾 Data | 🌐 Browser | ♿ A11y | 🔁 Regression
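The fan-out pattern can be illustrated without LangGraph. The sketch below deliberately swaps in `asyncio.gather` from the standard library as a stand-in for the Send API, so it shows only the shape of the parallel step. The agent identifiers, the `run_agent` stub, and the `"PR-example"` identifier are hypothetical.

```python
import asyncio

# Short identifiers for the eight agents listed above (names are illustrative)
AGENTS = ["api", "playwright", "performance", "security",
          "data", "browser", "a11y", "regression"]


async def run_agent(name: str, pr_id: str) -> dict:
    """Stub agent: a real agent would run its test suite against the PR."""
    await asyncio.sleep(0.01)  # placeholder for real work
    return {"agent": name, "pr": pr_id, "defects": []}


async def run_phase_two(pr_id: str) -> list:
    """Start all agents concurrently and collect results in agent order."""
    return await asyncio.gather(*(run_agent(name, pr_id) for name in AGENTS))


results = asyncio.run(run_phase_two("PR-example"))
print(len(results), "agent reports collected")
```

`asyncio.gather` returns results in the order the coroutines were passed, so each report can be matched to its agent even though execution is interleaved.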

② AutoRemediationAgent — Phase 2.5

For each safe-class defect, the agent fetches the source file, generates a unified-diff patch with a self-assessed confidence score, and opens a remediation PR. Patches below 0.80 confidence are skipped.

accessibility_violation: missing aria-label (Generating patch...)

missing_docstring: UserAuthService.validate() (Queued)

unused_import: import os (Queued)
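The filtering step described above can be sketched as follows. This is an illustration under stated assumptions: `SAFE_CLASSES` lists only the three defect classes shown in the demo (the full set is not given in the source), and the `Patch` dataclass, the file path, and the confidence value 0.93 are hypothetical. The diff is produced with the standard library's `difflib.unified_diff`, which may differ from how the agent generates patches.

```python
import difflib
from dataclasses import dataclass

# The three classes shown in the demo; the complete safe set is an assumption
SAFE_CLASSES = {"accessibility_violation", "missing_docstring", "unused_import"}
MIN_CONFIDENCE = 0.80  # patches below this confidence are skipped


@dataclass
class Patch:
    defect_class: str
    path: str
    diff: str
    confidence: float  # self-assessed by the agent


def make_diff(path: str, before: str, after: str) -> str:
    """Build a unified diff between two versions of a file."""
    return "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))


def should_open_pr(patch: Patch) -> bool:
    """Open a remediation PR only for safe-class, high-confidence patches."""
    return patch.defect_class in SAFE_CLASSES and patch.confidence >= MIN_CONFIDENCE


before = "import os\nimport sys\n\nprint(sys.argv)\n"
after = "import sys\n\nprint(sys.argv)\n"
patch = Patch("unused_import", "app/cli.py",
              make_diff("app/cli.py", before, after), confidence=0.93)
print(patch.diff)
print("open PR:", should_open_pr(patch))
```

Keeping the confidence gate separate from patch generation means the cutoff can be tuned without touching the code that produces the diff.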

③ Adaptive HITL Threshold Learner

After each pipeline run, the reviewer's decision updates the threshold via an exponential moving average (EMA, alpha = 0.05), bounded within safety constraints [0.70, 0.95]. The system learns the team's true risk tolerance automatically.

Current threshold (theta): 0.85

Safety floor: 0.70

Safety ceiling: 0.95

The threshold converges to within MAE = 0.018 of the team's true preferred value in 80 decisions. The escape rate remains 0% under all scenarios, including distribution shift.
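A bounded EMA update of this kind can be sketched as below. The source gives alpha = 0.05 and the bounds [0.70, 0.95] but not the signal the average tracks, so the target used here is an assumption: an approval pulls theta toward the ceiling and a rejection pulls it toward the floor. The function name `update_threshold` is hypothetical.

```python
ALPHA = 0.05                 # EMA smoothing factor from the text
FLOOR, CEILING = 0.70, 0.95  # safety constraints from the text


def update_threshold(theta: float, approved: bool, alpha: float = ALPHA) -> float:
    """Apply one EMA step, then clamp the result to the safety bounds.

    Assumption: the target is CEILING on approval and FLOOR on rejection.
    The source does not state the target signal or its direction.
    """
    target = CEILING if approved else FLOOR
    theta = (1 - alpha) * theta + alpha * target
    # Defensive clamp: guards against rounding drift or out-of-range input
    return min(CEILING, max(FLOOR, theta))


theta = 0.85  # starting threshold shown above
for approved in [True, True, False, True]:
    theta = update_threshold(theta, approved)
    print(f"approved={approved} theta={theta:.3f}")
```

With alpha = 0.05, a single decision moves theta by at most 5% of its distance to the target, so one unusual review cannot swing the gate, and the clamp keeps theta inside the safety bounds regardless of the decision history.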

Result: A Self-Improving QA Pipeline

F1: 0.927 (vs 0.913)

Fewer HITL alerts: 34%

Escape rate increase: 0%

Open-source — github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system | doi:10.5281/zenodo.20551586