Research Preview · June 2026
Autonomous CI/CD Quality Assurance
Using LangGraph Multi-Agent
Orchestration & Risk-Proportionate HITL
A production-grade agentic AI system that autonomously executes the complete CI/CD quality gate — from pull request analysis to defect filing — without manual QA intervention for routine changes.
Kavita Jadhav · K11 Software Solutions LLC
DOI: 10.5281/zenodo.20543872
LangGraph · 14 Agents · 7 MCP Servers · Human-in-the-Loop · DeepEval + RAGAS · CI/CD
The Problem with Conventional CI/CD Quality Gates
Three structural deficiencies that limit software quality at scale

⚠️ Static Test Scope

Identical checks regardless of change risk. A one-line fix and a complete auth refactor receive the same treatment — wasting resources on low-risk changes and missing depth on high-risk ones.

⏱️ Sequential Execution

Independent test suites run in series. 8 suites × 8 minutes = 64 minutes per PR. In high-frequency pipelines this bottleneck directly throttles delivery velocity.

👤 Manual Coordination

QA engineers context-switch between test execution, result interpretation, defect documentation, and stakeholder communication — introducing delays and human error.

The result: development throughput scales with tooling; QA throughput scales with human attention. In high-velocity organisations, pull request volumes routinely exceed QA team capacity — forcing triage decisions that introduce systemic risk.
System Architecture — Four-Phase Pipeline
LangGraph StateGraph with shared CIPipelineState flowing through all nodes

🔍 PHASE 1 — Analysis

fetch_pr · retrieve_knowledge · score_risk · generate_test_plan

🛡️ HITL Gate

risk ≥ 0.85 → suspend + notify reviewer

⚙️ PHASE 2 — Parallel Agents

8 agents via Send API · concurrent execution

✅ EVALUATION — DeepEval + RAGAS

quality_gate → APPROVE / BLOCK / NEEDS_REVIEW

Shared State Design

CIPipelineState TypedDict with 24 fields. Phase 2 fields use Annotated[list, operator.add] reducers — lock-free concurrent writes from multiple agents.
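
A minimal sketch of the reducer idea, using only the standard library so it runs without LangGraph installed. The field names (`pr_number`, `risk_score`, `agent_results`) and the `apply_update` helper are illustrative assumptions, not the project's actual 24-field schema. `apply_update` only imitates how a graph runtime could merge a node's partial update into shared state.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class CIPipelineState(TypedDict, total=False):
    # Hypothetical subset of fields; the real state has 24.
    pr_number: int
    risk_score: float
    # Fields written by parallel agents carry a reducer in their annotation.
    agent_results: Annotated[list, operator.add]


def apply_update(state: dict, update: dict) -> dict:
    """Merge a partial update: reduce annotated fields, overwrite the rest."""
    hints = get_type_hints(CIPipelineState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints.get(key), "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)  # e.g. list + list
        else:
            merged[key] = value
    return merged


state = {"pr_number": 42, "risk_score": 0.91, "agent_results": []}
state = apply_update(state, {"agent_results": [{"agent": "api", "passed": True}]})
state = apply_update(state, {"agent_results": [{"agent": "security", "passed": False}]})
print([r["agent"] for r in state["agent_results"]])  # ['api', 'security']
```

Because each agent returns only its own list and the reducer concatenates, no agent needs to read or lock what the others wrote.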

Persistence

SQLite (dev) or PostgresSaver (prod) checkpoint store. The pipeline survives process restarts, and a HITL suspension can remain open indefinitely.

Deployment

Single docker compose up command. FastAPI webhook on port 9000. All 7 MCP servers containerised.

Phase 1 — Intelligent Analysis
Risk-proportionate test planning using LLM analysis + knowledge retrieval

1️⃣ fetch_pr_context

Calls GitHub MCP server to retrieve: PR diff, changed file list, commit messages, PR metadata, open review comments.

2️⃣ retrieve_knowledge

Semantic search over Knowledge Store MCP: historical defect records, regression suites, domain-specific testing guidelines for affected components.

3️⃣ score_risk (LLM)

GPT-4o-mini with temperature=0. Produces risk score [0.0–1.0] considering: lines changed, files affected, historical defect rate, security-sensitive patterns, test coverage gaps.

4️⃣ generate_test_plan (LLM)

Uses risk score + change analysis + knowledge context to select which test suites to activate. Low-risk: API + regression only. High-risk: all 8 suites.

# Risk score output example
{
  "risk_score": 0.91,
  "risk_factors": ["auth module modified", "SQL query changed", "3 historical incidents"],
  "activated_suites": ["api", "e2e", "security", "data", "regression"],
  "hitl_required": true   # risk >= 0.85
}
Human-in-the-Loop Gate
Risk-proportionate escalation — only triggers when risk score ≥ 0.85 (configurable)
🔴 risk_score ≥ 0.85 → ⏸️ interrupt() suspends pipeline → 📨 Slack alert to reviewer → ✅ POST /hitl/decision → ▶️ resume from checkpoint
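
The escalation rule can be sketched as a plain routing function of the kind a conditional edge might call. The node names `hitl_gate` and `parallel_agents` and the `threshold` parameter are assumptions for illustration; the source states only that the 0.85 cutoff is configurable.

```python
DEFAULT_HITL_THRESHOLD = 0.85


def route_after_analysis(state: dict, threshold: float = DEFAULT_HITL_THRESHOLD) -> str:
    """Return the (hypothetical) name of the next node for a scored PR."""
    if state["risk_score"] >= threshold:
        return "hitl_gate"       # suspend and notify a reviewer
    return "parallel_agents"     # routine change: go straight to Phase 2


print(route_after_analysis({"risk_score": 0.91}))  # hitl_gate
print(route_after_analysis({"risk_score": 0.40}))  # parallel_agents
```

Keeping the comparison in one small function makes the boundary case explicit: a score of exactly 0.85 escalates.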

State Preservation

Full CIPipelineState serialised to checkpoint DB at suspension. Process restarts during review window are fully tolerated. No recomputation on resume.
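
To make the save-then-resume idea concrete, here is a standard-library analogy, not LangGraph's checkpointer. The `checkpoints` table, the two helper functions, and the `status` field are all hypothetical; a real checkpointer stores more than a single JSON blob.

```python
import json
import sqlite3


def save_checkpoint(conn, run_id, state):
    """Persist the full state for a run, replacing any earlier snapshot."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
        (run_id, json.dumps(state)),
    )
    conn.commit()


def load_checkpoint(conn, run_id):
    """Return the stored state for a run, or None if there is none."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")  # a file path would persist across restarts
conn.execute("CREATE TABLE checkpoints (run_id TEXT PRIMARY KEY, state TEXT)")

suspended = {"run_id": "abc123", "risk_score": 0.91, "status": "awaiting_review"}
save_checkpoint(conn, "abc123", suspended)

# Later, possibly in a new process: load the state and continue from it.
resumed = load_checkpoint(conn, "abc123")
resumed["status"] = "approved"
print(resumed["status"])
```

Nothing from Phase 1 is recomputed on resume because the risk score and test plan are already part of the stored state.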

Slack Notification

Includes: PR details, risk score, risk factors, reviewer instructions. Reviewer approves/rejects via /qa-approve <run_id> or REST API.

Audit Trail

Reviewer identity, decision, timestamp, and comment recorded immutably in pipeline state. Included in final report and all Jira tickets.

# Resume via REST API
curl -X POST http://localhost:9000/api/hitl/decision \
  -H "Content-Type: application/json" \
  -d '{"run_id":"abc123","decision":"approve","reviewer":"qa-lead","comment":"Auth changes reviewed"}'
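
The same decision call can be prepared from Python. This sketch only builds the request and does not send it. The helper name `build_hitl_decision_request` is hypothetical, and the JSON content type is an assumption about what the webhook expects. The URL and payload fields are taken from the curl example.

```python
import json
import urllib.request


def build_hitl_decision_request(base_url, run_id, decision, reviewer, comment):
    """Build (but do not send) the POST request for a reviewer decision."""
    payload = {
        "run_id": run_id,
        "decision": decision,
        "reviewer": reviewer,
        "comment": comment,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/hitl/decision",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_hitl_decision_request(
    "http://localhost:9000", "abc123", "approve", "qa-lead", "Auth changes reviewed"
)
print(request.get_method(), request.full_url)
# To send it against a running pipeline: urllib.request.urlopen(request)
```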
Phase 2 — 8 Agents Running in Parallel
LangGraph Send API fans out to all activated agents concurrently — no locks, no message passing
🔌 APITestAgent · Playwright MCP
🌐 E2ETestAgent · Playwright MCP
PerformanceAgent · k6 MCP
🔒 SecurityAgent · GitHub MCP
🗄️ DataValidationAgent · Postgres MCP
🖥️ CrossBrowserAgent · Playwright MCP
AccessibilityAgent · Playwright MCP
🔄 RegressionAgent · Knowledge Store MCP

Lock-Free Reducer Pattern

Annotated[list[AgentResult], operator.add] — all agents write results to shared state simultaneously. No coordination overhead. All results guaranteed present before Phase 3.
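
The fan-out and merge behaviour can be imitated with `asyncio`. This is an analogy, not the Send API itself. `run_agent`, `run_phase_two`, and the delays are placeholders invented for the sketch.

```python
import asyncio
import operator
from functools import reduce


async def run_agent(name: str, delay_s: float) -> list:
    """Stand-in for a test agent: wait, then return a one-item result list."""
    await asyncio.sleep(delay_s)
    return [{"agent": name, "passed": True}]


async def run_phase_two(activated: dict) -> list:
    # Start every activated agent at once and wait for all of them.
    partials = await asyncio.gather(
        *(run_agent(name, delay) for name, delay in activated.items())
    )
    # Combine the per-agent lists the way an operator.add reducer would.
    return reduce(operator.add, partials, [])


# Placeholder delays in seconds, chosen only to show differing durations.
results = asyncio.run(
    run_phase_two({"api": 0.01, "security": 0.03, "regression": 0.02})
)
print(len(results))  # 3
```

`gather` returns only once every agent has finished, which mirrors the guarantee that all results are present before the next phase starts.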

Execution Time Impact

Phase 2 duration equals the slowest single agent (PerformanceAgent, ~2.5 min) rather than the sum of all agents (~18.4 min serial). That is roughly an 86% reduction for this phase; the 87% headline figure is the end-to-end reduction.
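
A short worked calculation from the figures quoted in this document: the Phase 2 timings above, and the 63.4 min and 8.3 min end-to-end means from the results section. The `reduction` helper is just for illustration.

```python
def reduction(baseline_min: float, actual_min: float) -> float:
    """Fractional time saved relative to the baseline."""
    return 1 - actual_min / baseline_min


phase2 = reduction(18.4, 2.5)       # serial sum vs slowest agent
end_to_end = reduction(63.4, 8.3)   # sequential CI vs mean pipeline run
print(f"Phase 2: {phase2:.1%}, end to end: {end_to_end:.1%}")
# Phase 2: 86.4%, end to end: 86.9%
```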

MCP Integration Layer
7 Model Context Protocol servers decouple agent logic from external service implementations
GitHub MCP · :8001 · PRContextFetcher, SecurityAgent
Playwright MCP · :8002 · E2E, CrossBrowser, A11y, API
k6 MCP · :8003 · PerformanceAgent
Jira MCP · :8004 · DefectFilerAgent
Postgres MCP · :8005 · DataValidationAgent
Slack MCP · :8006 · ReportGenerator, HITL alerts
Knowledge Store MCP · :8007 · KnowledgeRetriever, Regression

Agent call — no SDK coupling

result = await self.call_mcp(
    "jira", "create_issue",
    {"title": defect.title, "severity": "HIGH"}
)  # zero SDK imports

✅ Service Substitution — Zero Agent Changes

Jira → Linear: new MCP server + 1 env var update. Postgres → MySQL: same. All 14 agents execute unchanged. Confirmed in evaluation.
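
One way the substitution could work is sketched below under stated assumptions: the `*_MCP_URL` variable names, the `localhost` hosts, and both helper functions are hypothetical. Only the ports come from the server list above. The message shape follows MCP's JSON-RPC 2.0 `tools/call` request. How the project's own `call_mcp` transports it is not shown in the source.

```python
import json
import os

# Default endpoints use the documented ports; hosts are assumed.
DEFAULT_SERVERS = {
    "jira": "http://localhost:8004",
    "postgres": "http://localhost:8005",
}


def resolve_server(name: str, env=os.environ) -> str:
    """Look up a server URL, letting an env var override the default."""
    return env.get(f"{name.upper()}_MCP_URL", DEFAULT_SERVERS[name])


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Serialise an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


message = build_tool_call("create_issue", {"title": "Login fails", "severity": "HIGH"})
print(resolve_server("jira", env={}))
# Pointing the same logical name at a different tracker's MCP server:
print(resolve_server("jira", env={"JIRA_MCP_URL": "http://linear-mcp:8004"}))
```

The agent keeps asking for the logical server "jira" and the tool "create_issue"; only the resolved URL changes.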

🔍 Transport-Layer Audit

Every external interaction logged at the MCP transport layer — independent of agent-level logging. Complete audit trail out of the box.

Self-Evaluation Layer
The system audits the quality of its own LLM-generated outputs on every run

DeepEval Metrics

  • AnswerRelevancy ≥ 0.70 — report addresses actual defects
  • Faithfulness ≥ 0.80 — claims grounded in test results
  • HallucinationRate ≤ 0.20 — no unsupported assertions

RAGAS Metrics

  • context_precision ≥ 0.55 — retrieved docs relevant
  • faithfulness ≥ 0.75 — risk assessment grounded
  • answer_relevancy ≥ 0.70 — test plan appropriate
Result: 108/120 runs passed all thresholds (90.0%). All 12 failures corresponded to runs subsequently assessed by human reviewers as containing report inaccuracies — confirming the evaluation layer's discriminative ability.
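
The thresholds listed above can be expressed as a small gate. This is a sketch only: the key names, the `failed_metrics` helper, and the example scores are made up for illustration and are not values from the experiment. In the real system the scores would come from DeepEval and RAGAS.

```python
import operator

# metric key -> (comparison, threshold), using the limits listed above.
THRESHOLDS = {
    "deepeval.answer_relevancy": (operator.ge, 0.70),
    "deepeval.faithfulness": (operator.ge, 0.80),
    "deepeval.hallucination": (operator.le, 0.20),
    "ragas.context_precision": (operator.ge, 0.55),
    "ragas.faithfulness": (operator.ge, 0.75),
    "ragas.answer_relevancy": (operator.ge, 0.70),
}


def failed_metrics(scores: dict) -> list:
    """Return the metrics that are missing or miss their threshold."""
    failures = []
    for name, (compare, limit) in THRESHOLDS.items():
        score = scores.get(name)
        if score is None or not compare(score, limit):
            failures.append(name)
    return failures


# Made-up scores for illustration only.
example = {
    "deepeval.answer_relevancy": 0.82,
    "deepeval.faithfulness": 0.78,
    "deepeval.hallucination": 0.10,
    "ragas.context_precision": 0.61,
    "ragas.faithfulness": 0.80,
    "ragas.answer_relevancy": 0.74,
}
print(failed_metrics(example))  # ['deepeval.faithfulness']
```

A run counts as passing only when the returned list is empty. Treating a missing score as a failure keeps a skipped evaluation from slipping through.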

Key property: The system can express uncertainty about its own outputs — a property absent from conventional CI pipelines that emit binary pass/fail signals.
Eval Pass Rate: 90.0%
Empirical Results
120 pull requests · 3 repositories · 6 months (Jan–Jun 2026)
8.3 min · Mean execution time · vs 63.4 min sequential
87% · Time reduction · p < 0.001 (Wilcoxon)
91.2% · Defect detection rate · F1 = 0.913
0 · Production escapes · during evaluation period
K11tech Agentic AI QA (parallel): 8.3 min · 87% faster
Sequential-CI: 63.4 min · baseline
Defect Detection: K11tech (91.2%) vs Static-Risk (67.3%) · +23.9pp
Agent portability: 0 code changes for Jira→Linear, Postgres→MySQL · 100%
K11 Tech Lab Agentic AI QA Demo · Open Source · Published · Reproducible
Everything needed to deploy, study, or extend the system

📄 Research Paper

Peer-reviewable preprint with full experimental methodology, results, and threat analysis.

doi.org/10.5281/zenodo.20543872

💻 Source Code

Complete implementation: 14 agents, 7 MCP servers, evaluation harness, Docker Compose.

github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system

🔬 Evaluation Harness

4 scripts reproduce the full 120-PR experiment: scaffold repos → label dataset → run pipeline → analyse results (RQ1–RQ4).

🚀 Quick Start

git clone github.com/K11-Software-Solutions/...
cp .env.example .env # add API keys
docker compose up -d
# pipeline live on :9000

📚 Want to learn how to build this? Take the LangGraph Course

11 interactive modules covering LangGraph fundamentals, state management, HITL, parallel execution, MCP integration, observability with LangSmith, and LLM evaluation — building up to this exact system as the capstone project.

Contact
kavita.jadhav@k11softwaresolutions.com
LangGraph · MCP · Agentic AI CI/CD · HITL · DeepEval · RAGAS