Research Preview · June 2026
Autonomous CI/CD Quality Assurance Using LangGraph Multi-Agent Orchestration & Risk-Proportionate HITL
A production-grade agentic AI system that autonomously executes the complete CI/CD quality gate — from pull request analysis to defect filing — without manual QA intervention for routine changes.
LangGraph · 14 Agents · 7 MCP Servers · Human-in-the-Loop · DeepEval + RAGAS · CI/CD
The Problem with Conventional CI/CD Quality Gates
Three structural deficiencies that limit software quality at scale
⚠️ Static Test Scope
Identical checks regardless of change risk. A one-line fix and a complete auth refactor receive the same treatment — wasting resources on low-risk changes and missing depth on high-risk ones.
⏱️ Sequential Execution
Independent test suites run in series. 8 suites × 8 minutes = 64 minutes per PR. In high-frequency pipelines this bottleneck directly throttles delivery velocity.
👤 Manual Coordination
QA engineers context-switch between test execution, result interpretation, defect documentation, and stakeholder communication — introducing delays and human error.
The result: development throughput scales with tooling; QA throughput scales with human attention. In high-velocity organisations, pull request volumes routinely exceed QA team capacity — forcing triage decisions that introduce systemic risk.
System Architecture — Four-Phase Pipeline
LangGraph StateGraph with shared CIPipelineState flowing through all nodes
⚡ GitHub PR Webhook Trigger
pull_request event → pipeline start
▼
🔍 PHASE 1 — Analysis
fetch_pr · retrieve_knowledge · score_risk · generate_test_plan
▼
🛡️ HITL Gate
risk ≥ 0.85 → suspend + notify reviewer
▼
⚙️ PHASE 2 — Parallel Agents
8 agents via Send API · concurrent execution
▼
📊 PHASE 3 — Reporting
aggregate · report · Jira tickets · Slack
▼
✅ EVALUATION — DeepEval + RAGAS
quality_gate → APPROVE / BLOCK / NEEDS_REVIEW
Shared State Design
CIPipelineState TypedDict with 24 fields. Phase 2 fields use Annotated[list, operator.add] reducers — lock-free concurrent writes from multiple agents.
Persistence
Checkpoint store: SQLite (dev) or Postgres via PostgresSaver (prod). The pipeline survives process restarts, and a HITL suspension can last indefinitely.
Deployment
Single docker compose up command. FastAPI webhook on port 9000. All 7 MCP servers containerised.
Phase 1 — Intelligent Analysis
Risk-proportionate test planning using LLM analysis + knowledge retrieval
1️⃣ fetch_pr_context
Calls GitHub MCP server to retrieve: PR diff, changed file list, commit messages, PR metadata, open review comments.
2️⃣ retrieve_knowledge
Semantic search over Knowledge Store MCP: historical defect records, regression suites, domain-specific testing guidelines for affected components.
3️⃣ score_risk (LLM)
GPT-4o-mini with temperature=0. Produces risk score [0.0–1.0] considering: lines changed, files affected, historical defect rate, security-sensitive patterns, test coverage gaps.
4️⃣ generate_test_plan (LLM)
Uses risk score + change analysis + knowledge context to select which test suites to activate. Low-risk: API + regression only. High-risk: all 8 suites.
# Risk score output example
{
  "risk_score": 0.91,
  "risk_factors": ["auth module modified", "SQL query changed", "3 historical incidents"],
  "activated_suites": ["api", "e2e", "security", "performance", "data", "cross_browser", "accessibility", "regression"],
  "hitl_required": true # risk >= 0.85
}
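A minimal sketch of how a payload shaped like the example above might be validated before it drives the pipeline. The `RiskAssessment` type and `parse_risk_output` function are hypothetical names, and deriving `hitl_required` from the score (rather than trusting the field emitted by the LLM) is an assumption about a sensible design, not a description of the published code. The input must be plain JSON, without the inline comment shown in the illustration.

```python
import json
from dataclasses import dataclass

HITL_THRESHOLD = 0.85  # the configurable threshold quoted in the text


@dataclass(frozen=True)
class RiskAssessment:
    risk_score: float
    risk_factors: tuple
    activated_suites: tuple
    hitl_required: bool


def parse_risk_output(raw: str, threshold: float = HITL_THRESHOLD) -> RiskAssessment:
    """Parse and sanity-check the risk scorer's JSON output."""
    data = json.loads(raw)
    score = float(data["risk_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"risk_score out of range: {score}")
    return RiskAssessment(
        risk_score=score,
        risk_factors=tuple(data.get("risk_factors", [])),
        activated_suites=tuple(data.get("activated_suites", [])),
        # Recompute the flag from the score so it cannot drift from the threshold.
        hitl_required=score >= threshold,
    )


if __name__ == "__main__":
    raw = '{"risk_score": 0.91, "risk_factors": ["auth module modified"], "activated_suites": ["api", "security"]}'
    print(parse_risk_output(raw).hitl_required)
```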
Human-in-the-Loop Gate
Risk-proportionate escalation — only triggers when risk score ≥ 0.85 (configurable)
🔴 risk_score ≥ 0.85 → ⏸️ interrupt() suspends pipeline → 📨 Slack alert to reviewer → ✅ POST /api/hitl/decision → ▶️ resume from checkpoint
State Preservation
Full CIPipelineState serialised to checkpoint DB at suspension. Process restarts during review window are fully tolerated. No recomputation on resume.
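The idea of serialising state at suspension and reloading it on resume can be sketched with the standard library alone. This is a hand-rolled illustration, not LangGraph's checkpointer: the table layout and the function names `open_store`, `save_checkpoint`, and `load_checkpoint` are assumptions, and it presumes the state is JSON-serialisable.

```python
import json
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tiny checkpoint store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, state: dict) -> None:
    """Persist the full state for a run, replacing any earlier checkpoint."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
            (run_id, json.dumps(state)),
        )


def load_checkpoint(conn: sqlite3.Connection, run_id: str) -> Optional[dict]:
    """Return the saved state for a run, or None if nothing was stored."""
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


if __name__ == "__main__":
    store = open_store()
    save_checkpoint(store, "abc123", {"risk_score": 0.91, "status": "awaiting_review"})
    print(load_checkpoint(store, "abc123"))
```

With a file path instead of `:memory:`, the saved state outlives the process, which is the property that lets a review window span restarts without recomputing Phase 1.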
Slack Notification
Includes: PR details, risk score, risk factors, reviewer instructions. Reviewer approves/rejects via /qa-approve <run_id> or REST API.
Audit Trail
Reviewer identity, decision, timestamp, and comment recorded immutably in pipeline state. Included in final report and all Jira tickets.
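One way to model such an audit entry is a frozen dataclass, sketched below. The type name `HitlDecision`, the helper `record_decision`, and the `"reject"` decision value are assumptions (the source only shows `"approve"` in its example request). A frozen dataclass prevents accidental mutation inside the process only; durable immutability would still depend on how the record is stored.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass
from datetime import datetime, timezone

ALLOWED_DECISIONS = ("approve", "reject")  # "reject" is assumed, not taken from the source


@dataclass(frozen=True)
class HitlDecision:
    run_id: str
    decision: str
    reviewer: str
    comment: str
    decided_at: str  # ISO 8601 timestamp in UTC


def record_decision(run_id: str, decision: str, reviewer: str, comment: str = "") -> HitlDecision:
    """Build an audit record for a reviewer decision."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    return HitlDecision(
        run_id=run_id,
        decision=decision,
        reviewer=reviewer,
        comment=comment,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )


if __name__ == "__main__":
    entry = record_decision("abc123", "approve", "qa-lead", "Auth changes reviewed")
    print(asdict(entry))
    try:
        entry.decision = "reject"  # any later modification is refused
    except FrozenInstanceError:
        print("audit record is read-only")
```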
# Resume via REST API
curl -X POST http://localhost:9000/api/hitl/decision \
  -H "Content-Type: application/json" \
  -d '{"run_id":"abc123","decision":"approve","reviewer":"qa-lead","comment":"Auth changes reviewed"}'
Phase 2 — 8 Agents Running in Parallel
LangGraph Send API fans out to all activated agents concurrently — no locks, no message passing
🔌 APITestAgent · Playwright MCP
🌐 E2ETestAgent · Playwright MCP
🗄️ DataValidationAgent · Postgres MCP
🖥️ CrossBrowserAgent · Playwright MCP
♿ AccessibilityAgent · Playwright MCP
🔄 RegressionAgent · Knowledge Store MCP
🔒 SecurityAgent · GitHub MCP
📈 PerformanceAgent · k6 MCP
Lock-Free Reducer Pattern
Annotated[list[AgentResult], operator.add] — all agents write results to shared state simultaneously. No coordination overhead. All results guaranteed present before Phase 3.
Execution Time Impact
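Phase 2 takes as long as its slowest agent (PerformanceAgent, ~2.5 min) rather than the sum of all agents (~18.4 min in series): roughly an 86% reduction for this phase, consistent with the 87% end-to-end reduction reported under Empirical Results.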
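The "slowest agent, not the sum" effect can be reproduced with plain `asyncio`. The sketch below is a stand-in for the fan-out, not the LangGraph Send API: `run_agent` and `run_phase2` are hypothetical names, and the sleep durations are arbitrary placeholders rather than measured agent timings.

```python
import asyncio
import time


async def run_agent(name: str, seconds: float) -> dict:
    """Simulate one test agent; the sleep stands in for real test execution."""
    await asyncio.sleep(seconds)
    return {"agent": name, "duration_s": seconds}


async def run_phase2(agents: dict) -> tuple:
    """Run all agents concurrently and return (results, wall-clock seconds)."""
    started = time.perf_counter()
    # gather() schedules every coroutine at once and preserves input order.
    results = await asyncio.gather(
        *(run_agent(name, seconds) for name, seconds in agents.items())
    )
    return list(results), time.perf_counter() - started


if __name__ == "__main__":
    simulated = {"APITestAgent": 0.02, "E2ETestAgent": 0.03, "PerformanceAgent": 0.05}
    results, elapsed = asyncio.run(run_phase2(simulated))
    serial = sum(simulated.values())
    print(f"{len(results)} results, parallel {elapsed:.3f}s vs serial {serial:.3f}s")
```

The wall-clock time tracks the longest sleep because the agents wait concurrently; this only holds when the agents are I/O-bound or run in separate workers.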
MCP Integration Layer
7 Model Context Protocol servers decouple agent logic from external service implementations
GitHub MCP · :8001 → PRContextFetcher, SecurityAgent
Playwright MCP · :8002 → E2E, CrossBrowser, A11y, API
k6 MCP · :8003 → PerformanceAgent
Jira MCP · :8004 → DefectFilerAgent
Postgres MCP · :8005 → DataValidationAgent
Slack MCP · :8006 → ReportGenerator, HITL alerts
Knowledge Store MCP · :8007 → KnowledgeRetriever, Regression
Agent call — no SDK coupling
result = await self.call_mcp(
    "jira", "create_issue",
    {"title": defect.title, "severity": "HIGH"},
)  # zero SDK imports
✅ Service Substitution — Zero Agent Changes
Jira → Linear: new MCP server + 1 env var update. Postgres → MySQL: same. All 14 agents execute unchanged. Confirmed in evaluation.
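To make the substitution claim concrete, here is a hedged sketch of what could sit behind a `call_mcp`-style helper. The ports are those listed above; everything else is an assumption: the `MCP_<NAME>_URL` environment variable convention, the `localhost` default, and the function names `resolve_endpoint` and `build_tool_call`. The request shape follows the JSON-RPC `tools/call` method from the Model Context Protocol specification. Nothing is sent over the network here.

```python
import os
from typing import Mapping, Optional

# Default ports as listed in the MCP Integration Layer table.
DEFAULT_PORTS = {
    "github": 8001,
    "playwright": 8002,
    "k6": 8003,
    "jira": 8004,
    "postgres": 8005,
    "slack": 8006,
    "knowledge_store": 8007,
}


def resolve_endpoint(server: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a logical server name to a URL, allowing an env var override."""
    env = os.environ if env is None else env
    override = env.get(f"MCP_{server.upper()}_URL")  # hypothetical naming scheme
    return override or f"http://localhost:{DEFAULT_PORTS[server]}"


def build_tool_call(tool: str, arguments: dict, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


if __name__ == "__main__":
    print(resolve_endpoint("jira", {}))
    # Pointing the same logical name at a different server needs no agent change.
    print(resolve_endpoint("jira", {"MCP_JIRA_URL": "http://linear-mcp:8004"}))
    print(build_tool_call("create_issue", {"title": "Login fails", "severity": "HIGH"}))
```

Under this scheme the agent keeps calling the logical name `"jira"`, and only the endpoint behind it changes.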
🔍 Transport-Layer Audit
Every external interaction logged at the MCP transport layer — independent of agent-level logging. Complete audit trail out of the box.
Self-Evaluation Layer
The system audits the quality of its own LLM-generated outputs on every run
DeepEval Metrics
- AnswerRelevancy ≥ 0.70 — report addresses actual defects
- Faithfulness ≥ 0.80 — claims grounded in test results
- HallucinationRate ≤ 0.20 — no unsupported assertions
RAGAS Metrics
- context_precision ≥ 0.55 — retrieved docs relevant
- faithfulness ≥ 0.75 — risk assessment grounded
- answer_relevancy ≥ 0.70 — test plan appropriate
Result: 108/120 runs passed all thresholds (90.0%). All 12 failures corresponded to runs subsequently assessed by human reviewers as containing report inaccuracies — confirming the evaluation layer's discriminative ability.
Key property: The system can express uncertainty about its own outputs — a property absent from conventional CI pipelines that emit binary pass/fail signals.
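The thresholds listed in this section can be expressed as a small lookup table. The sketch below checks a set of scores against them; the key names (prefixed with `deepeval.` and `ragas.` to keep the two faithfulness metrics apart) and the function `failed_metrics` are hypothetical, and treating a missing score as a failure is an assumption. How failures map onto APPROVE, BLOCK, or NEEDS_REVIEW is not specified in the text, so that step is left out.

```python
import operator

# (comparison, limit) per metric, taken from the thresholds listed in the text.
THRESHOLDS = {
    "deepeval.answer_relevancy": (operator.ge, 0.70),
    "deepeval.faithfulness": (operator.ge, 0.80),
    "deepeval.hallucination_rate": (operator.le, 0.20),
    "ragas.context_precision": (operator.ge, 0.55),
    "ragas.faithfulness": (operator.ge, 0.75),
    "ragas.answer_relevancy": (operator.ge, 0.70),
}


def failed_metrics(scores: dict) -> list:
    """Return the names of metrics that are missing or violate their threshold."""
    failures = []
    for name, (compare, limit) in THRESHOLDS.items():
        value = scores.get(name)
        if value is None or not compare(value, limit):
            failures.append(name)
    return failures


if __name__ == "__main__":
    run_scores = {
        "deepeval.answer_relevancy": 0.82,
        "deepeval.faithfulness": 0.88,
        "deepeval.hallucination_rate": 0.31,  # above the 0.20 ceiling
        "ragas.context_precision": 0.61,
        "ragas.faithfulness": 0.79,
        "ragas.answer_relevancy": 0.74,
    }
    print(failed_metrics(run_scores))
```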
Empirical Results
120 pull requests · 3 repositories · 6 months (Jan–Jun 2026)
8.3 min · Mean execution time · vs 63.4 min sequential
87% · Time reduction · p < 0.001 (Wilcoxon)
91.2% · Defect detection rate · F1 = 0.913
0 · Production escapes · during evaluation period
K11tech Agentic AI QA (parallel): 8.3 min (87% faster)
Sequential-CI: 63.4 min (baseline)
Defect detection: K11tech (91.2%) vs Static-Risk (67.3%), +23.9 pp
Agent portability: 0 code changes for Jira→Linear, Postgres→MySQL (100%)
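The headline percentages follow directly from the figures quoted in this document. The short check below recomputes them; the function name `percent_reduction` is illustrative, and the inputs are the reported means and counts, not raw per-PR data.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative reduction from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100


time_reduction = percent_reduction(63.4, 8.3)  # mean minutes, sequential vs parallel
detection_gain_pp = 91.2 - 67.3                # percentage points vs Static-Risk
eval_pass_rate = 108 / 120 * 100               # runs passing all evaluation thresholds

print(f"time reduction: {time_reduction:.1f}%")         # time reduction: 86.9%
print(f"detection gain: {detection_gain_pp:.1f} pp")    # detection gain: 23.9 pp
print(f"evaluation pass rate: {eval_pass_rate:.1f}%")   # evaluation pass rate: 90.0%
```

The 86.9% figure rounds to the 87% shown in the summary.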
K11 Tech Lab Agentic AI QA Demo · Open Source · Published · Reproducible
Everything needed to deploy, study, or extend the system
📄 Research Paper
Peer-reviewable preprint with full experimental methodology, results, and threat analysis.
doi.org/10.5281/zenodo.20543872
💻 Source Code
Complete implementation: 14 agents, 7 MCP servers, evaluation harness, Docker Compose.
github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system
🔬 Evaluation Harness
4 scripts reproduce the full 120-PR experiment: scaffold repos → label dataset → run pipeline → analyse results (RQ1–RQ4).
🚀 Quick Start
git clone github.com/K11-Software-Solutions/...
cp .env.example .env # add API keys
docker compose up -d
# pipeline live on :9000
📚 Want to learn how to build this? Take the LangGraph Course
11 interactive modules covering LangGraph fundamentals, state management, HITL, parallel execution, MCP integration, observability with LangSmith, and LLM evaluation — building up to this exact system as the capstone project.
Contact
kavita.jadhav@k11softwaresolutions.com
LangGraph · MCP · Agentic AI · CI/CD · HITL · DeepEval · RAGAS