K11 Tech Lab Agentic AI QA Demo

Research Preview · June 2026

Autonomous CI/CD Quality Assurance Using LangGraph Multi-Agent Orchestration & Risk-Proportionate HITL

A production-grade agentic AI system that autonomously executes the complete CI/CD quality gate — from pull request analysis to defect filing — without manual QA intervention for routine changes.

Kavita Jadhav · K11 Software Solutions LLC

DOI: 10.5281/zenodo.20543872

LangGraph · 14 Agents · 7 MCP Servers · Human-in-the-Loop · DeepEval + RAGAS · CI/CD

The Problem with Conventional CI/CD Quality Gates

Three structural deficiencies that limit software quality at scale

⚠️ Static Test Scope

Identical checks regardless of change risk. A one-line fix and a complete auth refactor receive the same treatment — wasting resources on low-risk changes and missing depth on high-risk ones.

⏱️ Sequential Execution

Independent test suites run in series. 8 suites × 8 minutes = 64 minutes per PR. In high-frequency pipelines this bottleneck directly throttles delivery velocity.

👤 Manual Coordination

QA engineers context-switch between test execution, result interpretation, defect documentation, and stakeholder communication — introducing delays and human error.

The result: development throughput scales with tooling; QA throughput scales with human attention. In high-velocity organisations, pull request volumes routinely exceed QA team capacity — forcing triage decisions that introduce systemic risk.

System Architecture — Four-Phase Pipeline

LangGraph StateGraph with shared CIPipelineState flowing through all nodes

⚡ GitHub PR Webhook Trigger

pull_request event → pipeline start

▼

🔍 PHASE 1 — Analysis

fetch_pr · retrieve_knowledge · score_risk · generate_test_plan

▼

🛡️ HITL Gate

risk ≥ 0.85 → suspend + notify reviewer

▼

⚙️ PHASE 2 — Parallel Agents

8 agents via Send API · concurrent execution

▼

📊 PHASE 3 — Reporting

aggregate · report · Jira tickets · Slack

▼

✅ EVALUATION — DeepEval + RAGAS

quality_gate → APPROVE / BLOCK / NEEDS_REVIEW

Shared State Design

CIPipelineState TypedDict with 24 fields. Phase 2 fields use Annotated[list, operator.add] reducers — lock-free concurrent writes from multiple agents.
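A minimal sketch of the reducer idea, using only the standard library rather than LangGraph itself. Only four of the 24 fields are shown; `agent_results`, `defects`, and the `apply_update` helper are hypothetical names chosen for illustration. It shows how an `Annotated[list, operator.add]` annotation can tell a merge step to concatenate writes instead of overwriting them.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class CIPipelineState(TypedDict, total=False):
    # Phase 1 style fields: plain values, last write wins.
    run_id: str
    risk_score: float
    # Phase 2 style fields: the annotation carries the reducer used to merge writes.
    agent_results: Annotated[list, operator.add]  # hypothetical field name
    defects: Annotated[list, operator.add]        # hypothetical field name


def apply_update(state: dict, update: dict) -> dict:
    """Merge one node's partial update into the state, honouring reducers."""
    hints = get_type_hints(CIPipelineState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints.get(key), "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # list + list
        else:
            merged[key] = value
    return merged


state = {"run_id": "abc123", "risk_score": 0.91, "agent_results": []}
state = apply_update(state, {"agent_results": [{"agent": "APITestAgent", "passed": True}]})
state = apply_update(state, {"agent_results": [{"agent": "SecurityAgent", "passed": False}]})
print([r["agent"] for r in state["agent_results"]])
```

Because each agent returns its own list and the merge is a concatenation, no agent needs to read or lock another agent's results.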

Persistence

Checkpoints are stored in SQLite (dev) or in Postgres via PostgresSaver (prod). The pipeline survives process restarts, and a HITL suspension can stay open indefinitely.

Deployment

Single docker compose up command. FastAPI webhook on port 9000. All 7 MCP servers containerised.

Phase 1 — Intelligent Analysis

Risk-proportionate test planning using LLM analysis + knowledge retrieval

1️⃣ fetch_pr_context

Calls GitHub MCP server to retrieve: PR diff, changed file list, commit messages, PR metadata, open review comments.

2️⃣ retrieve_knowledge

Semantic search over Knowledge Store MCP: historical defect records, regression suites, domain-specific testing guidelines for affected components.

3️⃣ score_risk (LLM)

GPT-4o-mini with temperature=0. Produces risk score [0.0–1.0] considering: lines changed, files affected, historical defect rate, security-sensitive patterns, test coverage gaps.

4️⃣ generate_test_plan (LLM)

Uses risk score + change analysis + knowledge context to select which test suites to activate. Low-risk: API + regression only. High-risk: all 8 suites.

# Risk score output example
{
  "risk_score": 0.91,
  "risk_factors": ["auth module modified", "SQL query changed", "3 historical incidents"],
  "activated_suites": ["api", "e2e", "security", "data", "regression"],
  "hitl_required": true  # risk >= 0.85
}
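One way to guard this output before it enters pipeline state is to parse and range-check it, then derive the HITL flag from the score instead of trusting the model's own flag. This is a sketch under assumptions: `RiskAssessment`, `parse_risk_output`, `route_after_analysis`, and the node names `hitl_gate` and `parallel_agents` are hypothetical and not taken from the project source.

```python
import json
from dataclasses import dataclass

HITL_THRESHOLD = 0.85  # described as configurable; the constant name is an assumption


@dataclass(frozen=True)
class RiskAssessment:
    risk_score: float
    risk_factors: tuple
    activated_suites: tuple

    @property
    def hitl_required(self) -> bool:
        # Derived from the score so the flag cannot disagree with it.
        return self.risk_score >= HITL_THRESHOLD


def parse_risk_output(raw: str) -> RiskAssessment:
    """Validate the model's JSON reply; raise ValueError on an out-of-range score."""
    data = json.loads(raw)
    score = float(data["risk_score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"risk_score out of range: {score}")
    return RiskAssessment(
        risk_score=score,
        risk_factors=tuple(data.get("risk_factors", [])),
        activated_suites=tuple(data.get("activated_suites", [])),
    )


def route_after_analysis(assessment: RiskAssessment) -> str:
    """Pick the next node: suspend for review, or go straight to the test agents."""
    return "hitl_gate" if assessment.hitl_required else "parallel_agents"


high = parse_risk_output('{"risk_score": 0.91, "activated_suites": ["api", "security"]}')
low = parse_risk_output('{"risk_score": 0.30, "activated_suites": ["api", "regression"]}')
print(route_after_analysis(high), route_after_analysis(low))
```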

Human-in-the-Loop Gate

Risk-proportionate escalation — only triggers when risk score ≥ 0.85 (configurable)

🔴 risk_score ≥ 0.85 → ⏸️ interrupt(): suspend pipeline → 📨 Slack alert to reviewer → ✅ POST /hitl/decision → ▶️ resume from checkpoint

State Preservation

Full CIPipelineState serialised to checkpoint DB at suspension. Process restarts during review window are fully tolerated. No recomputation on resume.
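The idea can be illustrated with a small stand-in built on the standard library's `sqlite3` module. This is not LangGraph's checkpointer; the table layout and the `open_store`, `save_checkpoint`, and `load_checkpoint` helpers are assumptions made for the sketch. It shows state written at suspension being read back by a fresh connection, as would happen after a process restart.

```python
import json
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, run_id: str, state: dict) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
            (run_id, json.dumps(state)),
        )


def load_checkpoint(conn: sqlite3.Connection, run_id: str):
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
conn = open_store(path)
save_checkpoint(conn, "abc123", {"run_id": "abc123", "risk_score": 0.91, "status": "awaiting_review"})
conn.close()  # simulate the process exiting during the review window

conn = open_store(path)  # a new process reopens the same file
resumed = load_checkpoint(conn, "abc123")
print(resumed["status"])
```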

Slack Notification

Includes: PR details, risk score, risk factors, reviewer instructions. Reviewer approves/rejects via /qa-approve <run_id> or REST API.

Audit Trail

Reviewer identity, decision, timestamp, and comment recorded immutably in pipeline state. Included in final report and all Jira tickets.

# Resume via REST API
curl -X POST http://localhost:9000/api/hitl/decision \
  -H 'Content-Type: application/json' \
  -d '{"run_id":"abc123","decision":"approve","reviewer":"qa-lead","comment":"Auth changes reviewed"}'
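The same call can be built from Python with the standard library. This sketch assumes the endpoint accepts a JSON body with the four fields used in the curl example; `build_decision_request` is a hypothetical helper. The request is constructed but not sent, so the snippet runs without a live pipeline.

```python
import json
import urllib.request


def build_decision_request(base_url: str, run_id: str, decision: str,
                           reviewer: str, comment: str = "") -> urllib.request.Request:
    """Build the POST request that resumes a suspended run."""
    payload = {
        "run_id": run_id,
        "decision": decision,
        "reviewer": reviewer,
        "comment": comment,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/hitl/decision",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_decision_request(
    "http://localhost:9000", "abc123", "approve", "qa-lead", "Auth changes reviewed"
)
print(req.get_method(), req.full_url)
# To send it against a running instance: urllib.request.urlopen(req)
```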

Phase 2 — 8 Agents Running in Parallel

LangGraph Send API fans out to all activated agents concurrently — no locks, no message passing

🔌 APITestAgent · Playwright MCP
🌐 E2ETestAgent · Playwright MCP
⚡ PerformanceAgent · k6 MCP
🔒 SecurityAgent · GitHub MCP
🗄️ DataValidationAgent · Postgres MCP
🖥️ CrossBrowserAgent · Playwright MCP
♿ AccessibilityAgent · Playwright MCP
🔄 RegressionAgent · Knowledge Store MCP

Lock-Free Reducer Pattern

Annotated[list[AgentResult], operator.add] — all agents write results to shared state simultaneously. No coordination overhead. All results guaranteed present before Phase 3.
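The fan-out and merge can be imitated with plain `asyncio`, as a stand-in for LangGraph's Send API. The delays below are made-up placeholders scaled to milliseconds, and `run_agent` and `fan_out` are hypothetical helpers. Each agent returns its own one-item list, and the lists are concatenated with `operator.add` once every agent has finished.

```python
import asyncio
import operator
from functools import reduce

# Placeholder delays in seconds; real agents would call their MCP servers instead.
AGENT_DELAYS = {
    "APITestAgent": 0.01,
    "E2ETestAgent": 0.02,
    "PerformanceAgent": 0.05,
    "SecurityAgent": 0.01,
    "DataValidationAgent": 0.01,
    "CrossBrowserAgent": 0.02,
    "AccessibilityAgent": 0.01,
    "RegressionAgent": 0.01,
}


async def run_agent(name: str, delay: float) -> dict:
    await asyncio.sleep(delay)  # stands in for the agent's real work
    return {"agent_results": [{"agent": name, "status": "passed"}]}


async def fan_out(activated: list) -> list:
    # gather() starts every agent at once and waits for all of them.
    updates = await asyncio.gather(
        *(run_agent(name, AGENT_DELAYS[name]) for name in activated)
    )
    # Same merge a list reducer performs: concatenate each agent's list.
    return reduce(operator.add, (u["agent_results"] for u in updates), [])


results = asyncio.run(fan_out(list(AGENT_DELAYS)))
print(len(results), "results collected")
```

Since the merge only runs after `gather()` returns, every activated agent's result is present before any reporting step would start.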

Execution Time Impact

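Phase 2 duration equals the slowest single agent (PerformanceAgent, ~2.5 min) rather than the sum of all agents (~18.4 min serial), roughly an 86% reduction for this phase. The 87% headline figure is the end-to-end reduction (8.3 min vs 63.4 min).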
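The percentages can be checked directly from the timings quoted in this document. The `reduction_pct` helper is introduced here only for the arithmetic.

```python
def reduction_pct(baseline_min: float, actual_min: float) -> float:
    """Percentage of the baseline duration that was saved."""
    return (1 - actual_min / baseline_min) * 100


phase2 = reduction_pct(18.4, 2.5)      # serial sum vs slowest single agent
end_to_end = reduction_pct(63.4, 8.3)  # sequential CI vs mean parallel run

print(f"Phase 2: {phase2:.1f}%")          # Phase 2: 86.4%
print(f"End-to-end: {end_to_end:.1f}%")   # End-to-end: 86.9%
```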

MCP Integration Layer

7 Model Context Protocol servers decouple agent logic from external service implementations

GitHub MCP · :8001 → PRContextFetcher, SecurityAgent
Playwright MCP · :8002 → E2E, CrossBrowser, A11y, API
k6 MCP · :8003 → PerformanceAgent
Jira MCP · :8004 → DefectFilerAgent
Postgres MCP · :8005 → DataValidationAgent
Slack MCP · :8006 → ReportGenerator, HITL alerts
Knowledge Store MCP · :8007 → KnowledgeRetriever, Regression

Agent call — no SDK coupling

result = await self.call_mcp(
  "jira", "create_issue",
  {"title": defect.title,
   "severity": "HIGH"}
)  # zero SDK imports

✅ Service Substitution — Zero Agent Changes

Jira → Linear: new MCP server + 1 env var update. Postgres → MySQL: same. All 14 agents execute unchanged. Confirmed in evaluation.
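A rough sketch of how such a substitution could work. The `MCP_JIRA_URL` environment variable, the `BaseAgent` class, the injected `transport` callable, and the replacement URL are all assumptions for illustration, not the project's actual code. The agent only knows a logical server name and a tool name; where the call lands is decided by configuration.

```python
import asyncio
import os

DEFAULT_URLS = {"jira": "http://localhost:8004"}  # port taken from the server list above


class BaseAgent:
    def __init__(self, transport):
        # transport: async callable (url, tool, arguments) -> dict
        self._transport = transport

    async def call_mcp(self, server: str, tool: str, arguments: dict) -> dict:
        # Hypothetical convention: MCP_<SERVER>_URL overrides the default address.
        url = os.environ.get(f"MCP_{server.upper()}_URL", DEFAULT_URLS[server])
        return await self._transport(url, tool, arguments)


class DefectFilerAgent(BaseAgent):
    async def file_defect(self, title: str) -> dict:
        return await self.call_mcp(
            "jira", "create_issue", {"title": title, "severity": "HIGH"}
        )


async def fake_transport(url: str, tool: str, arguments: dict) -> dict:
    # Echoes the call so the routing is visible without a real server.
    return {"url": url, "tool": tool, "arguments": arguments}


os.environ.pop("MCP_JIRA_URL", None)  # start from the default for a repeatable demo
agent = DefectFilerAgent(fake_transport)
before = asyncio.run(agent.file_defect("Login fails"))

os.environ["MCP_JIRA_URL"] = "http://linear-mcp:8004"  # hypothetical replacement server
after = asyncio.run(agent.file_defect("Login fails"))
print(before["url"], "->", after["url"])
```

The agent class is identical in both calls; only the environment variable changed.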

🔍 Transport-Layer Audit

Every external interaction logged at the MCP transport layer — independent of agent-level logging. Complete audit trail out of the box.

Self-Evaluation Layer

The system audits the quality of its own LLM-generated outputs on every run

DeepEval Metrics

AnswerRelevancy ≥ 0.70 — report addresses actual defects
Faithfulness ≥ 0.80 — claims grounded in test results
HallucinationRate ≤ 0.20 — no unsupported assertions

RAGAS Metrics

context_precision ≥ 0.55 — retrieved docs relevant
faithfulness ≥ 0.75 — risk assessment grounded
answer_relevancy ≥ 0.70 — test plan appropriate
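The thresholds above can be expressed as a small lookup table and checked in one pass. This sketch does not call DeepEval or RAGAS; it assumes their scores have already been collected into a dictionary, and the `deepeval.` / `ragas.` key prefixes and the `failed_metrics` helper are hypothetical. How failures map to APPROVE, BLOCK, or NEEDS_REVIEW is not specified here, so the sketch stops at listing the failed metrics.

```python
import operator

# metric key -> (comparison, threshold), using the limits listed above
THRESHOLDS = {
    "deepeval.AnswerRelevancy": (operator.ge, 0.70),
    "deepeval.Faithfulness": (operator.ge, 0.80),
    "deepeval.HallucinationRate": (operator.le, 0.20),
    "ragas.context_precision": (operator.ge, 0.55),
    "ragas.faithfulness": (operator.ge, 0.75),
    "ragas.answer_relevancy": (operator.ge, 0.70),
}


def failed_metrics(scores: dict) -> list:
    """Return the metrics that miss their threshold; a missing score counts as a miss."""
    failures = []
    for name, (compare, limit) in THRESHOLDS.items():
        value = scores.get(name)
        if value is None or not compare(value, limit):
            failures.append(name)
    return failures


good_run = {
    "deepeval.AnswerRelevancy": 0.82,
    "deepeval.Faithfulness": 0.90,
    "deepeval.HallucinationRate": 0.05,
    "ragas.context_precision": 0.61,
    "ragas.faithfulness": 0.80,
    "ragas.answer_relevancy": 0.74,
}
bad_run = dict(good_run, **{"deepeval.HallucinationRate": 0.30})

print(failed_metrics(good_run))
print(failed_metrics(bad_run))
```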

Result: 108/120 runs passed all thresholds (90.0%). Human reviewers subsequently assessed all 12 failing runs as containing report inaccuracies, so every evaluation failure in this sample flagged a real problem.

Key property: The system can express uncertainty about its own outputs — a property absent from conventional CI pipelines that emit binary pass/fail signals.

Eval Pass Rate: 90.0%

Empirical Results

120 pull requests · 3 repositories · 6 months (Jan–Jun 2026)

8.3 min · Mean execution time (vs 63.4 min sequential)
87% · Time reduction (p < 0.001, Wilcoxon)
91.2% · Defect detection rate (F1 = 0.913)
0 · Production escapes during evaluation period

K11tech Agentic AI QA (parallel): 8.3 min · 87% faster

Sequential-CI: 63.4 min · baseline

Defect Detection: K11tech (91.2%) vs Static-Risk (67.3%) · +23.9pp

Agent portability: 0 code changes for Jira→Linear, Postgres→MySQL · 100%

K11 Tech Lab Agentic AI QA Demo · Open Source · Published · Reproducible

Everything needed to deploy, study, or extend the system

📄 Research Paper

Peer-reviewable preprint with full experimental methodology, results, and threat analysis.

doi.org/10.5281/zenodo.20543872

💻 Source Code

Complete implementation: 14 agents, 7 MCP servers, evaluation harness, Docker Compose.

github.com/K11-Software-Solutions/k11techlab-agentic-ai-qa-system

🔬 Evaluation Harness

4 scripts reproduce the full 120-PR experiment: scaffold repos → label dataset → run pipeline → analyse results (RQ1–RQ4).

🚀 Quick Start

git clone github.com/K11-Software-Solutions/...

cp .env.example .env  # add API keys

docker compose up -d

# pipeline live on :9000

📚 Want to learn how to build this? Take the LangGraph Course

11 interactive modules covering LangGraph fundamentals, state management, HITL, parallel execution, MCP integration, observability with LangSmith, and LLM evaluation — building up to this exact system as the capstone project.


Contact

kavita.jadhav@k11softwaresolutions.com

LangGraph · MCP · Agentic AI CI/CD · HITL · DeepEval · RAGAS