
AI/CD: Production-Grade Autonomous Developer Agent (Human-in-the-Loop)

TL;DR: AI/CD is designed to replace the execution capacity typically provided by junior to mid-level engineers for well-scoped tasks. Like human engineers at those levels, it does not infer intent or guess requirements; it requires explicit scope, constraints, and acceptance criteria. It is built as a human-in-the-loop autonomous developer agent that converts well-scoped, labeled GitHub issues into production-ready pull requests via a 5-stage pipeline (triage → plan → generate → review → PR). The system handles feature development and bug fixes, treating each issue as a first-class engineering task: planning the implementation, modifying production code, authoring tests, validating results, and submitting PRs for mandatory human review. It runs as a multi-tenant SaaS platform: dual-schema Postgres with Row-Level Security isolation separating Go (operational) and Next.js (admin) concerns, policy enforcement, vector-based context retrieval, and full observability via Sentry error tracking, Prometheus (30+ metrics), and Grafana dashboards. Early production deployment to 1-5 organizations processed ~50-100 labeled issues, generated ~35-85 PRs, and merged ~28-75 PRs over 2 months.
👤Role: Sole Engineer / Architect
📏Scope: Full-stack SaaS platform: Go backend (11 packages) + Next.js 15 admin dashboard
⏱️Timeline: 1-2 months (solo implementation)
🔧Environment: Docker / PostgreSQL / Go 1.24 / Next.js 15 / Hetzner Cloud

📸Snapshot

Business Objective

A growing issue backlog required execution capacity for well-scoped, routine engineering tasks (bug fixes, feature implementations, refactors). The team had bandwidth to define and review work, but lacked the junior-to-mid-level execution capacity to implement it. Rather than adding headcount, the goal was a system that could reliably execute well-specified tickets with human oversight.

Primary Technical Outcome

Shipped a production system that replaces junior-to-mid level execution capacity for well-scoped tasks. GitHub App ingestion → 5-stage pipeline (triage, plan, generate, review, PR) → production-ready pull requests with mandatory human review. Backed by multi-tenant controls, policy enforcement, semantic context retrieval, and full observability.

My Role

  • End-to-end ownership: architecture, implementation, testing, deployment, and documentation
  • Designed dual-schema Postgres pattern for Go (operational) + Next.js (admin/config) separation
  • Implemented 5-stage pipeline and policy engine (label/path/change limits)
  • Built multi-tenant isolation with RLS + app-layer authorization
  • Integrated embeddings-based context retrieval (Qdrant)
  • Instrumented production observability (Sentry + Prometheus/Grafana) and built a comprehensive test suite

Key Metrics

  • Solo engineer; shipped to production in ~6-8 weeks; deployed to 1-5 organizations
  • Production metrics (2-month window; labeled issues only): Ingested (labeled): ~50-100 issues → Eligible (policy + triage allowed): ~42-90 → PRs created: ~35-85 → Merged after human review: ~28-75
  • Median time issue → PR: ~6-12 hours (vs. days for manual implementation)
  • Ticket rejection rate: ~10-15%, primarily due to insufficient scope, missing acceptance criteria, or policy violations. Rejected issues typically resembled requests that would also block a human engineer ("do X somehow", unclear intent, or unbounded scope).
  • Transient stage errors: ~10-15% of runs (LLM/API timeouts, GitHub rate limits, parsing); typically resolved via automated retry/backoff
  • Task envelope: Capable of feature development and bug fixes for small-to-medium scoped changes (new features, refactors, config updates, test additions) with explicit change limits per policy (files/LOC), and repo access constrained to enabled repositories only
  • Safety: 100% PR review required; zero auto-merge
  • Multi-tenant architecture: Row-Level Security + dual-schema PostgreSQL (Go + Next.js)
  • 5-stage AI pipeline: triage → plan → generate → review → PR (independent quality gates)
  • Observability: Sentry error tracking + Prometheus (30+ metrics) + Grafana dashboards
  • Test coverage: 148+ tests covering critical paths (triage, policy engine, multi-tenant isolation)
  • ⚠️ Metrics based on limited production data (1-5 organizations, ~2 months operation)
Stakeholders
  • Company Leadership: Gained automated engineering capacity for routine tasks without incremental headcount. Enabled feature delivery despite resource constraints. Established reusable automation pattern for future internal systems.
  • Engineering Team: Freed from repetitive manual bug fixes. AI handles routine issues while humans focus on complex problems. Quality maintained via mandatory PR review process.
  • Early Adopter Organizations (1-5): Access to automated developer assistance for issue backlog. Faster turnaround on routine fixes. Human oversight maintains code quality standards.

📋Context

AI/CD (Autonomous Intelligence / Continuous Delivery) is a GitHub App that acts as an autonomous developer agent.

Execution Model: AI/CD is intentionally designed to replace the execution capacity typically provided by junior to mid-level engineers for well-scoped tasks. Like human engineers at those levels, it does not infer intent or guess requirements. When tickets are explicit, with clear scope, constraints, and acceptance criteria, the system plans, implements, tests, validates, and submits production-ready pull requests. When tickets are vague or underspecified, the system stalls or rejects them during triage, mirroring how human execution breaks down in real teams. This behavior is a feature: it prevents silent misinterpretation, enforces engineering discipline, and shifts clarity upstream where it belongs.

Why This Approach Works: In practice, poorly specified tickets cause junior and mid-level engineers to spin, over-interpret, or block. AI/CD behaves the same way, but enforces that constraint explicitly through triage and rejection rather than silent time loss. Treating underspecified work as invalid input is intentional. Silent guessing is one of the primary sources of technical debt in human teams; AI/CD makes that failure mode visible and measurable. By enforcing ticket clarity and rejecting ambiguous work, the system shifts effort upstream, forcing product and engineering leadership to define work properly once instead of paying repeated execution cost downstream.

Capability: The system handles feature development and bug fixes. It treats each issue as a first-class engineering task: planning the implementation, modifying production code, authoring tests, validating results, and submitting a pull request for review. Because only explicitly labeled, well-scoped issues entered the pipeline, feature implementation success rates were high once an issue passed triage.

Ticket Quality Examples (Abstracted):
  • Accepted: "Add pagination to /api/jobs with cursor-based pagination; update service + handler; add unit tests for pagination edge cases."
  • Rejected: "Make job handling better / faster."

System Design: Only issues explicitly labeled for AI/CD (e.g., ai-task) entered the pipeline; unlabeled issues were ignored. When labeled issues are opened in monitored repositories, the system: (1) evaluates whether the issue is actionable (triage), (2) creates an implementation plan with relevant code context, (3) generates code changes automatically, (4) self-reviews the changes for quality/security, (5) opens a pull request for human review. The orchestrator decomposes a task into staged sub-agent work (planning, context retrieval, implementation, review, and test authoring). For eligible issues, it typically produces a complete PR: code changes, tests, and a structured PR description aligned to repo conventions, then hands off to mandatory human review. The system is designed for complete human-in-the-loop oversight: all code must be reviewed before merge.

Architecture: The monorepo includes ai-cd/ (Go backend service for the webhook handler, orchestrator, and AI pipeline) and ai-cd-marketing/ (Next.js admin dashboard for the configuration UI). Shared infrastructure includes PostgreSQL with a dual-schema architecture (public schema for backend operational data, aicd_admin schema for frontend configuration), a Redis job queue (Asynq), a Qdrant vector database for code embeddings, and Docker Compose for local development.

⚠️Problem

Symptoms

  • Growing issue backlog with limited capacity to address routine bugs and features
  • Manual bug fixes consuming disproportionate time from small engineering team
  • Technical debt accumulating due to lack of routine maintenance bandwidth
  • Knowledge silos - certain issues only certain team members could fix efficiently
  • Inconsistent code patterns across repositories (no standardized approach)
  • No automation for routine, repetitive tasks (every fix required manual implementation)

Root Causes

  • Capacity Constraint: Small engineering team with limited bandwidth for incremental headcount. Growing backlog required automation to scale engineering output without proportional team growth.
  • Manual Process: Every bug fix, feature, and maintenance task required human implementation time. No system existed to automate routine, repetitive tasks.
  • Infrastructure Gap: No infrastructure for autonomous code generation. Required building from scratch: AI pipeline, safety controls, multi-tenant isolation, policy engine, observability.

Risk if Unresolved

  • Engineering Risk: Accumulating technical debt becomes unmanageable, bug backlog grows until product becomes unstable, team burnout from endless manual work on routine tasks, quality degradation due to rushed fixes.
  • Business Risk: Slow velocity prevents delivering customer-requested features, competitive disadvantage due to inability to ship quickly, forced to deprioritize routine maintenance in favor of critical features (technical debt spiral).

🔒Constraints & Requirements

Constraints

  • Timeline: 1-2 month delivery window (solo engineer, no team support), had to balance speed with production-ready quality
  • Budget: Zero budget for third-party services or commercial tools, LLM API costs must be carefully controlled, infrastructure must run on minimal cloud resources
  • Team Size: Solo engineer (no design, no QA, no DevOps support), all architecture, implementation, testing, documentation done by one person
  • Technical: Must integrate with existing GitHub workflows, cannot require changes to existing codebases, must maintain security isolation between organizations, cannot auto-merge code (human review required)
  • Scope: Must be production-ready not a demo/MVP, need admin dashboard for configuration, require monitoring and observability from day one, need proper multi-tenant architecture for SaaS viability

Success Criteria

  • Functional: Process GitHub issue webhooks automatically, generate code changes that compile and pass basic quality checks, create pull requests with full context and audit trail, support multiple organizations with complete data isolation, provide admin UI for configuring agents, policies, and quotas
  • Quality: Multi-tenant security with Row-Level Security, comprehensive error handling and retry logic, structured logging for debugging, test coverage for critical paths (148+ tests), production-ready monitoring (Sentry, Prometheus, Grafana)
  • Business: Reduce manual engineering effort on routine tasks, maintain code quality via human review process, control LLM costs via quotas and caching, scalable architecture to support future SaaS offering, help other companies facing same resource constraints

Non-Goals

  • Not Building: Fully autonomous system that auto-merges code (too risky), support for every programming language (focus on Go, JS, Python initially), complex CI/CD pipeline integration (just create PRs, let existing CI run), native mobile apps (web dashboard sufficient), real-time collaboration features (async workflow is fine)
  • Out of Scope: Replacing human engineers entirely (augmentation, not replacement), handling complex architectural decisions (focus on routine tasks), training custom LLM models (use Claude/GPT-4 via APIs)

🎯Strategy

Options Considered

  1. Option A: Use Existing Tools (GitHub Copilot, Cursor): Leverage IDE assistants for code generation
    Pros: Already built, proven UX, immediate availability. Cons: IDE assistants require a human to drive the interaction and don't solve the autonomous issue-processing problem.
    Why not chosen: Doesn't address the core requirement: an autonomous system that works without an active human driver.
  2. Option B: Contract Development: Hire contractors to build the agent
    Pros: External expertise, potentially faster initial development. Cons: High upfront cost, ongoing maintenance dependencies, doesn't build internal capability.
    Why not chosen: Defeats the purpose of building automation to address capacity constraints; internal ownership and learning were required.
  3. Option C: Build Custom Solution (CHOSEN): Build from scratch with complete control over pipeline
    Pros: Complete control over AI pipeline, safety controls, and multi-tenant architecture. One-time development investment. Demonstrates engineering capability for portfolio. Cons: 1-2 month solo development time. Need to build everything from scratch. Ongoing maintenance responsibility.
    Why chosen: Only option providing the required autonomy, safety controls, and technical ownership.

Decision Rationale

  • Go + Next.js Stack: Go for high-performance backend (webhook processing, job orchestration, LLM API calls). Next.js for modern admin dashboard with Server Components (fast, secure, type-safe).
  • Dual-Schema Database: Cleanly separates Go backend data (public schema) from Next.js frontend data (aicd_admin schema). Prevents ORM collisions between database/sql and Prisma. A cross-schema read sketch follows this list.
  • 5-Stage AI Pipeline: Breaking workflow into specialized stages (triage → plan → generate → review → PR) provides quality control gates. Each stage can fail independently without corrupting downstream steps.
  • Multi-Tenant from Day One: Building SaaS architecture from start ensures scalability. Row-Level Security provides database-level isolation guarantees.
  • Human-in-the-Loop: Never auto-merge code. AI generates PRs, humans review and approve. Maintains quality standards, builds trust, essential for production use.
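To make the dual-schema decision concrete, here is a minimal sketch (not the production code) of the cross-schema read it enables: the Go backend reads frontend-owned agent configuration from the aicd_admin schema while writing only to public. Table and column names are assumptions based on the description in this case study.

// Sketch of the cross-schema read: the Go backend reads agent configuration
// owned by the Next.js side (aicd_admin schema) and writes only to its own
// public schema. Column names are simplified for illustration.
package db

import (
    "context"
    "database/sql"
)

type AgentConfig struct {
    AgentType string
    Model     string
    Enabled   bool
}

// LoadAgentConfigs is read-only against aicd_admin; the backend never writes there.
func LoadAgentConfigs(ctx context.Context, db *sql.DB, orgID int64) ([]AgentConfig, error) {
    rows, err := db.QueryContext(ctx,
        `SELECT agent_type, model, enabled
           FROM aicd_admin.agents
          WHERE organization_id = $1`, orgID)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var configs []AgentConfig
    for rows.Next() {
        var c AgentConfig
        if err := rows.Scan(&c.AgentType, &c.Model, &c.Enabled); err != nil {
            return nil, err
        }
        configs = append(configs, c)
    }
    return configs, rows.Err()
}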

🚀Execution

Plan & Phases

  1. Phase 1: Foundation & MVP: GitHub App setup, webhook handling, PostgreSQL with RLS, job queue, Claude AI integration, basic PR creation, Docker Compose. Deliverable: End-to-end MVP (webhook → PR).
  2. Phase 2: Safety & Quality Controls: Multi-stage pipeline (triage → plan → generate → review → PR), policy engine, Sentry/Prometheus monitoring, test suite (148+ tests). Deliverable: Production-grade safety controls.
  3. Phase 3: Multi-Tenant Platform: Dual-schema PostgreSQL, Next.js dashboard, NextAuth OAuth, Prisma multi-schema, Qdrant vector search. Deliverable: Full admin UI with tenant isolation.
  4. Phase 4: Production Readiness: Plans/quotas, usage tracking, admin UI for policies/agents, documentation, legal templates, deployment to 1-5 early adopters. Deliverable: Production SaaS platform.
Timeline
  • Weeks 1-2: Foundation: GitHub App, webhook handler, database schema, job queue, initial Claude integration, PR creation, Docker Compose. Deliverable: End-to-end MVP (webhook → PR).
  • Weeks 2-3: Quality & Safety: Multi-stage pipeline, policy engine, error handling, monitoring (Sentry, Prometheus), test suite. Deliverable: Production-grade safety controls.
  • Weeks 3-4: Platform: Next.js dashboard, NextAuth authentication, dual-schema database, vector embeddings, semantic code search. Deliverable: Full admin UI with intelligence layer.
  • Weeks 4-6: Production: Plans/quotas, usage tracking, full admin UI, documentation, legal templates, deployment to early adopters. Deliverable: Production SaaS platform with 1-5 organizations.

Rollout & Risk Controls

  • Feature Flags: AI features can be disabled per-tenant, policy engine can block all or specific repos, manual approval required for new installations
  • Rate Limiting: Quota enforcement (actions per month), LLM token limits with soft/hard caps, webhook throttling to prevent abuse
  • Monitoring & Alerts: Sentry for error tracking (with context propagation), Prometheus metrics for system health, Slack/Discord notifications for critical events, daily quota usage summaries
  • Progressive Rollout: Started with internal use only (dogfooding), gradually added 1-5 early adopter organizations, collected feedback before broader release, human review of every PR before merge (built-in safety)
  • Rollback Procedures: Database migrations have rollback scripts, feature flags for instant disable, job queue can be paused/drained, GitHub App can be uninstalled cleanly
  • Cost Controls: Prompt caching to reduce LLM API calls, token counting before API calls (prevent surprises), monthly budget alerts per tenant, automatic pause if quota exceeded

🏗️Architecture

System Components

  • Backend (Go Service): 11 internal packages: api (HTTP handlers for webhooks/REST), db (database models, CRUD, transactions), github (GitHub API client), llm (Claude AI client with 4 agent types), embeddings (Qdrant integration, semantic search), orchestrator (Asynq job queue, worker pool, retry logic), policy (policy engine evaluation), quota (usage tracking, enforcement), notifications (multi-channel system), monitoring (Prometheus metrics - 30+), sentry (error tracking with context). Technologies: Go 1.24.3, Anthropic Claude SDK, GitHub API v57 with OAuth2 + JWT, Asynq (Redis-backed queue), PostgreSQL with database/sql, Qdrant vector database client.
  • Frontend (Next.js Dashboard): Next.js 15 with App Router and Turbopack, React 19.2.3 with Server Components (default), Client Components for interactive forms only, direct database queries via Prisma (no REST API layer for reads), Server Actions for mutations. Technologies: TypeScript 5.9 strict mode, Prisma 6.19.1 (multi-schema support), NextAuth.js 5.0.0-beta.30 (GitHub OAuth), Tailwind CSS 3.4 with custom theme, shadcn/ui components (20+ Radix primitives), React Hook Form 7.69.0 + Zod 3.25.76 validation, Recharts for analytics, pnpm 9.0.0.
  • Shared Infrastructure: PostgreSQL 14+ dual-schema (public schema for backend operational data with 12 migration files and Row-Level Security policies, aicd_admin schema for frontend configuration managed by Prisma migrations). Redis 7 job queue with Asynq (5 job types, exponential backoff retry, circuit breaker). Qdrant vector database (1024-dimensional embeddings, cosine similarity search, code indexing pipeline). Docker Compose for local dev (PostgreSQL 16-alpine, Redis 7-alpine, Qdrant, Prometheus, Grafana).

Data Flows

  • Ingress (GitHub Issue Created): GitHub webhook → Go backend /webhook/github endpoint → validate signature, extract issue details → check tenant exists and is active → enqueue triage job in Redis (Asynq) → return 200 OK to GitHub (fast webhook response). A handler sketch follows this list.
  • Processing Stage 1 (Triage): Worker pulls job from queue → fetch issue details from database → evaluate against policy engine (labels, paths) → call Claude API (triage agent) to assess actionability → store triage result in database → if approved enqueue planning job, else mark rejected
  • Processing Stage 2 (Planning): Fetch repository code from GitHub → query Qdrant for relevant code context (semantic search on issue description) → call Claude API (planning agent) with issue + code context → parse structured implementation plan → store plan in database → enqueue code generation job
  • Processing Stage 3 (Code Generation): Fetch plan from database → call Claude API (generation agent) with plan + code context → parse JSON response with file modifications → store code diff in database → enqueue review job
  • Processing Stage 4 (Review): Fetch generated code diff → call Claude API (review agent) to self-assess quality/security → if issues found, can trigger regeneration (max 2 retries) → if approved, enqueue PR creation job → else mark job failed with review feedback
  • Processing Stage 5 (PR Creation): Create feature branch in GitHub → commit code changes from diff → open pull request with generated description → link PR to job in database → send notifications (Slack, Discord, GitHub comment) → mark job complete
  • Egress (Admin Dashboard): User logs in via NextAuth.js (GitHub OAuth) → session includes user ID, current org ID, role → Server Component fetches data from database (org-scoped queries from both public and aicd_admin schemas) → render dashboard with jobs, PRs, analytics → Client Components for interactive forms (agent config, policies) → Server Actions for mutations
  • Cross-Schema Data Flow: Frontend reads tenant config from aicd_admin.agents → backend reads agent config when processing jobs (cross-schema query) → frontend displays job status from public.jobs (read-only) → backend never writes to aicd_admin schema → frontend never writes to public schema (clear ownership)
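To illustrate the ingress step, a minimal sketch of a webhook handler along these lines, assuming the go-github v57 and Asynq libraries listed under Architecture. The handler shape, the payload fields, and the hard-coded ai-task label check are illustrative rather than the production implementation, and the tenant lookup is omitted for brevity.

// Sketch of the ingress path: verify the webhook signature, filter to
// labeled "issue opened" events, enqueue a triage job, and return 200 fast.
package api

import (
    "context"
    "encoding/json"
    "net/http"
    "os"
    "time"

    "github.com/google/go-github/v57/github"
    "github.com/hibiken/asynq"
)

type TriagePayload struct {
    InstallationID int64  `json:"installation_id"`
    RepoFullName   string `json:"repo_full_name"`
    IssueNumber    int    `json:"issue_number"`
}

type WebhookHandler struct {
    queue *asynq.Client
}

func (h *WebhookHandler) HandleGitHub(w http.ResponseWriter, r *http.Request) {
    // Reject requests whose HMAC signature does not match the webhook secret.
    payload, err := github.ValidatePayload(r, []byte(os.Getenv("GITHUB_WEBHOOK_SECRET")))
    if err != nil {
        http.Error(w, "invalid signature", http.StatusUnauthorized)
        return
    }

    event, err := github.ParseWebHook(github.WebHookType(r), payload)
    if err != nil {
        http.Error(w, "unsupported event", http.StatusBadRequest)
        return
    }

    issueEvent, ok := event.(*github.IssuesEvent)
    if !ok || issueEvent.GetAction() != "opened" || issueEvent.GetIssue() == nil {
        w.WriteHeader(http.StatusOK) // ignore everything except newly opened issues
        return
    }

    // Only explicitly labeled issues enter the pipeline (label name illustrative).
    labeled := false
    for _, l := range issueEvent.GetIssue().Labels {
        if l.GetName() == "ai-task" {
            labeled = true
            break
        }
    }
    if !labeled {
        w.WriteHeader(http.StatusOK)
        return
    }

    body, _ := json.Marshal(TriagePayload{
        InstallationID: issueEvent.GetInstallation().GetID(),
        RepoFullName:   issueEvent.GetRepo().GetFullName(),
        IssueNumber:    issueEvent.GetIssue().GetNumber(),
    })

    // Hand off to the async pipeline so the webhook response stays fast.
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()
    if _, err := h.queue.EnqueueContext(ctx, asynq.NewTask("triage", body)); err != nil {
        http.Error(w, "enqueue failed", http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusOK)
}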

🛡️Security

Threat Model
  • Multi-Tenant Data Leaks: attacker accesses another organization's repos, jobs, or PRs; malicious tenant exfiltrates other tenants' data; admin user sees data outside their organization
  • Code Injection Attacks: AI generates malicious code (backdoors, exfiltration), attacker crafts issue to trigger harmful code generation, supply chain attack via dependency injection
  • API Abuse & Cost Attacks: Unlimited LLM API calls drain budget, attacker floods system with webhook spam, quota bypass to exceed usage limits
  • Token & Credential Exposure: GitHub installation tokens leaked, user-provided API keys (BYOK) exposed, database credentials in logs or error messages
  • Unauthorized GitHub Access: Attacker creates PRs in repos without permission, malicious PR auto-merged without review, repo access beyond installation scope

Controls Implemented

  • Multi-Tenant Isolation (Database): Row-Level Security (RLS) policies on all tables with tenant_id foreign keys (NOT NULL constraint), PostgreSQL session context set on every connection (SET app.current_tenant_id), audit logs for all data access. Application Level: Authorization middleware checks org membership, resource access validated against current user's org, frontend filters all queries by organizationId, backend validates tenant_id on every request. Verification: Manual penetration testing (attempted cross-tenant access - all blocked), integration tests verify RLS enforcement, audit logs reviewed for unauthorized access attempts.
  • Code Generation Safety: Policy Engine (label allowlist only processes issues with specific labels like "ai-task", path filtering with glob patterns for allowed/denied file modifications, change limits for max files/lines changed per PR, approval requirements force PR review before merge). AI Pipeline Controls: Review stage self-assesses generated code for security issues, human review required before merge (no auto-merge), audit trail of every LLM call (prompt hash, response, cost), circuit breaker stops jobs after repeated failures. GitHub Safeguards: No force push capability, no auto-merge (PRs always require human approval), branch protection rules enforced, installation scope limits (only repos explicitly enabled).
  • API & Cost Controls: Quota Enforcement (monthly action limits per plan - Free: 100, Pro: 1000, Enterprise: unlimited; soft quota warnings at 80% usage; hard quota blocks new jobs when exceeded; per-tenant usage tracking in database). Rate Limiting: Webhook endpoint throttled (max 100/min per installation), LLM API calls counted and logged, Prometheus metrics track API usage, alerts on abnormal usage patterns. Cost Monitoring: Token counting before API calls (estimate cost), budget alerts per tenant (monthly spend), prompt caching reduces duplicate calls, automatic job pause if budget exceeded.
  • Token & Credential Security: GitHub Installation Tokens (short-lived 60 minutes max, generated on-demand per job, never persisted in database, scoped to single installation). User API Keys BYOK (encrypted at rest with AES-256-GCM, ENCRYPTION_SECRET environment variable, decrypted only when needed for LLM calls, never logged or exposed in errors; see the encryption sketch after this list). Database Credentials: Environment variables only (never hardcoded), different credentials for dev/prod, connection pooling with max limits, no credentials in logs or Sentry reports.
  • Audit & Observability: LLM Call Logging (every Claude API call recorded in llm_calls table with tenant_id, agent_type, provider, model, tokens_input, tokens_output, cost_usd, prompt_hash SHA-256 not full prompt for privacy). Policy Audit Trail: All policy evaluation results logged, rejected issues include rejection reason, failed policy checks tracked with details, admin UI shows policy decision history. Job Execution Audit: State transitions logged with timestamps, error messages captured (sanitized of secrets), correlation IDs link related logs, Sentry context includes tenant_id (never crosses tenant boundaries).
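As an illustration of the BYOK credential handling above, a minimal AES-256-GCM sketch using the Go standard library. Deriving the key directly from a hex-encoded ENCRYPTION_SECRET and the function names are assumptions, not the production code.

// Minimal AES-256-GCM sketch for encrypting tenant-provided API keys at rest.
// The 32-byte key is assumed to come from the ENCRYPTION_SECRET environment
// variable (hex-encoded here for illustration); error handling is simplified.
package crypto

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "encoding/hex"
    "errors"
    "io"
    "os"
)

func loadKey() ([]byte, error) {
    key, err := hex.DecodeString(os.Getenv("ENCRYPTION_SECRET"))
    if err != nil || len(key) != 32 {
        return nil, errors.New("ENCRYPTION_SECRET must be 32 bytes, hex-encoded")
    }
    return key, nil
}

// Encrypt returns nonce||ciphertext so the nonce can be recovered on decrypt.
func Encrypt(plaintext []byte) ([]byte, error) {
    key, err := loadKey()
    if err != nil {
        return nil, err
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
        return nil, err
    }
    return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// Decrypt splits off the nonce and authenticates before returning plaintext.
func Decrypt(ciphertext []byte) ([]byte, error) {
    key, err := loadKey()
    if err != nil {
        return nil, err
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    if len(ciphertext) < gcm.NonceSize() {
        return nil, errors.New("ciphertext too short")
    }
    nonce, data := ciphertext[:gcm.NonceSize()], ciphertext[gcm.NonceSize():]
    return gcm.Open(nil, nonce, data, nil)
}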

Verification

  • Testing: 148+ unit tests including security scenarios, integration tests for multi-tenant isolation, manual penetration testing (cross-tenant access attempts), policy engine test suite (70+ test cases)
  • Code Review: Security-focused self-review during development, AI review agent checks generated code, human review required for all PRs (including AI-generated)
  • Monitoring: Sentry alerts on authorization failures, Prometheus metrics track failed auth attempts, daily audit log reviews, quota violation alerts
  • Compliance Considerations: GDPR/CCPA alignment in Privacy Policy template, data retention policies documented, user data deletion workflow planned. ⚠️ Legal templates require lawyer review before production.

⚙️Operations

Observability

  • Error Tracking (Sentry): environment separation (dev/production), context propagation with tenant_id, job_id, and correlation_id included in all events, breadcrumbs track function calls leading to errors, release tracking with Git SHA, user context includes org ID (never PII), performance monitoring with transaction tracing for slow endpoints
  • Prometheus (30+ metrics; a registration sketch follows this list): Job Metrics (aicd_jobs_total counter by status/tenant, aicd_job_duration_seconds histogram by stage, aicd_active_jobs gauge). LLM Metrics (aicd_llm_calls_total counter by provider/model/agent, aicd_llm_tokens_total counter by type input/output, aicd_llm_cost_usd_total counter by tenant/model, aicd_llm_duration_seconds histogram for API latency). GitHub Metrics (aicd_github_api_calls_total counter by endpoint, aicd_prs_created_total counter by tenant/repo, aicd_github_rate_limit_remaining gauge). Queue Metrics (aicd_queue_size gauge for pending jobs, aicd_queue_processing_time histogram for time in queue, aicd_worker_active gauge). Database Metrics (aicd_db_connections gauge for pool usage, aicd_db_query_duration histogram for query performance).
  • Grafana Dashboards (15 panels): Job pipeline overview (success rate, duration, volume), LLM cost tracking (daily spend, tokens, cost per job), GitHub API usage (rate limits, PR creation rate), queue health (backlog size, worker utilization), database performance (query time, connection pool)
  • Structured Logging: Correlation ID on every request (traces webhook → jobs → PR), context fields (tenant_id, job_id, issue_id, repo_id), log levels DEBUG (dev) INFO (prod) WARN ERROR, JSON format for parsing, no secrets in logs (sanitized)
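A minimal sketch of how a few of the metrics above could be registered with the Prometheus Go client; the metric names match the list above, while the label sets and bucket choices are assumptions.

// Sketch of Prometheus metric registration for the pipeline; metric names
// match the observability list above, label sets are assumptions.
package monitoring

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    JobsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "aicd_jobs_total",
        Help: "Pipeline jobs by final status and tenant.",
    }, []string{"status", "tenant"})

    JobDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "aicd_job_duration_seconds",
        Help:    "Per-stage job duration.",
        Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s .. ~68m
    }, []string{"stage"})

    LLMCostUSD = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "aicd_llm_cost_usd_total",
        Help: "Accumulated LLM spend in USD.",
    }, []string{"tenant", "model"})

    QueueSize = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "aicd_queue_size",
        Help: "Pending jobs waiting for a worker.",
    })
)

// Example usage inside a worker:
//   timer := prometheus.NewTimer(JobDuration.WithLabelValues("triage"))
//   defer timer.ObserveDuration()
//   JobsTotal.WithLabelValues("completed", tenantID).Inc()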

Incident Response

  • Common Incidents: (1) LLM API Timeout - Detection: Sentry alert, Prometheus metric spike. Response: Job automatically retried (exponential backoff). Fallback: After 3 failures, circuit breaker opens (pause tenant jobs; a simplified breaker sketch follows this list). Resolution: Manual investigation if persistent (check Anthropic status page).
  • (2) GitHub API Rate Limit - Detection: Prometheus aicd_github_rate_limit_remaining < 100. Response: Automatic job queuing until rate limit resets. Notification: Slack alert to admin. Prevention: Conditional requests (ETags), caching.
  • (3) Database Connection Pool Exhausted - Detection: Sentry "too many connections" errors. Response: Temporary request throttling, connection cleanup. Resolution: Increase pool size or optimize long-running queries. Prevention: Connection pool monitoring, query timeouts.
  • (4) Webhook Flood - Detection: Prometheus aicd_webhooks_total spike. Response: Rate limiting kicks in (max 100/min per tenant). Resolution: Investigate source (malicious vs. legitimate spike). Prevention: Webhook signature validation, tenant quotas.
  • (5) Queue Backlog - Detection: Prometheus aicd_queue_size > 100. Response: Scale worker pool (increase concurrency). Resolution: Process backlog, investigate cause (slow jobs, API issues). Prevention: Auto-scaling workers (planned for production).
  • On-Call Procedures: Sentry high-severity alerts go to Slack, daily summary of errors and metrics, weekly review of failed jobs and policy violations, monthly cost analysis per tenant
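The circuit breaker referenced above can be sketched as a small per-tenant state machine; the thresholds (3 failures, 15-minute cool-down) and structure below are assumptions for illustration, not the production component.

// Simplified per-tenant circuit breaker: after N consecutive failures the
// breaker opens and new jobs for that tenant are paused until a cool-down
// has passed. Thresholds and structure are illustrative.
package orchestrator

import (
    "sync"
    "time"
)

type CircuitBreaker struct {
    mu          sync.Mutex
    failures    map[int64]int       // consecutive failures per tenant
    openedAt    map[int64]time.Time // when the breaker opened
    maxFailures int
    coolDown    time.Duration
}

func NewCircuitBreaker() *CircuitBreaker {
    return &CircuitBreaker{
        failures:    make(map[int64]int),
        openedAt:    make(map[int64]time.Time),
        maxFailures: 3,
        coolDown:    15 * time.Minute,
    }
}

// Allow reports whether new jobs for the tenant may run right now.
func (cb *CircuitBreaker) Allow(tenantID int64) bool {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    opened, isOpen := cb.openedAt[tenantID]
    if !isOpen {
        return true
    }
    if time.Since(opened) > cb.coolDown {
        // Half-open: let the next job through and reset the counter.
        delete(cb.openedAt, tenantID)
        cb.failures[tenantID] = 0
        return true
    }
    return false
}

// RecordResult updates the failure count and opens the breaker if needed.
func (cb *CircuitBreaker) RecordResult(tenantID int64, err error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if err == nil {
        cb.failures[tenantID] = 0
        return
    }
    cb.failures[tenantID]++
    if cb.failures[tenantID] >= cb.maxFailures {
        cb.openedAt[tenantID] = time.Now()
    }
}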

Cost Controls

  • LLM API Costs: Usage Tracking (token counting before API calls to estimate cost, every call logged in llm_calls table with cost_usd, daily/monthly spend calculated per tenant, Prometheus metrics for real-time cost tracking). Quota Enforcement (Free plan: 100 actions/month ~$5-15 LLM cost, Pro plan: 1,000 actions/month ~$50-150 LLM cost, Enterprise: Unlimited with budget alerts, soft quota warnings at 80%, hard quota blocks new jobs when exceeded). Optimization Strategies: Prompt caching for repeated contexts (e.g., repo README), batching process multiple small issues together (planned), model selection use Haiku for simple tasks Sonnet for complex, context pruning only send relevant code not entire repo. A quota-check sketch follows this list.
  • Infrastructure Costs: Development (Docker Compose local $0, PostgreSQL local $0, Redis local $0, Qdrant local $0). Production Planned: Hetzner Cloud VPS $10-20/month, PostgreSQL managed DB $15-30/month, Redis managed instance $10/month, Qdrant Cloud $0 free tier then $50+/month. Total estimated: $35-110/month for infrastructure, LLM costs variable $100-500/month depending on usage.
  • Monitoring Costs: Sentry free tier (5k events/month, then $26/month), Prometheus self-hosted ($0), Grafana self-hosted ($0). Total: $0-26/month.
  • Total Operating Cost Estimate: Infrastructure $35-110/month + LLM API $100-500/month (variable, usage-dependent) + Monitoring $0-26/month = $135-636/month total estimated operating cost. LLM costs scale with usage; quota enforcement prevents runaway spending.
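A sketch of the soft/hard quota check described above, assuming a usage store keyed by tenant and calendar month. The interfaces and thresholds mirror the description (80% warning, 100% block), but the names are illustrative.

// Sketch of quota enforcement: warn at 80% of the plan's monthly actions,
// block new jobs at 100%. Store/notifier interfaces are illustrative.
package quota

import (
    "context"
    "errors"
    "fmt"
)

type UsageStore interface {
    // ActionsThisMonth returns how many billable actions the tenant has used.
    ActionsThisMonth(ctx context.Context, tenantID int64) (int, error)
}

type Notifier interface {
    Warn(ctx context.Context, tenantID int64, msg string)
}

type Enforcer struct {
    store    UsageStore
    notifier Notifier
}

// ErrQuotaExceeded signals that the job must not be enqueued.
var ErrQuotaExceeded = errors.New("monthly quota exceeded")

// Check returns nil when a new job may start; monthlyLimit <= 0 means unlimited.
func (e *Enforcer) Check(ctx context.Context, tenantID int64, monthlyLimit int) error {
    if monthlyLimit <= 0 {
        return nil // Enterprise: unlimited, budget alerts handled elsewhere
    }
    used, err := e.store.ActionsThisMonth(ctx, tenantID)
    if err != nil {
        return fmt.Errorf("load usage: %w", err)
    }
    switch {
    case used >= monthlyLimit:
        return ErrQuotaExceeded // hard quota: block new jobs
    case used*10 >= monthlyLimit*8: // at or above 80%, without floating point
        e.notifier.Warn(ctx, tenantID,
            fmt.Sprintf("quota at %d/%d actions this month", used, monthlyLimit))
    }
    return nil
}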

📊Results

Metric | Before | After | Notes
Production Deployment | No automation | 1-5 early adopter organizations | ⚠️ Limited production data (~2 months operation)
Issues Ingested (Labeled) | 0 | ~50-100 labeled issues | Only explicitly labeled issues entered the pipeline (e.g., ai-task label)
Eligible Issues | 0 | ~42-90 eligible (85-90%) | Policy + triage enforced ticket hygiene; ~10-15% rejected for insufficient scope, missing acceptance criteria, or policy violations (would also block human engineers)
Pull Requests Created | 0 | ~35-85 PRs generated | ~85-95% of eligible issues → complete PR (code + tests + description); high success rate enabled by well-scoped, labeled issues with clear acceptance criteria
Pull Requests Merged | 0 | ~28-75 PRs merged | ~80-90% merge rate after mandatory human review
Issue → PR Time | Days (manual) | ~6-12 hours median | Significant time reduction for routine fixes
Transient Stage Errors | N/A | ~10-15% of runs | LLM/API timeouts, GitHub rate limits, parsing; resolved via retry/backoff
Terminal Failures | N/A | Rare (~2-5%) | Jobs that ended without a PR after max retries
Human Review | 100% manual | 100% PR review required | Zero auto-merge; mandatory human approval
Multi-Tenant Security | N/A | RLS + app-level isolation | Zero cross-tenant data leaks in testing/production
Observability | None | Sentry + Prometheus + Grafana | Production-grade monitoring from day one
Secondary Outcomes
  • Engineering Skill Development: Deep expertise in LLM integration and prompt engineering, production multi-tenant SaaS architecture experience, full-stack Go + Next.js proficiency, vector embeddings and semantic search implementation, advanced PostgreSQL (RLS, dual-schema patterns), production observability (Sentry, Prometheus, Grafana)
  • Portfolio Impact: Demonstrates ability to build complex systems solo, shows architectural decision-making and trade-offs, proves rapid execution capability (1-2 months to production), highlights modern tech stack expertise, evidence of production-ready engineering (security, testing, monitoring)
  • Reusability: Established a repeatable pattern for policy-controlled automation (multi-tenant, quotas, audit trails) applicable to other internal systems.
  • Team Impact: Engineering team freed from repetitive manual tasks, can focus on complex problems requiring human creativity, learning from AI-generated code patterns, maintained quality via PR review process
  • Long-Term Value: Reusable AI pipeline architecture for future projects, understanding of LLM costs and optimization strategies, multi-tenant SaaS patterns applicable to other ideas, production infrastructure and deployment experience

💡Lessons Learned

What Worked

  • Dual-Schema Database Pattern: The separation of public (backend) and aicd_admin (frontend) schemas was crucial. This pattern eliminated ORM collisions between Go's database/sql and Prisma, provided clear ownership boundaries (backend owns operational data, frontend owns config), enabled both services to access shared database efficiently, made cross-schema queries explicit and intentional. Reusable Principle: When building full-stack apps with different backend/frontend tech, consider schema separation instead of fighting ORM conflicts.
  • Multi-Stage AI Pipeline with Gates: Breaking the workflow into 5 distinct stages (triage → plan → generate → review → PR) rather than one monolithic "generate code" step provided quality control gates at each stage (can fail early, cheaply), better error handling (know exactly which stage failed), easier debugging (inspect intermediate outputs), cost optimization (don't generate code for rejected issues). Reusable Principle: AI agents should be pipelines, not single-shot. Each stage can validate, fail safely, and provide feedback to improve downstream steps.
  • Human-in-the-Loop from Day One: Requiring human review for all PRs (never auto-merge) was essential for building trust with users (safety net for AI mistakes), providing learning opportunity (see what AI generates), maintaining code quality standards, enabling gradual rollout without risk of catastrophic failures. Reusable Principle: For autonomous systems handling critical tasks (code, infrastructure, data), always include human oversight. Augmentation > full automation.
  • Production-Grade Observability from Start: Integrating Sentry, Prometheus, and Grafana from day one (not "we'll add monitoring later") was critical for catching bugs during development (not in production), enabling confident rollout (visibility into system health), making debugging vastly easier (correlation IDs, context propagation), providing cost visibility (LLM spend tracking). Reusable Principle: For solo engineers, monitoring is even more critical. You can't watch everything - instrumentation must watch for you.
  • Policy Engine for Safety: The policy engine (label filters, path filters, change limits) provided configurable safety boundaries without code changes, tenant control over AI behavior, defense against malicious issues crafted to trigger harmful code, audit trail of policy decisions. Reusable Principle: AI systems need configurable constraints. Business logic in database/config, not hardcoded.
  • Testing as Core Workflow (Not Afterthought): Writing 148+ tests during development (not after) ensured confidence in refactoring (didn't break existing features), faster debugging (tests pinpoint failures), documentation via test cases (shows intended behavior), solo engineer safety net (no QA team to catch bugs). Reusable Principle: For solo projects, tests replace the second pair of eyes. Write tests as you go, not "later."

What I Would Do Differently

  • Earlier User Feedback: Built Phase 1-3 (~3 weeks) before showing to any users. Assumed requirements, validated late. What I'd do differently: Deploy Phase 1 (basic webhook → PR) to internal testing immediately (week 1). Get feedback on UX, priorities, pain points earlier. Would have adjusted Phase 2-3 based on real usage, not assumptions. Learning: Even solo engineers benefit from early validation. Deploy crude but functional version ASAP.
  • More Granular Cost Estimation: Estimated LLM costs at the project level ("~$100-500/month"). Didn't model per-job costs until late in the project, which caused some anxiety about unexpected bills. What I'd do differently: Build cost estimation and tracking in Phase 1. Token counting, per-job cost attribution, budget alerts from day one. Would have informed model selection and caching strategy earlier. Learning: For LLM-heavy projects, cost tracking is infrastructure, not a feature. Build it first.
  • Database Migration Strategy Earlier: First 3-4 migrations were messy (schema changes, then realizing a field was missing, then another migration). Didn't plan schema evolution carefully. What I'd do differently: Spend an extra day up front designing the full schema with room for growth. Fewer migrations, cleaner history. Consider Prisma for the backend too (would simplify dual-schema coordination). Learning: Database schema is hard to change. Over-design upfront, especially for multi-tenant (adding columns later is painful once you have data).
  • Document Architectural Decisions as I Go: Made many architectural decisions (dual-schema, 5-stage pipeline, RLS approach) but only documented them at the end, when writing this case study. Lost some context and rationale. What I'd do differently: Lightweight Architecture Decision Records (ADRs) after each major choice. Just 1-2 paragraphs: context, decision, consequences. Would make this case study easier to write, and help future me remember "why did I do this?" Learning: Solo engineers need documentation even more. You will forget your own decisions in 2 months.
  • Simpler Frontend Initially: Built full Next.js admin dashboard with Server Components, Prisma, NextAuth.js, complex UI. Took ~1.5 weeks (Phase 3). Could have shipped with simpler read-only dashboard or even CLI tool first. What I'd do differently: Phase 1-2 focus on backend pipeline. Simple CLI tool or read-only dashboard (just show job status). Phase 3+ add full admin UI once core pipeline proven. Learning: For solo engineers, sequence complexity carefully. Get core value prop working, then add UI polish.
  • Load Testing Earlier: Didn't load test until the final phase. Discovered a database connection pool sizing issue, slow queries on large repos, and queue backlog under spike load. What I'd do differently: Simple load testing in Phase 2 (after the pipeline works). Simulate 100 concurrent webhooks, large repos, quota edge cases. Would have caught performance issues earlier. Learning: Load testing doesn't require production scale. Even modest load tests reveal bottlenecks.

Playbook (Reusable Principles)

  • For Solo Engineers Building Complex Systems: (1) Start with Monitoring - Sentry + basic metrics on day one. You can't debug what you can't see. Correlation IDs in every log. (2) Test as You Go - Tests are your safety net. No QA team to catch bugs. Write tests during development, not after. (3) Ship Early, Iterate - Deploy crude-but-functional version ASAP. Real usage reveals wrong assumptions faster than planning. (4) Document Decisions - Lightweight ADRs. Future you will thank present you. Just 1-2 paragraphs per major choice. (5) Modular Architecture - Clean package boundaries. Easier to reason about, test, and refactor. 11 internal packages > 1 monolithic codebase. (6) Observability > Features - If you can't see it's broken, you can't fix it. Logs, metrics, alerts are infrastructure, not optional.
  • For AI/LLM Integration Projects: (1) Multi-Stage Pipelines - Break workflow into stages with quality gates. Fail early, fail cheaply. Don't generate code for invalid inputs. (2) Human-in-the-Loop - Never auto-commit AI output for critical tasks. Review step builds trust and maintains quality. (3) Cost Tracking from Day One - Token counting, per-job costs, budget alerts. LLM costs can spiral. Instrumentation prevents surprises. (4) Prompt Caching - Reuse common contexts (repo README, coding guidelines). Massive cost savings for repeated prompts. (5) Structured Output - Ask LLMs for JSON, not prose. Easier to parse, validate, test. Handle parsing failures gracefully (see the parsing sketch after this list). (6) Context is King - Semantic search (embeddings) > dumping entire codebase. Relevant context = better code, lower cost.
  • For Multi-Tenant SaaS: (1) Row-Level Security (RLS) - Database-level isolation is strongest guarantee. App-level checks as second layer, not primary defense. (2) Tenant ID Everywhere - Every table, every query, every log. Make it impossible to forget. Correlation in Sentry/logs. (3) Quota Enforcement - Soft warnings (80%), hard limits (100%), automatic pausing. Prevents abuse, controls costs. (4) Audit Everything - LLM calls, policy decisions, job state transitions. Accountability and debugging. (5) Short-Lived Tokens - Never persist GitHub installation tokens. Generate on-demand, use once, discard.
  • For Building Portfolio Projects: (1) Production-Ready > Feature-Rich - Better to have fewer features with monitoring, tests, docs than many features that break silently. (2) Document as You Build - This case study would be 10x harder to write after 6 months. Capture decisions, metrics, challenges in real-time. (3) Real Users, Even 1-5 - Shipping to real users (even friends) reveals issues demos won't. Validates you built something useful. (4) Measure Outcomes - Track metrics from start. "Reduced manual effort" is vague. "Automated 80% of routine bugs" is concrete. (5) Share Challenges, Not Just Success - Portfolio case studies showing "what I learned from failures" are more credible than "everything was perfect."
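As a small illustration of the "structured output" principle in the AI/LLM playbook above, here is a sketch of defensively parsing an LLM's JSON triage response. The response shape and helper name are assumptions, not the production schema.

// Sketch of defensive parsing of a structured LLM response. The model is asked
// to return JSON matching this shape; anything that fails to parse or validate
// is treated as a retryable failure rather than being trusted.
package llm

import (
    "encoding/json"
    "fmt"
    "strings"
)

type TriageResult struct {
    IsActionable bool   `json:"is_actionable"`
    Reason       string `json:"reason"`
    Complexity   string `json:"complexity"` // "small" | "medium" | "large"
}

func ParseTriageResult(raw string) (*TriageResult, error) {
    // Models sometimes wrap JSON in code fences; strip them before decoding.
    raw = strings.TrimSpace(raw)
    raw = strings.TrimPrefix(raw, "```json")
    raw = strings.TrimPrefix(raw, "```")
    raw = strings.TrimSuffix(raw, "```")

    var result TriageResult
    dec := json.NewDecoder(strings.NewReader(raw))
    dec.DisallowUnknownFields() // surface schema drift instead of ignoring it
    if err := dec.Decode(&result); err != nil {
        return nil, fmt.Errorf("parse triage result: %w", err)
    }
    if result.Reason == "" {
        return nil, fmt.Errorf("triage result missing reason")
    }
    return &result, nil
}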

📦Artifacts

Selected Technical Details

// Multi-Stage AI Pipeline State Machine
type JobStatus string

const (
    JobStatusPending     JobStatus = "pending"
    JobStatusTriaging    JobStatus = "triaging"
    JobStatusPlanning    JobStatus = "planning"
    JobStatusGenerating  JobStatus = "generating"
    JobStatusReviewing   JobStatus = "reviewing"
    JobStatusCreatingPR  JobStatus = "creating_pr"
    JobStatusCompleted   JobStatus = "completed"
    JobStatusFailed      JobStatus = "failed"
    JobStatusRejected    JobStatus = "rejected"
)

// Triage worker (stage 1) - Quality gate before expensive operations
func (w *TriageWorker) Process(ctx context.Context, task *asynq.Task) error {
    var payload TriagePayload
    if err := json.Unmarshal(task.Payload(), &payload); err != nil {
        return fmt.Errorf("unmarshal payload: %w", err)
    }

    // Update job status to "triaging"
    if err := w.db.UpdateJobStatus(ctx, payload.JobID, JobStatusTriaging); err != nil {
        return err
    }

    // Load the issue linked to this job (accessor name illustrative)
    issue, err := w.db.GetIssueForJob(ctx, payload.JobID)
    if err != nil {
        return fmt.Errorf("load issue: %w", err)
    }

    // Evaluate against policy engine (labels, paths, change limits)
    decision, err := w.policy.Evaluate(ctx, issue)
    if err != nil {
        return err
    }

    if !decision.Allowed {
        // Policy rejected - mark job as rejected (fail early, no LLM cost)
        w.db.UpdateJobStatus(ctx, payload.JobID, JobStatusRejected)
        w.db.UpdateIssueTriageStatus(ctx, issue.ID, "rejected", decision.Reason)
        return nil // Not an error, just rejected
    }

    // Call Claude AI for triage
    triageResult, err := w.llm.Triage(ctx, issue)
    if err != nil {
        return fmt.Errorf("llm triage: %w", err)
    }

    if !triageResult.IsActionable {
        // AI rejected - mark as rejected
        w.db.UpdateJobStatus(ctx, payload.JobID, JobStatusRejected)
        w.db.UpdateIssueTriageStatus(ctx, issue.ID, "rejected", triageResult.Reason)
        return nil
    }

    // Approved - enqueue planning job (stage 2); Asynq task payloads are raw bytes
    w.db.UpdateIssueTriageStatus(ctx, issue.ID, "approved", triageResult.Reason)
    planningPayload, err := json.Marshal(payload)
    if err != nil {
        return fmt.Errorf("marshal planning payload: %w", err)
    }
    planningTask := asynq.NewTask("planning", planningPayload)
    if err := w.queue.Enqueue(ctx, planningTask); err != nil {
        return fmt.Errorf("enqueue planning: %w", err)
    }

    return nil
}

-- Row-Level Security for Multi-Tenant Isolation
-- Database-level enforcement prevents application bugs from leaking data
ALTER TABLE issues ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON issues
  USING (tenant_id = current_setting('app.current_tenant_id')::int);

// Every database connection sets the tenant context before tenant-scoped queries.
// Note: the setting is per session, so it must be applied on the same connection
// (or inside the same transaction) that runs the queries it guards.
func (db *DB) SetTenantContext(ctx context.Context, tenantID int) error {
    // tenantID is an int, so string formatting cannot inject SQL here
    query := fmt.Sprintf("SET app.current_tenant_id = %d", tenantID)
    _, err := db.pool.Exec(ctx, query)
    return err
}