2026.04.23 operations

What it actually takes to run an AI platform solo

100+ endpoints, agent orchestration, multiple active projects, zero team. The operational reality. What systems make this possible. What breaks regularly. What I'd never do again.

platform, operations, solo

Running an AI platform solo is like being a one-person orchestra. Everything has to work together, and when something breaks, the music stops.

Here’s the operational reality of maintaining 100+ endpoints, orchestrating AI agents, and shipping multiple products with no team.

The Numbers

neatworld platform:

100+ FastAPI endpoints
95+ MCP tools for Claude integration
8 AI agents with full tool access
PostgreSQL database with 47 tables
5 active projects in different states
1 person maintaining it all

Infrastructure:

Railway (backend hosting)
Mac Mini (InDesign render node)
Tailscale (networking)
GitHub Actions (CI/CD)
Claude API (95% of AI calls)
Gemini Flash (cost optimization)

Daily reality:

~200 API calls from active experiments
3-5 deployment cycles per day
12+ hours of system monitoring
Zero buffer for extended downtime

What Makes This Possible

1. Opinionated Defaults Everywhere

I don’t have time to evaluate every technical decision. Strong defaults with escape hatches:

FastAPI over Flask. Better async support, automatic OpenAPI docs, Pydantic validation.

PostgreSQL over everything. Don’t care about vendor lock-in. Postgres does what I need.

Railway over AWS. Deployment is railway up. No infrastructure management.

Claude over GPT. Tool use is cleaner, function calling more reliable.

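import subprocess

# This is my entire deployment script
def deploy():
    # check=True raises CalledProcessError if the Railway CLI exits non-zero
    subprocess.run(["railway", "up", "/tmp/neatworld-deploy", "--detach"], check=True)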

2. Aggressive Automation

Everything that happens more than twice gets automated:

Database migrations. Alembic handles schema changes. Never touch production DB directly.

Deployment pipeline. Push to main → tests run → deploy to Railway. No manual steps.

Monitoring alerts. When something breaks, I get a Slack notification with enough context to fix it.

Backup routines. Database dumps to Dropbox every 6 hours. File assets to cloud storage daily.

Health checks. Every service has a /health endpoint. Uptime Robot monitors them all.

3. Simple Architecture, Sophisticated Tooling

The core architecture is boring: FastAPI monolith with PostgreSQL. No microservices, no event sourcing, no distributed systems.

But the tooling is sophisticated:

claude-code-sdk integration. Bidirectional streaming with in-process MCP server. AI agents have access to the full platform.

Agent orchestration. 8 specialized agents that can read files, write code, analyze data, run bash commands. Each with specific personas and constraints.

Pipeline system. Predefined agent chains for complex workflows. Research → planning → execution → review.

Cost guard. Budget tracking for all AI API calls. Hard limits prevent runaway costs.
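
A cost guard of this kind can be small. The sketch below is illustrative rather than the platform's actual code: CostGuard, BudgetExceeded, and the per-call cost estimates are assumed names and inputs. It checks the budget before a call and records real spend afterwards.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a call would push spend past the hard limit."""


@dataclass
class CostGuard:
    # Hypothetical monthly cap in USD; the real limit would live in config.
    monthly_limit_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def check(self, estimated_cost_usd: float) -> None:
        """Refuse the call up front if it would exceed the hard limit."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.monthly_limit_usd:
            raise BudgetExceeded(
                f"{projected:.2f} USD would exceed the "
                f"{self.monthly_limit_usd:.2f} USD limit"
            )

    def record(self, model: str, cost_usd: float) -> None:
        """Log actual spend after the call returns."""
        self.spent_usd += cost_usd
        self.calls.append((model, cost_usd))

    @property
    def remaining_usd(self) -> float:
        return max(self.monthly_limit_usd - self.spent_usd, 0.0)
```

Calling check() before every model request and record() after it keeps the limit a hard stop rather than a report you read after the bill arrives.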

What Breaks Regularly

1. Claude API Rate Limits

When I’m actively developing, I hit rate limits fast. Multiple agents calling Claude simultaneously, each with 15-turn conversations.

Solution: Request queuing with exponential backoff. Failed requests retry automatically. Non-critical calls (like blog generation) get deprioritized.

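import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

# claude_client and RateLimitError are defined elsewhere in the platform (not shown).
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def call_claude_with_retry(prompt, tools=None):
    try:
        return await claude_client.query(prompt, tools)
    except RateLimitError:
        # Rate limits need a longer pause than tenacity's 4-10 s backoff.
        await asyncio.sleep(60)
        raise  # re-raise so tenacity counts the attempt and schedules the retry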
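
The retry wrapper covers backoff. The queuing and deprioritization half could look something like the sketch below, built on asyncio.PriorityQueue. RequestQueue and the three priority constants are hypothetical names, and a single drain loop stands in for whatever worker setup the platform really uses.

```python
import asyncio
import itertools

# Hypothetical priority levels: lower number is served first.
CRITICAL, NORMAL, BACKGROUND = 0, 1, 2


class RequestQueue:
    """Serializes model calls so one worker drains them in priority order."""

    def __init__(self) -> None:
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        # Tie-breaker keeps FIFO order within a priority level and stops
        # the queue from ever comparing two job callables.
        self._counter = itertools.count()

    async def submit(self, priority: int, job) -> asyncio.Future:
        """Enqueue an async callable; the returned future carries its result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((priority, next(self._counter), job, future))
        return future

    async def drain(self) -> None:
        """Run queued jobs one at a time until the queue is empty."""
        while not self._queue.empty():
            _, _, job, future = await self._queue.get()
            try:
                future.set_result(await job())
            except Exception as exc:  # hand the failure back to the caller
                future.set_exception(exc)
```

Blog generation would be submitted as BACKGROUND and agent turns as CRITICAL, so a burst of agent traffic pushes the non-critical calls to the back of the line.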

2. Database Connection Exhaustion

FastAPI + async + SQLAlchemy can leak connections under load, especially when long-running agent conversations open DB sessions and never close them.

Solution: Connection pooling with aggressive timeouts. Every database operation is wrapped in try/finally blocks that ensure cleanup.
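
The try/finally pattern is easiest to enforce once, in a context manager, instead of at every call site. This is a minimal sketch under stated assumptions: FakeSession stands in for a real async database session, and session_scope and run_with_timeout are hypothetical helper names.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Stand-in for an async DB session; only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def rollback(self) -> None:
        pass

    async def close(self) -> None:
        self.closed = True


@asynccontextmanager
async def session_scope(session_factory):
    """Yield a session and always close it, even on error or cancellation."""
    session = session_factory()
    try:
        yield session
    except Exception:
        await session.rollback()
        raise
    finally:
        await session.close()


async def run_with_timeout(session_factory, work, timeout_s: float = 30.0):
    """Run work(session) under a deadline so a stuck agent cannot pin a connection."""
    async with session_scope(session_factory) as session:
        return await asyncio.wait_for(work(session), timeout=timeout_s)
```

The timeout is the aggressive part: a conversation that stalls gets its session rolled back and closed instead of holding a pool slot indefinitely.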

3. Mac Mini Render Pipeline

The InDesign automation breaks constantly:

Memory leaks require nightly restarts
File locking when scripts crash
Adobe licensing issues after macOS updates
AppleScript changes between InDesign versions

Solution: Comprehensive error handling and automatic recovery. Failed renders retry once, then fall back to web-based PDF generation (uglier but reliable).
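
The retry-once-then-degrade logic fits in one function. In this sketch the two renderers are passed in as plain callables, because the real InDesign and web PDF code is not shown here; render_with_fallback and its parameters are assumed names.

```python
from typing import Callable, Tuple


def render_with_fallback(
    job_id: str,
    primary: Callable[[str], bytes],
    fallback: Callable[[str], bytes],
    retries: int = 1,
) -> Tuple[bytes, str]:
    """Try the primary renderer, retry `retries` times, then fall back.

    Returns the PDF bytes and the name of the path that produced them.
    """
    for _ in range(1 + retries):
        try:
            return primary(job_id), "primary"
        except Exception:
            # A real version would log the error and clear stale lock files here.
            continue
    return fallback(job_id), "fallback"
```

Returning which path produced the file makes it easy to count how often the ugly fallback is actually shipping.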

4. Agent Context Overflow

AI agents with tool access can generate massive conversation histories. 50+ tool calls in a single conversation. Context windows fill up, performance degrades.

Solution: Smart context management. Compress old tool results, summarize conversation history, rotate context when approaching limits.
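
Here is a rough sketch of the cheapest form of this: truncating old tool results once the history nears a token budget. It does not summarize, since that would need a model call. The message shape, the four-characters-per-token estimate, and every name in it are assumptions for illustration.

```python
from typing import Dict, List

Message = Dict[str, str]  # hypothetical shape: {"role": ..., "content": ...}


def estimate_tokens(messages: List[Message]) -> int:
    """Rough heuristic: about four characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compact_history(
    messages: List[Message],
    max_tokens: int,
    keep_recent: int = 4,
    stub_chars: int = 80,
) -> List[Message]:
    """Truncate old tool results once the history exceeds the token budget."""
    if estimate_tokens(messages) <= max_tokens:
        return messages
    cutoff = max(len(messages) - keep_recent, 0)
    compacted = []
    for index, message in enumerate(messages):
        is_old_tool_result = index < cutoff and message["role"] == "tool"
        if is_old_tool_result and len(message["content"]) > stub_chars:
            stub = message["content"][:stub_chars] + " [truncated]"
            message = {**message, "content": stub}
        compacted.append(message)
    return compacted
```

The most recent messages stay intact because the agent is usually still reasoning about them. Only tool output it has already acted on gets cut.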

Solo Development Patterns

1. Ship Small, Iterate Fast

I can’t build comprehensive features upfront. Ship the minimum viable version, watch how it’s used, then improve.

BooDo started as a simple task list. The nudge system, streak tracking, and couple dynamics came from real usage patterns.

2. Build Systems, Not Features

Every new feature becomes part of a larger system. The PDF pipeline started for weekly reports, now powers brand.ai deliverables and project documentation.

Agent tools get reused across multiple agents. The calendar tool works for scheduling, planning, and briefing generation.

3. Document Everything

I’m the only person who knows how this works. If I get sick for a week, nothing gets fixed.

Solution: Obsessive documentation. README files for every module. Architecture decision records. Runbook for common failures.

The CLAUDE.md file in the repo is 1,200+ lines of context. Every agent, every endpoint, every deployment process.

4. Monitor Business Metrics, Not Just System Metrics

Uptime doesn’t matter if the AI agents are giving bad advice. Response time is irrelevant if the cost per query is unsustainable.

Metrics that actually matter:

Claude API success rate (target: 95%+)
Average cost per user interaction
Agent task completion rate
Pipeline success rate (end-to-end)
User engagement with AI-generated content

Cost Management

Running an AI platform solo means I pay for every API call personally. Cost optimization isn’t optional.

Three-tier routing:

Ollama (local, free) for simple tasks and monitoring
Gemini Flash (~$15/month) for planning and analysis
Claude for complex reasoning and tool use

Request classification: Route requests to the cheapest model that can handle them. Only escalate to Claude when necessary.
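
A request classifier does not need to be clever to save money. The sketch below uses keyword hints and a tool-use flag. Both are placeholder heuristics rather than the platform's real rules, and Tier, classify, and escalate are hypothetical names.

```python
from enum import IntEnum


class Tier(IntEnum):
    LOCAL = 0   # Ollama
    FLASH = 1   # Gemini Flash
    CLAUDE = 2  # Claude


# Hypothetical keyword heuristics; a real classifier would be tuned on logs.
PLANNING_HINTS = ("plan", "analyze", "summarize", "compare")


def classify(prompt: str, needs_tools: bool = False) -> Tier:
    """Pick the cheapest tier that can plausibly handle the request."""
    if needs_tools:
        return Tier.CLAUDE
    text = prompt.lower()
    if any(hint in text for hint in PLANNING_HINTS):
        return Tier.FLASH
    return Tier.LOCAL


def escalate(tier: Tier) -> Tier:
    """Move one tier up after a failed or low-quality answer."""
    return Tier(min(tier + 1, Tier.CLAUDE))
```

Starting low and escalating on failure means a misclassified request costs one cheap wasted call, not a Claude call that was never needed.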

Caching strategies: Identical prompts get cached responses. Similar prompts get template-based responses with variable substitution.
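
Both halves of that can be sketched with the standard library. PromptCache is an assumed name for an exact-match cache keyed on a hash of model and prompt, and STATUS_TEMPLATE is a made-up example of answering from a template without calling a model at all.

```python
import hashlib
from string import Template
from typing import Callable, Dict


class PromptCache:
    """Caches responses keyed by a hash of the model name and exact prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call: Callable[[str], str]) -> str:
        """Return the cached response, or invoke call(prompt) and store it."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = call(prompt)
        self._store[key] = response
        return response


# Template path: fill variables locally instead of calling a model at all.
STATUS_TEMPLATE = Template("Project $name is $state.")  # hypothetical template
```

An in-memory dict is lost on every deploy, so a version meant to survive 3-5 deployments a day would need the store in the database or on disk.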

Budget alerts: Hard stops when monthly spend hits $100. The platform degrades gracefully to free/local models only.

What I’d Never Do Again

1. Multiple Database Connections

Early architecture had separate database connections for different services. Connection pool exhaustion, transaction coordination nightmare, debugging hell.

Better: Single database connection pool, shared across all services. Simpler, more reliable, easier to monitor.

2. Complex Agent Handoffs

Original pipeline system had agents passing work to each other with complex state management. Debugging conversation chains was impossible.

Better: Linear pipelines with simple input/output contracts. Agent A finishes, passes structured output to Agent B.
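
A linear pipeline with a simple contract might look like the sketch below. Each stage is a plain function from text to text, standing in for an agent call, and StageResult and run_pipeline are assumed names, not the platform's own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class StageResult:
    """Structured output that one stage hands to the next."""
    stage: str
    output: str


# Each stage is a (name, function) pair; the function maps input text to output text.
Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(stages: List[Stage], task: str) -> List[StageResult]:
    """Run stages strictly in order; each sees only the previous output."""
    results: List[StageResult] = []
    current = task
    for name, run in stages:
        current = run(current)
        results.append(StageResult(stage=name, output=current))
    return results
```

Because every stage's output is recorded, a bad final result can be traced back to the first stage that went wrong by reading the list from the top.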

3. Feature Flags for Core Infrastructure

I used feature flags to control whether agents were enabled, which database to use, how to handle authentication. Core systems shouldn’t be configurable.

Better: Feature flags for user-facing features only. Infrastructure decisions are permanent until explicitly migrated.

4. Building My Own User Management

Authentication, password resets, session management, email verification. None of this is differentiating, and all of it breaks constantly.

Better: Use Clerk or Auth0 from day one. Focus on the unique value, not commodity infrastructure.

Current Bottlenecks

1. Me. I’m the limiting factor. Code reviews, deployment approvals, architecture decisions, user support. Can’t scale past my own capacity.

2. Claude API costs. Usage grows faster than revenue. Need smarter routing and caching to make unit economics work.

3. Database performance. Single PostgreSQL instance. Complex queries slow down as data grows. Need read replicas and query optimization.

4. Agent reliability. AI agents fail in unpredictable ways. Need better error handling, retry logic, and fallback strategies.

What’s Working

1. Aggressive simplicity. Boring tech choices, simple architecture, minimal dependencies. When something breaks, it’s usually obvious why.

2. AI-first development. Using Claude agents for research, planning, and code generation. I build faster because I’m not building alone.

3. Platform thinking. Every project benefits from shared infrastructure. The PDF pipeline, agent system, and database serve multiple products.

4. Real usage. I use every feature I build. BooDo manages my own tasks, neatworld handles my own schedule, brand.ai analyzes my own design work.

The Reality

Running an AI platform solo is exhausting and exhilarating. Everything breaks, but everything is fixable. You have no team, but you have AI agents that never sleep.

The operational burden is high, but the iteration speed is unmatched. No committees, no approval processes, no coordination overhead. Idea to production in hours, not weeks.

It’s not sustainable long-term. But it’s the fastest way to figure out what actually works.

When something breaks at 2 AM, you fix it. When users report a bug, you deploy the fix in 10 minutes. When you have an idea, you ship it before lunch.

That’s the tradeoff. Total responsibility for total control.

Some days it’s worth it.