Neon
Agent Ops platform that combines durable workflow execution (Temporal) with real-time observability (ClickHouse) to make AI agents reliable enough for production.
Define Stage
Declarative agent configuration: system prompts, tool schemas, guardrails, and evaluation criteria. Version-controlled alongside code.
Execute Stage
Durable workflow execution via Temporal. Agent runs survive crashes, rate limits, and restarts. Every tool call is a recoverable checkpoint.
Observe Stage
Full OpenTelemetry instrumentation: every LLM call, tool invocation, and decision point emits structured traces and spans.
Evaluate Stage
Async evaluation workers score agent outputs against defined criteria. Runs factuality, relevance, and safety checks in parallel.
Optimize Stage
Feedback from evaluations drives prompt optimization, tool selection refinement, and guardrail tuning. Closes the improvement loop.
Agent Runtime
LLM-powered agent with structured tool access. Supports multi-turn conversations, parallel tool calls, and streaming responses.
OTel Collector
OpenTelemetry collector aggregates traces, metrics, and logs from all agent instances. Batches and routes to downstream storage.
Ingestion API
High-throughput API gateway that validates, transforms, and routes telemetry data. Handles backpressure with async buffering.
ClickHouse
Columnar analytics database optimized for time-series trace data. Sub-second queries over billions of spans with materialized views.
Real-time Dashboard
Live monitoring of agent performance: latency percentiles, token usage, error rates, and evaluation scores. Drill into individual traces.
Temporal Workers
Durable execution engine. Workflows persist state across failures — if an agent crashes mid-tool-call, it resumes exactly where it left off.
Evaluation Workers
Distributed eval jobs that score agent outputs. Support custom rubrics, factuality checks, and regression testing against golden datasets.
The Problem
AI agents fail in ways that are hard to debug and impossible to reproduce.
A customer reports that an agent gave a wrong answer. You check the logs—nothing. The conversation history is gone, the intermediate tool calls vanished, and you’re left guessing which of the 47 LLM calls went sideways. Even if you find the issue, you can’t replay it because the agent hit a rate limit, crashed mid-execution, or the external API returned something different.
Traditional observability tools (Langfuse, Braintrust) help you see failures after they happen. But agents need something different:
- Durability — When an agent crashes at step 15 of 20, it should resume from step 15, not restart.
- Time-travel debugging — Replay the exact sequence of events that led to a failure.
- Systematic evaluation — Catch regressions before deploy, not after customers complain.
That’s what Neon does.
How It Works
Neon operates in two modes depending on your constraints:
Mode 1: Observe-Only (Bring Your Own Agent)
Your agents run wherever they run—Lambda, Cloud Run, K8s. You instrument them with OpenTelemetry, and Neon ingests the traces:
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

@tracer.start_as_current_span("agent-run")
async def run_agent(query: str):
    # Your existing agent code, unchanged
    response = await llm.generate(query)
    return response
Every LLM call, tool invocation, and retrieval gets captured with full inputs/outputs. ClickHouse stores it all with sub-100ms query latency, even at millions of traces.
Mode 2: Managed Execution (Temporal)
For agents that need durability, run them inside Neon’s Temporal workflows:
export async function agentWorkflow(params: AgentInput) {
  // Step 1: Call LLM
  const plan = await llmCall({ model: 'claude-3-5-sonnet', messages: params.messages });

  // Step 2: Execute tools (each is durable)
  const results = [];
  for (const tool of plan.tools) {
    results.push(await executeToolActivity(tool)); // Survives crashes
  }

  // Step 3: Human approval gate (optional)
  if (params.requiresApproval) {
    await condition(() => approvalReceived, '7 days'); // approvalReceived is set by a signal handler
  }

  return results;
}
If the process crashes at step 2, Temporal resumes from exactly where it left off. Rate limited? It retries with backoff. Need human approval? The workflow pauses for up to 7 days and resumes when approved.
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ NEON PLATFORM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ DEFINE │────▶│ EXECUTE │────▶│ OBSERVE │────▶│ EVALUATE │ │
│ │ │ │ │ │ │ │ │ │
│ │ • Agents │ │ • Temporal │ │ • ClickHouse│ │ • SDK │ │
│ │ • Test cases│ │ • Workers │ │ • Real-time │ │ • Scorers │ │
│ │ • Scorers │ │ • Durable │ │ • OTel │ │ • CI/CD │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ └───────────────────┴───────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ OPTIMIZE │ │
│ │ │ │
│ │ • A/B compare │ │
│ │ • Regression │ │
│ │ • Insights │ │
│ └───────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐
│ Your Agent │─OTel─│ /api/traces │─────▶│ ClickHouse │─────▶│Dashboard│
│ (anywhere) │ │ ingestion │ │ (storage) │ │ (UI) │
└──────────────┘ └──────────────┘ └──────────────┘ └─────────┘
│
▼
┌──────────────┐ ┌──────────────┐
│ Temporal │◀────▶│ Eval Workers │
│ (durable) │ │ (scorers) │
└──────────────┘ └──────────────┘
Why These Technologies
ClickHouse for Trace Storage
The problem with trace storage is scale. A single agent run might generate 50+ spans, each with full LLM inputs/outputs (easily 10KB+ per span). At 1000 runs/day, you’re storing 500MB of trace data daily. Traditional databases choke.
ClickHouse handles this because:
- Columnar storage: Queries that filter by `project_id` and `timestamp` only read those columns, not the massive `input`/`output` text fields.
- Compression: 10-20x compression on text data. That 500MB becomes 25-50MB on disk.
- Materialized views: Pre-aggregate daily stats, score trends, and percentiles. Dashboard queries hit small summary tables, not raw data.
- Skip indexes: Bloom filters on `trace_id` mean direct lookups skip 99% of data blocks.
The schema is designed for agent-specific queries:
CREATE TABLE traces (
    project_id String,
    trace_id String,
    timestamp DateTime64(3),
    status Enum8('ok' = 1, 'error' = 2),

    -- Agent context
    agent_id String,
    agent_version String,
    workflow_id String,

    -- Aggregates (updated on completion)
    total_tokens UInt64,
    total_cost Decimal(12, 6),
    llm_calls UInt16,
    tool_calls UInt16,

    INDEX idx_trace_id trace_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (project_id, timestamp, trace_id)
TTL timestamp + INTERVAL 90 DAY;
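The aggregate columns marked "updated on completion" can be rolled up from a run's spans before the completed trace row is written. A minimal sketch of that roll-up; the `Span` shape and `summarizeTrace` helper are illustrative, not Neon's actual ingestion code:

```typescript
// Illustrative span shape; output field names mirror the traces table above,
// but this helper is a sketch, not part of the platform's SDK.
interface Span {
  kind: 'llm' | 'tool' | 'other';
  tokens: number;  // prompt + completion tokens for LLM spans, else 0
  costUsd: number; // estimated cost attributed to this span
}

function summarizeTrace(spans: Span[]) {
  return {
    total_tokens: spans.reduce((sum, s) => sum + s.tokens, 0),
    total_cost: spans.reduce((sum, s) => sum + s.costUsd, 0),
    llm_calls: spans.filter((s) => s.kind === 'llm').length,
    tool_calls: spans.filter((s) => s.kind === 'tool').length,
  };
}
```

Pre-computing these roll-ups is what keeps dashboard queries on the traces table cheap: filtering and sorting never has to touch the raw span payloads.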
Temporal for Durable Execution
Temporal isn’t just “a job queue.” It’s a fundamentally different execution model.
Traditional approach:
Agent starts → Calls LLM → Process crashes → Agent restarts from beginning
Temporal approach:
Agent starts → Calls LLM (recorded) → Process crashes →
New worker picks up → Replays history → Resumes from exact position
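The replay idea can be illustrated with a toy event-history sketch. This is not the Temporal SDK; it only shows the core mechanism that completed steps are recorded in durable history and replayed rather than re-executed:

```typescript
// Toy replay sketch (not the Temporal SDK): each completed step's result is
// appended to a durable history; on restart, recorded results are replayed
// instead of re-running side effects, so execution resumes at the first gap.
type History = Map<number, string>;

async function runWorkflow(
  steps: Array<() => Promise<string>>,
  history: History, // persisted across crashes
): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    if (history.has(i)) {
      results.push(history.get(i)!); // replay: no side effect re-runs
    } else {
      const result = await steps[i](); // execute and record
      history.set(i, result);
      results.push(result);
    }
  }
  return results;
}
```

If the process dies after step 1, a new worker handed the same history re-emits step 1's recorded result and only executes step 2.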
This enables patterns that are impossible otherwise:
Human-in-the-loop: Pause a workflow for human approval, resume days later.
await condition(() => humanApproved, '7 days');
Automatic retry with backoff: Rate limited? Temporal retries with exponential backoff.
const result = await llmCall(input, {
  retry: { maxAttempts: 5, backoffCoefficient: 2 }
});
Long-running agents: Workflows can run for hours or days without holding connections open.
Next.js for API + UI
The frontend is a Next.js 15 app that handles both the dashboard UI and API routes. Why not separate services?
- Simpler deployment: One container, one deploy, one set of environment variables.
- Type sharing: TypeScript types flow from ClickHouse schema → API routes → React components.
- React 19 features: Server components for initial data fetching, streaming for trace viewer.
The API routes are thin—mostly query builders that translate REST params to ClickHouse SQL:
// app/api/traces/route.ts
export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);

  const traces = await queryTraces({
    projectId: searchParams.get('project_id'),
    status: searchParams.get('status'),
    startDate: searchParams.get('start'),
    endDate: searchParams.get('end'),
    limit: parseInt(searchParams.get('limit') || '50'),
  });

  return Response.json(traces);
}
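The `queryTraces` helper is referenced above but not shown. A hedged sketch of what such a builder might look like, producing parameterized ClickHouse SQL so user input never lands in the query string (the filter shape and function name are assumptions, not the real `lib/clickhouse.ts`):

```typescript
// Hypothetical filter shape matching the route above (searchParams.get
// returns string | null); this is a sketch, not the real lib/clickhouse.ts.
interface TraceFilters {
  projectId: string | null;
  status?: string | null;
  startDate?: string | null;
  endDate?: string | null;
  limit: number;
}

// Build a parameterized query using ClickHouse's {name:Type} placeholder
// syntax; actual execution would go through a client library.
function buildTraceQuery(f: TraceFilters) {
  const where: string[] = ['project_id = {projectId:String}'];
  const params: Record<string, string | number> = {
    projectId: f.projectId ?? '',
    limit: f.limit,
  };
  if (f.status) { where.push('status = {status:String}'); params.status = f.status; }
  if (f.startDate) { where.push('timestamp >= {start:DateTime64(3)}'); params.start = f.startDate; }
  if (f.endDate) { where.push('timestamp <= {end:DateTime64(3)}'); params.end = f.endDate; }

  const sql =
    `SELECT trace_id, timestamp, status, total_tokens, total_cost ` +
    `FROM traces WHERE ${where.join(' AND ')} ` +
    `ORDER BY timestamp DESC LIMIT {limit:UInt32}`;
  return { sql, params };
}
```

Keeping the builder a pure function also makes it trivially unit-testable without a running ClickHouse.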
The Evaluation System
Most “eval frameworks” are glorified pytest wrappers. Neon’s evals are designed for agent-specific challenges.
Scorers
Agents aren’t just “right” or “wrong.” They make sequences of decisions:
| Scorer | Question It Answers |
|---|---|
| `ToolSelectionScorer` | Did the agent pick the right tools? |
| `ReasoningQualityScorer` | Is the chain-of-thought coherent? |
| `GroundingScorer` | Are claims supported by tool outputs? |
| `TerminationScorer` | Did it stop at the right time? |
| `EfficiencyScorer` | Were there unnecessary steps? |
Each scorer can use either deterministic checks (expected tools = actual tools) or LLM judges (rate this reasoning 0-1).
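A deterministic tool-selection check can be as simple as set overlap between expected and actual tool calls. A sketch; the 0-to-1 Jaccard scoring here is an assumption for illustration, not the SDK's actual rubric:

```typescript
// Sketch of a deterministic tool-selection scorer: Jaccard overlap between
// the tools the agent called and the tools the test case expected.
// The real @neon/sdk scorer interface may differ; this is illustrative.
function toolSelectionScore(expected: string[], actual: string[]): number {
  const e = new Set(expected);
  const a = new Set(actual);
  if (e.size === 0 && a.size === 0) return 1; // nothing expected, nothing called
  let intersection = 0;
  for (const tool of a) if (e.has(tool)) intersection++;
  const union = new Set([...e, ...a]).size;
  return intersection / union;
}
```

A score of 1 means the agent called exactly the expected tools; extra or missing calls pull the score toward 0.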
Regression Detection
The killer feature: comparing agent versions systematically.
# Run evals on PR branch
npx neon eval --suite core-tests --version pr-123
# Compare to main
npx neon compare --baseline main --candidate pr-123 --threshold 0.95
If any scorer drops more than 5%, the PR check fails. No more “ship and pray.”
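The gate behind `--threshold 0.95` can be sketched as a per-scorer ratio check. The score-map shapes and function name here are assumptions, not the real CLI's internals:

```typescript
// Sketch of the regression gate: a candidate fails if any scorer's mean
// score falls below threshold x the baseline's mean for that scorer.
// Maps are keyed by scorer name; shapes are assumed for illustration.
function detectRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  threshold = 0.95,
): string[] {
  const regressed: string[] = [];
  for (const [scorer, base] of Object.entries(baseline)) {
    const cand = candidate[scorer] ?? 0; // a missing scorer counts as a regression
    if (cand < base * threshold) regressed.push(scorer);
  }
  return regressed;
}
```

A non-empty result maps to a failing PR check.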
CI/CD Integration
GitHub Action that gates deploys:
- name: Agent Quality Check
  uses: neon/eval-action@v1
  with:
    suite: core-tests
    agent-path: ./src/agent.py
    threshold: '0.95'
    fail-on-regression: 'true'
What Makes This Different
| Capability | Langfuse | Braintrust | Neon |
|---|---|---|---|
| Trace Collection | ✅ | ✅ | ✅ |
| OTel Ingestion | ✅ | ❌ | ✅ |
| Evaluation | ✅ | ✅ | ✅ |
| Durable Execution | ❌ | ❌ | ✅ Temporal |
| Time-Travel Debug | ❌ | ❌ | ✅ |
| Human-in-the-Loop | ❌ | ❌ | ✅ |
| Self-Hosted | ✅ | ❌ | ✅ |
| Sub-100ms Queries | ❌ | ❌ | ✅ ClickHouse |
The fundamental difference: Langfuse and Braintrust are observability tools. Neon is an operations platform. It doesn’t just show you what happened—it gives you control over execution.
Project Structure
neon/
├── frontend/ # Next.js 15 (UI + API)
│ ├── app/
│ │ ├── api/ # REST endpoints
│ │ │ ├── traces/ # Trace CRUD + ingestion
│ │ │ ├── spans/ # Span queries
│ │ │ ├── scores/ # Score management
│ │ │ └── dashboard/ # Aggregations
│ │ ├── traces/ # Trace viewer UI
│ │ ├── workflows/ # Temporal workflow UI
│ │ └── compare/ # A/B comparison
│ └── lib/
│ ├── clickhouse.ts # Query builders (800 LOC)
│ └── temporal.ts # Workflow client
│
├── temporal-workers/ # Durable execution
│ └── src/
│ ├── workflows/ # Agent + eval workflows
│ └── activities/ # LLM calls, tools
│
├── packages/sdk/ # @neon/sdk
│ ├── scorers/ # Built-in scorers
│ ├── test/ # defineTest, datasets
│ └── cli/ # npx neon eval
│
├── scripts/
│ └── clickhouse-init.sql # Schema (400 LOC)
│
└── docker-compose.yml # One-command local setup
Tech Stack
| Layer | Choice | Why |
|---|---|---|
| Language | TypeScript | End-to-end type safety, from ClickHouse schema to React props |
| Frontend | Next.js 15, React 19 | Server components, streaming, single deploy unit |
| Trace Storage | ClickHouse | Sub-100ms on millions of rows, 10x compression |
| Orchestration | Temporal | True durability, not just retries |
| Metadata | PostgreSQL | Projects, configs, user settings |
| Streaming | Redpanda (optional) | High-throughput ingestion when needed |
| Infra | Docker Compose | `docker compose up` and you’re running |
Current Status
What’s built:
- ClickHouse trace/span/score storage with materialized views
- Next.js dashboard with trace viewer, span details, score trends
- API routes for ingestion and queries
- Docker Compose with all infrastructure
- Lazy loading for large traces (PERF-004)
- Dashboard aggregations with percentiles
What’s next:
- Temporal eval workflows (in progress)
- @neon/sdk package with scorers
- GitHub Action for CI gating
- A/B comparison UI
- Dataset management
Try It
git clone https://github.com/Sean-Koval/neon.git
cd neon
# Start everything
docker compose up -d
# Open dashboard
open http://localhost:3000
# Send a test trace
curl -X POST http://localhost:3000/api/traces/ingest \
-H "Content-Type: application/json" \
-H "x-project-id: demo" \
-d '{"trace_id": "test-001", "name": "my-agent", "status": "ok"}'
Links
- Repository: github.com/Sean-Koval/neon