Neon
Agent Ops platform that combines durable workflow execution (Temporal) with real-time observability (ClickHouse) to make AI agents reliable enough for production.
Define Stage
Declarative agent configuration: system prompts, tool schemas, guardrails, and evaluation criteria. Version-controlled alongside code.
Execute Stage
Durable workflow execution via Temporal. Agent runs survive crashes, rate limits, and restarts. Every tool call is a recoverable checkpoint.
Observe Stage
Full OpenTelemetry instrumentation: every LLM call, tool invocation, and decision point emits structured traces and spans.
Evaluate Stage
Async evaluation workers score agent outputs against defined criteria. Runs factuality, relevance, and safety checks in parallel.
Optimize Stage
Feedback from evaluations drives prompt optimization, tool selection refinement, and guardrail tuning. Closes the improvement loop.
Agent Runtime
LLM-powered agent with structured tool access. Supports multi-turn conversations, parallel tool calls, and streaming responses.
OTel Collector
OpenTelemetry collector aggregates traces, metrics, and logs from all agent instances. Batches and routes to downstream storage.
Ingestion API
High-throughput API gateway that validates, transforms, and routes telemetry data. Handles backpressure with async buffering.
ClickHouse
Columnar analytics database optimized for time-series trace data. Sub-second queries over billions of spans with materialized views.
Real-time Dashboard
Live monitoring of agent performance: latency percentiles, token usage, error rates, and evaluation scores. Drill into individual traces.
Temporal Workers
Durable execution engine. Workflows persist state across failures — if an agent crashes mid-tool-call, it resumes exactly where it left off.
Evaluation Workers
Distributed eval jobs that score agent outputs. Support custom rubrics, factuality checks, and regression testing against golden datasets.
The Problem
AI agents fail in ways that are hard to debug and impossible to reproduce.
A customer reports that an agent gave a wrong answer. You check the logs—nothing. The conversation history is gone, the intermediate tool calls vanished, and you’re left guessing which of the 47 LLM calls went sideways. Even if you find the issue, you can’t replay it because the agent hit a rate limit, crashed mid-execution, or the external API returned something different.
Traditional observability tools (Langfuse, Braintrust) help you see failures after they happen. But agents need something different:
- Durability — When an agent crashes at step 15 of 20, it should resume from step 15, not restart.
- Time-travel debugging — Replay the exact sequence of events that led to a failure.
- Systematic evaluation — Catch regressions before deploy, not after customers complain.
That’s what Neon does.
How It Works
Neon operates in two modes depending on your constraints:
Mode 1: Observe-Only (Bring Your Own Agent)
Your agents run wherever they run—Lambda, Cloud Run, K8s. You instrument them with OpenTelemetry, and Neon ingests the traces:
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

@tracer.start_as_current_span("agent-run")
async def run_agent(query: str):
    # Your existing agent code, unchanged
    response = await llm.generate(query)
    return response
Every LLM call, tool invocation, and retrieval gets captured with full inputs/outputs. ClickHouse stores it all with sub-100ms query latency, even at millions of traces.
Mode 2: Managed Execution (Temporal)
For agents that need durability, run them inside Neon’s Temporal workflows:
export async function agentWorkflow(params: AgentInput) {
  // Step 1: Call LLM
  const plan = await llmCall({ model: 'claude-3-5-sonnet', messages: params.messages });

  // Step 2: Execute tools (each is durable)
  const results = [];
  for (const tool of plan.tools) {
    results.push(await executeToolActivity(tool)); // Survives crashes
  }

  // Step 3: Human approval gate (optional)
  if (params.requiresApproval) {
    await condition(() => approvalReceived, '7 days'); // approvalReceived is set by a signal handler
  }

  return results;
}
If the process crashes at step 2, Temporal resumes from exactly where it left off. Rate limited? It retries with backoff. Need human approval? The workflow pauses for up to 7 days and resumes when approved.
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ NEON PLATFORM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ DEFINE │────▶│ EXECUTE │────▶│ OBSERVE │────▶│ EVALUATE │ │
│ │ │ │ │ │ │ │ │ │
│ │ • Agents │ │ • Temporal │ │ • ClickHouse│ │ • SDK │ │
│ │ • Test cases│ │ • Workers │ │ • Real-time │ │ • Scorers │ │
│ │ • Scorers │ │ • Durable │ │ • OTel │ │ • CI/CD │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ └───────────────────┴───────────────────┴───────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ OPTIMIZE │ │
│ │ │ │
│ │ • A/B compare │ │
│ │ • Regression │ │
│ │ • Insights │ │
│ └───────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Data Flow:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────┐
│ Your Agent │─OTel─│ /api/traces │─────▶│ ClickHouse │─────▶│Dashboard│
│ (anywhere) │ │ ingestion │ │ (storage) │ │ (UI) │
└──────────────┘ └──────────────┘ └──────────────┘ └─────────┘
│
▼
┌──────────────┐ ┌──────────────┐
│ Temporal │◀────▶│ Eval Workers │
│ (durable) │ │ (scorers) │
└──────────────┘ └──────────────┘
Why These Technologies
ClickHouse for Trace Storage
The problem with trace storage is scale. A single agent run might generate 50+ spans, each with full LLM inputs/outputs (easily 10KB+ per span). At 1000 runs/day, you’re storing 500MB of trace data daily. Traditional databases choke.
ClickHouse handles this because:
- Columnar storage: Queries that filter by `project_id` and `timestamp` only read those columns, not the massive `input`/`output` text fields.
- Compression: 10-20x compression on text data. That 500MB becomes 25-50MB on disk.
- Materialized views: Pre-aggregate daily stats, score trends, and percentiles. Dashboard queries hit small summary tables, not raw data.
- Skip indexes: Bloom filters on `trace_id` mean direct lookups skip 99% of data blocks.
The schema is designed for agent-specific queries:
CREATE TABLE traces (
    project_id String,
    trace_id String,
    timestamp DateTime64(3),
    status Enum8('ok' = 1, 'error' = 2),

    -- Agent context
    agent_id String,
    agent_version String,
    workflow_id String,

    -- Aggregates (updated on completion)
    total_tokens UInt64,
    total_cost Decimal(12, 6),
    llm_calls UInt16,
    tool_calls UInt16,

    INDEX idx_trace_id trace_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (project_id, timestamp, trace_id)
TTL timestamp + INTERVAL 90 DAY;
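The aggregate columns marked "updated on completion" can be rolled up from a run's spans before the completed trace row is written. A minimal sketch of that roll-up; the `Span` shape and `summarizeTrace` helper are illustrative, not Neon's actual ingestion code:

```typescript
// Illustrative span shape; output field names mirror the traces table above,
// but this helper is a sketch, not part of the platform's SDK.
interface Span {
  kind: 'llm' | 'tool' | 'other';
  tokens: number;  // prompt + completion tokens for LLM spans, else 0
  costUsd: number; // estimated cost attributed to this span
}

function summarizeTrace(spans: Span[]) {
  return {
    total_tokens: spans.reduce((sum, s) => sum + s.tokens, 0),
    total_cost: spans.reduce((sum, s) => sum + s.costUsd, 0),
    llm_calls: spans.filter((s) => s.kind === 'llm').length,
    tool_calls: spans.filter((s) => s.kind === 'tool').length,
  };
}
```

Pre-computing these roll-ups is what keeps dashboard queries on the traces table cheap: filtering and sorting never has to touch the raw span payloads.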
Temporal for Durable Execution
Temporal isn’t just “a job queue.” It’s a fundamentally different execution model.
Traditional approach:
Agent starts → Calls LLM → Process crashes → Agent restarts from beginning
Temporal approach:
Agent starts → Calls LLM (recorded) → Process crashes →
New worker picks up → Replays history → Resumes from exact position
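The replay idea can be illustrated with a toy event-history sketch. This is not the Temporal SDK; it only shows the core mechanism that completed steps are recorded in durable history and replayed rather than re-executed:

```typescript
// Toy replay sketch (not the Temporal SDK): each completed step's result is
// appended to a durable history; on restart, recorded results are replayed
// instead of re-running side effects, so execution resumes at the first gap.
type History = Map<number, string>;

async function runWorkflow(
  steps: Array<() => Promise<string>>,
  history: History, // persisted across crashes
): Promise<string[]> {
  const results: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    if (history.has(i)) {
      results.push(history.get(i)!); // replay: no side effect re-runs
    } else {
      const result = await steps[i](); // execute and record
      history.set(i, result);
      results.push(result);
    }
  }
  return results;
}
```

If the process dies after step 1, a new worker handed the same history re-emits step 1's recorded result and only executes step 2.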
This enables patterns that are impossible otherwise:
Human-in-the-loop: Pause a workflow for human approval, resume days later.
await condition(() => humanApproved, '7 days');
Automatic retry with backoff: Rate limited? Temporal retries with exponential backoff.
const result = await llmCall(input, {
  retry: { maxAttempts: 5, backoffCoefficient: 2 }
});
Long-running agents: Workflows can run for hours or days without holding connections open.
Next.js for API + UI
The frontend is a Next.js 15 app that handles both the dashboard UI and API routes. Why not separate services?
- Simpler deployment: One container, one deploy, one set of environment variables.
- Type sharing: TypeScript types flow from ClickHouse schema → API routes → React components.
- React 19 features: Server components for initial data fetching, streaming for trace viewer.
The API routes are thin—mostly query builders that translate REST params to ClickHouse SQL:
// app/api/traces/route.ts
export async function GET(request: Request) {
  const { searchParams } = new URL(request.url);

  const traces = await queryTraces({
    projectId: searchParams.get('project_id'),
    status: searchParams.get('status'),
    startDate: searchParams.get('start'),
    endDate: searchParams.get('end'),
    limit: parseInt(searchParams.get('limit') || '50'),
  });

  return Response.json(traces);
}
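The `queryTraces` helper is referenced above but not shown. A hedged sketch of what such a builder might look like, producing parameterized ClickHouse SQL so user input never lands in the query string (the filter shape and function name are assumptions, not the real `lib/clickhouse.ts`):

```typescript
// Hypothetical filter shape matching the route above (searchParams.get
// returns string | null); this is a sketch, not the real lib/clickhouse.ts.
interface TraceFilters {
  projectId: string | null;
  status?: string | null;
  startDate?: string | null;
  endDate?: string | null;
  limit: number;
}

// Build a parameterized query using ClickHouse's {name:Type} placeholder
// syntax; actual execution would go through a client library.
function buildTraceQuery(f: TraceFilters) {
  const where: string[] = ['project_id = {projectId:String}'];
  const params: Record<string, string | number> = {
    projectId: f.projectId ?? '',
    limit: f.limit,
  };
  if (f.status) { where.push('status = {status:String}'); params.status = f.status; }
  if (f.startDate) { where.push('timestamp >= {start:DateTime64(3)}'); params.start = f.startDate; }
  if (f.endDate) { where.push('timestamp <= {end:DateTime64(3)}'); params.end = f.endDate; }

  const sql =
    `SELECT trace_id, timestamp, status, total_tokens, total_cost ` +
    `FROM traces WHERE ${where.join(' AND ')} ` +
    `ORDER BY timestamp DESC LIMIT {limit:UInt32}`;
  return { sql, params };
}
```

Keeping the builder a pure function also makes it trivially unit-testable without a running ClickHouse.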
The Evaluation System
Most “eval frameworks” are glorified pytest wrappers. Neon’s evals are designed for agent-specific challenges.
Scorers
Agents aren’t just “right” or “wrong.” They make sequences of decisions:
| Scorer | Question It Answers |
|---|---|
| `ToolSelectionScorer` | Did the agent pick the right tools? |
| `ReasoningQualityScorer` | Is the chain-of-thought coherent? |
| `GroundingScorer` | Are claims supported by tool outputs? |
| `TerminationScorer` | Did it stop at the right time? |
| `EfficiencyScorer` | Were there unnecessary steps? |
Each scorer can use either deterministic checks (expected tools = actual tools) or LLM judges (rate this reasoning 0-1).
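A deterministic tool-selection check can be as simple as set overlap between expected and actual tool calls. A sketch; the 0-to-1 Jaccard scoring here is an assumption for illustration, not the SDK's actual rubric:

```typescript
// Sketch of a deterministic tool-selection scorer: Jaccard overlap between
// the tools the agent called and the tools the test case expected.
// The real @neon/sdk scorer interface may differ; this is illustrative.
function toolSelectionScore(expected: string[], actual: string[]): number {
  const e = new Set(expected);
  const a = new Set(actual);
  if (e.size === 0 && a.size === 0) return 1; // nothing expected, nothing called
  let intersection = 0;
  for (const tool of a) if (e.has(tool)) intersection++;
  const union = new Set([...e, ...a]).size;
  return intersection / union;
}
```

A score of 1 means the agent called exactly the expected tools; extra or missing calls pull the score toward 0.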
Regression Detection
The killer feature: comparing agent versions systematically.
# Run evals on PR branch
npx neon eval --suite core-tests --version pr-123
# Compare to main
npx neon compare --baseline main --candidate pr-123 --threshold 0.95
If any scorer drops more than 5%, the PR check fails. No more “ship and pray.”
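The gate behind `--threshold 0.95` can be sketched as a per-scorer ratio check. The score-map shapes and function name here are assumptions, not the real CLI's internals:

```typescript
// Sketch of the regression gate: a candidate fails if any scorer's mean
// score falls below threshold x the baseline's mean for that scorer.
// Maps are keyed by scorer name; shapes are assumed for illustration.
function detectRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  threshold = 0.95,
): string[] {
  const regressed: string[] = [];
  for (const [scorer, base] of Object.entries(baseline)) {
    const cand = candidate[scorer] ?? 0; // a missing scorer counts as a regression
    if (cand < base * threshold) regressed.push(scorer);
  }
  return regressed;
}
```

A non-empty result maps to a failing PR check.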
CI/CD Integration
GitHub Action that gates deploys:
- name: Agent Quality Check
  uses: neon/eval-action@v1
  with:
    suite: core-tests
    agent-path: ./src/agent.py
    threshold: '0.95'
    fail-on-regression: 'true'
What Makes This Different
| Capability | Langfuse | Braintrust | Neon |
|---|---|---|---|
| Trace Collection | ✅ | ✅ | ✅ |
| OTel Ingestion | ✅ | ❌ | ✅ |
| Evaluation | ✅ | ✅ | ✅ |
| Durable Execution | ❌ | ❌ | ✅ Temporal |
| Time-Travel Debug | ❌ | ❌ | ✅ |
| Human-in-the-Loop | ❌ | ❌ | ✅ |
| Self-Hosted | ✅ | ❌ | ✅ |
| Sub-100ms Queries | ❌ | ❌ | ✅ ClickHouse |
The fundamental difference: Langfuse and Braintrust are observability tools. Neon is an operations platform. It doesn’t just show you what happened—it gives you control over execution.
Project Structure
neon/
├── frontend/ # Next.js 15 (UI + API)
│ ├── app/
│ │ ├── api/ # REST endpoints
│ │ │ ├── traces/ # Trace CRUD + ingestion
│ │ │ ├── spans/ # Span queries
│ │ │ ├── scores/ # Score management
│ │ │ └── dashboard/ # Aggregations
│ │ ├── traces/ # Trace viewer UI
│ │ ├── workflows/ # Temporal workflow UI
│ │ └── compare/ # A/B comparison
│ └── lib/
│ ├── clickhouse.ts # Query builders (800 LOC)
│ └── temporal.ts # Workflow client
│
├── temporal-workers/ # Durable execution
│ └── src/
│ ├── workflows/ # Agent + eval workflows
│ └── activities/ # LLM calls, tools
│
├── packages/sdk/ # @neon/sdk
│ ├── scorers/ # Built-in scorers
│ ├── test/ # defineTest, datasets
│ └── cli/ # npx neon eval
│
├── scripts/
│ └── clickhouse-init.sql # Schema (400 LOC)
│
└── docker-compose.yml # One-command local setup
Tech Stack
| Layer | Choice | Why |
|---|---|---|
| Language | TypeScript | End-to-end type safety, from ClickHouse schema to React props |
| Frontend | Next.js 15, React 19 | Server components, streaming, single deploy unit |
| Trace Storage | ClickHouse | Sub-100ms on millions of rows, 10x compression |
| Orchestration | Temporal | True durability, not just retries |
| Metadata | PostgreSQL | Projects, configs, user settings |
| Streaming | Redpanda (optional) | High-throughput ingestion when needed |
| Infra | Docker Compose | `docker compose up` and you’re running |
Current Status
What’s built:
- ClickHouse trace/span/score storage with materialized views
- Next.js dashboard with trace viewer, span details, score trends
- API routes for ingestion and queries
- Docker Compose with all infrastructure
- Lazy loading for large traces (PERF-004)
- Dashboard aggregations with percentiles
What’s next:
- Temporal eval workflows (in progress)
- @neon/sdk package with scorers
- GitHub Action for CI gating
- A/B comparison UI
- Dataset management
Try It
git clone https://github.com/Sean-Koval/neon.git
cd neon
# Start everything
docker compose up -d
# Open dashboard
open http://localhost:3000
# Send a test trace
curl -X POST http://localhost:3000/api/traces/ingest \
-H "Content-Type: application/json" \
-H "x-project-id: demo" \
-d '{"trace_id": "test-001", "name": "my-agent", "status": "ok"}'
Links
- Repository: github.com/Sean-Koval/neon