A self-hosted OTLP observability platform in a single Go binary. This guide covers first-run, production checklist, data layout, backup, incident response, and known limits.
For AI-agent-oriented architecture context see ../CLAUDE.md. For the canonical env-var reference see ../.env.example.
```bash
./otelcontext
```

What happens:

- `.env` is loaded if present; otherwise defaults apply.
- GORM `AutoMigrate` creates tables in `otelcontext.db` in the working directory.
- `DEFAULT_TENANT=default` is assigned to all rows ingested without an explicit tenant header.
- `API_KEY` is empty → auth middleware is a pass-through (every request allowed). A warning is logged.
- No TLS is configured → HTTP and gRPC listen in plaintext. Dev only.
- Listeners: HTTP `:8080`, gRPC `:4317`, Prometheus `/metrics`, liveness `/live`, readiness `/ready`.
```bash
DB_DRIVER=postgres \
DB_DSN="host=localhost user=otel password=otel dbname=otelcontext port=5432 sslmode=disable TimeZone=UTC" \
DB_AUTOMIGRATE=false \
API_KEY="$(openssl rand -hex 32)" \
./otelcontext
```

Set `DB_AUTOMIGRATE=false` in production. AutoMigrate locks tables and has no rollback; run schema changes out-of-band (Flyway, goose, sqlc migrate, etc.).
Two paths:

- Explicit cert files (wins if set):

  `TLS_CERT_FILE=/etc/otelcontext/tls/server.crt TLS_KEY_FILE=/etc/otelcontext/tls/server.key`

  Both must be set together; both must exist and be readable at startup.

- Auto self-signed (dev/internal only):

  `TLS_AUTO_SELFSIGNED=true TLS_CACHE_DIR=./data/tls`

  Generates an ECDSA-P256 cert on first start, caches it under `TLS_CACHE_DIR`, and regenerates it on expiry. Clients must trust the generated material (insecure-skip or CA pin).

When TLS is enabled, both HTTP (`:8080`) and gRPC (`:4317`) serve TLS only.
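The precedence between the two paths can be sketched as follows. This is an illustration of the rules described above, not the actual startup code; the function name and return values are invented for the example:

```python
def resolve_tls_mode(env: dict) -> str:
    """Illustrative sketch of the documented TLS precedence (not the real startup code).
    Explicit cert files win; both must be set together. Self-signed is the opt-in fallback."""
    cert, key = env.get("TLS_CERT_FILE"), env.get("TLS_KEY_FILE")
    if cert or key:
        if not (cert and key):
            # mirrors the startup requirement that both files be set together
            raise ValueError("TLS_CERT_FILE and TLS_KEY_FILE must be set together")
        return "explicit"          # wins even if TLS_AUTO_SELFSIGNED is also set
    if env.get("TLS_AUTO_SELFSIGNED") == "true":
        return "self-signed"       # cached under TLS_CACHE_DIR, regenerated on expiry
    return "plaintext"             # dev only

assert resolve_tls_mode({"TLS_CERT_FILE": "s.crt", "TLS_KEY_FILE": "s.key",
                         "TLS_AUTO_SELFSIGNED": "true"}) == "explicit"
```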
```bash
DB_DRIVER=postgres
DB_AZURE_AUTH=true
DB_DSN="host=my-server.postgres.database.azure.com user=my-mi@tenant.onmicrosoft.com dbname=otelcontext port=5432 sslmode=require"
```

- The DSN must omit the password; the `user` field is the Entra principal.
- `sslmode` must be `require`, `verify-ca`, or `verify-full` — weaker modes are rejected at startup.
- Credential resolution order: env vars → workload identity → managed identity → Azure CLI → developer credentials.
- Local dev: `az login` is sufficient.
- AKS: use workload identity or pod-managed identity.
- `DB_CONN_MAX_LIFETIME` is internally capped to 30 minutes when Entra auth is active (tokens expire).
- `API_KEY` — long random string. Without it, anyone on the network can query or ingest.
- `DB_DRIVER=postgres` (or another persistent driver). SQLite is fine for small single-node deployments; plan for it accordingly.
- `DB_DSN` — with strict TLS when crossing a network boundary.
- `TLS_CERT_FILE` + `TLS_KEY_FILE` (or `TLS_AUTO_SELFSIGNED=true` for internal-only deployments).
- `HOT_RETENTION_DAYS` — pick a value you can defend. Default 7 is reasonable; range is 1..36500.
- `DB_MAX_OPEN_CONNS` — size to match your Postgres pool and expected ingest concurrency.
- `DEFAULT_TENANT` — a non-`default` value if the deployment serves a specific tenant.
- `OTEL_EXPORTER_OTLP_ENDPOINT` — enables self-instrumentation. Set to `localhost:4317` to dogfood into the same instance.
- `DB_AUTOMIGRATE=false` for Postgres in production.
- `METRIC_MAX_CARDINALITY=10000`, `METRIC_MAX_CARDINALITY_PER_TENANT=0` (unlimited per-tenant by default). For multi-tenant deployments, set the per-tenant cap to enforce fairness — a noisy tenant gets bounded before exhausting the global pool. Watch `otelcontext_tsdb_cardinality_overflow_by_tenant_total{tenant_id}` to identify offenders.
- `DLQ_MAX_DISK_MB=500`, `DLQ_MAX_FILES=1000`, `DLQ_MAX_RETRIES=10`
- `API_RATE_LIMIT_RPS=100`
- `VECTOR_INDEX_MAX_ENTRIES=100000`
- `SAMPLING_*` (defaults keep 100% + always-on errors)
- `GRAPHRAG_WORKER_COUNT=16`, `GRAPHRAG_EVENT_QUEUE_SIZE=100000` — sized for 100–200 services. Lower for tiny deployments; raise further if `graphrag_events_dropped_total` climbs.
- `INGEST_ASYNC_ENABLED=true`, `INGEST_PIPELINE_QUEUE_SIZE=50000`, `INGEST_PIPELINE_WORKERS=8` — async ingest pipeline. Decouples OTLP `Export()` from DB writes. Backpressure is hybrid: silent drop of healthy traces at ≥90% queue fill, gRPC `RESOURCE_EXHAUSTED` (HTTP `429 Too Many Requests` + `Retry-After: 1` on the OTLP HTTP receiver) at 100%. Disable only to debug the legacy synchronous write path. Watch `otelcontext_ingest_pipeline_dropped_total{signal,reason}`, `otelcontext_ingest_pipeline_queue_depth{signal}`, and `otelcontext_http_otlp_throttled_total{signal}`.
- `GRPC_MAX_RECV_MB=16`, `GRPC_MAX_CONCURRENT_STREAMS=1000` — OTLP gRPC server caps
- `RETENTION_BATCH_SIZE=50000`, `RETENTION_BATCH_SLEEP_MS=1` — purge pacing; raise the sleep for busy production DBs
- `MCP_MAX_CONCURRENT=32`, `MCP_CALL_TIMEOUT_MS=30000`, `MCP_CACHE_TTL_MS=5000` — MCP HTTP streamable robustness. Concurrent `tools/call` invocations are gated by a counting semaphore (returns JSON-RPC `-32000` "server overloaded" past the cap). Per-call deadlines abort runaway tool handlers (returns JSON-RPC `-32001` "call timeout"). Cheap GraphRAG tools (`get_service_map`, `impact_analysis`, `root_cause_analysis`, `get_anomaly_timeline`, `get_service_health`) are memoized for the TTL window, keyed by `(tenant, tool, args)`. Setting any of these to `0` disables that protection.
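The hybrid backpressure policy can be sketched as a single decision function. This is an illustration of the thresholds described above, not the actual pipeline code; the function name and return values are invented:

```python
def ingest_decision(queue_depth: int, queue_size: int, trace_is_healthy: bool) -> str:
    """Sketch of the hybrid backpressure policy: shed healthy traces early,
    hard-reject everything only when the queue is completely full."""
    fill = queue_depth / queue_size
    if fill >= 1.0:
        return "reject"   # gRPC RESOURCE_EXHAUSTED / HTTP 429 + Retry-After: 1
    if fill >= 0.9 and trace_is_healthy:
        return "drop"     # silent soft drop; error traces still get through
    return "enqueue"

# With the default INGEST_PIPELINE_QUEUE_SIZE=50000, soft drop starts at depth 45000:
assert ingest_decision(45000, 50000, trace_is_healthy=True) == "drop"
assert ingest_decision(45000, 50000, trace_is_healthy=False) == "enqueue"
assert ingest_decision(50000, 50000, trace_is_healthy=False) == "reject"
```

Note the asymmetry: between 90% and 100% only healthy traces are shed, which is why error/diagnostic data survives a saturated queue.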
SQLite is rejected at startup when `APP_ENV=production` unless you explicitly opt in with `OTELCONTEXT_ALLOW_SQLITE_PROD=true`. The guard exists because SQLite uses a single writer lock — fine for < ~10 services at low QPS, miserable at scale. Prefer Postgres for anything resembling production.
| Location | What lives here |
|---|---|
| `DB_DSN` (relational) | Logs, traces, spans, metric buckets, investigations, graph snapshots, Drain templates. Single source of truth. |
| `DLQ_PATH` (`./data/dlq` default) | Failed-ingest envelopes awaiting replay. Bounded by `DLQ_MAX_DISK_MB`. |
| `TLS_CACHE_DIR` (`./data/tls` default) | Auto-self-signed cert + key material. |
| Working directory (SQLite only) | `otelcontext.db` when `DB_DRIVER=sqlite`. |
Retention. `RetentionScheduler` runs hourly. It batches `PurgeLogsBatched`, `PurgeTracesBatched`, and `PurgeMetricBucketsBatched` against rows older than `HOT_RETENTION_DAYS`, plus a daily `VACUUM`/`ANALYZE` pass. Purge is cross-tenant (it does not scope by `tenant_id`).
Multi-tenancy. Every row carries a `tenant_id` column. The write path reads `X-Tenant-ID` (HTTP) or `x-tenant-id` (gRPC metadata) and populates the column. The read path attaches the tenant from the request context to every repository query (`Where("tenant_id = ?", ...)`).
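The row-level scoping can be demonstrated with plain SQL. This is a toy sketch using Python's stdlib `sqlite3` with an invented two-column schema (not the real `logs` table), showing why every read must carry the `tenant_id` predicate:

```python
import sqlite3

# Toy schema: every row carries tenant_id, as in the real tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (tenant_id TEXT NOT NULL, body TEXT NOT NULL)")
db.executemany("INSERT INTO logs VALUES (?, ?)", [
    ("acme",    "payment failed"),
    ("acme",    "retry scheduled"),
    ("default", "health check ok"),   # rows ingested without a tenant header
])

# Read path: the tenant from the request context is appended to every query,
# equivalent to GORM's Where("tenant_id = ?", tenant).
tenant = "acme"  # would come from X-Tenant-ID / x-tenant-id
rows = db.execute(
    "SELECT body FROM logs WHERE tenant_id = ?", (tenant,)
).fetchall()
print(rows)  # only acme's rows are visible
```

Without the predicate, the same query would return the `default` tenant's rows too — which is exactly the blanket-access caveat noted under known limits.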
| Setting | Default | When to enable |
|---|---|---|
| `DB_POSTGRES_PARTITIONING` | `""` (off) | High-volume Postgres deployments where row-level retention DELETE on `logs` becomes the bottleneck |
| `DB_PARTITION_LOOKAHEAD_DAYS` | `3` | Future daily partitions to keep staged. Raise if your DB is far from the app or your retention is short |
When `DB_POSTGRES_PARTITIONING=daily`:

- `logs` is provisioned as a `PARTITION BY RANGE (timestamp)` parent with the composite PK `(id, timestamp)`. AutoMigrate sees an existing table and skips the model.
- Initial partitions cover yesterday + today + `lookahead` future days. Yesterday absorbs late-arriving events at the day-boundary rollover.
- The PartitionScheduler runs hourly, idempotently ensures upcoming partitions exist, and drops any partition whose entire range predates the retention cutoff. Drop is `DROP TABLE IF EXISTS <child>` — orders of magnitude faster than the row-by-row DELETE.
- RetentionScheduler skips its `logs` DELETE branch when partitioning is active. `traces` and `metric_buckets` continue to use the existing batched DELETE path.
- `pg_trgm` GIN indexes on `logs.body` and `logs.service_name` are declared on the parent and propagate automatically to current and future partitions.
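The scheduler's steady-state keep window is simple date arithmetic. A sketch (the function name is invented; the window is everything inside retention plus the staged lookahead days):

```python
from datetime import date, timedelta

def partition_window(today: date, retention_days: int, lookahead_days: int) -> list[date]:
    """Days whose daily partitions the scheduler keeps at steady state:
    the retention window plus staged future partitions. Anything older is
    dropped wholesale via DROP TABLE IF EXISTS."""
    start = today - timedelta(days=retention_days)
    end = today + timedelta(days=lookahead_days)
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Defaults: HOT_RETENTION_DAYS=7, DB_PARTITION_LOOKAHEAD_DAYS=3.
window = partition_window(date(2025, 6, 15), retention_days=7, lookahead_days=3)
print(len(window))  # 11 = 7 + 3 + 1, the steady-state value of the partitions-active gauge
```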
Greenfield only. Startup refuses to enable partitioning if `logs` already exists as a non-partitioned table — migrating an unpartitioned `logs` to a partitioned one requires copying rows into a swapped table and is out of scope for the current phase. Drop the table or migrate manually before flipping the flag.
Telemetry:

- `otelcontext_partitions_dropped_total` — increments by `n` when the scheduler drops `n` partitions on a tick
- `otelcontext_partitions_active` — current partition count attached to the parent. Steady-state ≈ `HOT_RETENTION_DAYS + DB_PARTITION_LOOKAHEAD_DAYS + 1`. Alert when this gauge climbs unbounded (drop loop stuck) or falls toward zero (over-aggressive drop)
| Driver | Index | Ranking |
|---|---|---|
| SQLite | FTS5 virtual table `logs_fts` over `(body, service_name)`, kept in sync via AFTER INSERT/DELETE/UPDATE triggers on `logs` | `bm25(logs_fts)` ascending (lower = more relevant) |
| Postgres | `pg_trgm` GIN indexes on `logs.body` and `logs.service_name` | Recency (`timestamp` desc) — substring `ILIKE` |
| MySQL / SQL Server | None — sequential `LIKE` scan | Recency |
The FTS5 path uses `tokenize='porter unicode61 remove_diacritics 2'` — case-insensitive, accent-insensitive, English-stemmed (so `panic` matches `panicked`). User input is escaped and suffixed with `*` so partial words like `conn` still match `connection`. If FTS5 errors at query time, the repository transparently falls back to `LIKE` so a misbehaving index does not surface as a 500 to the API.
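The stemming and prefix behavior can be reproduced with stdlib `sqlite3`, assuming your Python build's bundled SQLite has FTS5 compiled in (most do). The table below is a toy, not the real `logs_fts` schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE demo_fts USING fts5("
    "body, tokenize='porter unicode61 remove_diacritics 2')"
)
db.execute("INSERT INTO demo_fts VALUES ('worker restarted after connection reset')")

# Porter stemming: 'connected' and 'connection' share the stem 'connect',
# so a query for one matches documents containing the other.
hits = db.execute("SELECT body FROM demo_fts WHERE demo_fts MATCH 'connected'").fetchall()
assert len(hits) == 1

# Prefix query: 'conn*' matches 'connection' — this is what suffixing user input with '*' buys.
hits = db.execute("SELECT body FROM demo_fts WHERE demo_fts MATCH 'conn*'").fetchall()
assert len(hits) == 1
```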
The FTS5 table is provisioned automatically by `AutoMigrateModels` on every SQLite boot; setup is idempotent. To rebuild after corruption or a manual schema change:

```sql
INSERT INTO logs_fts(logs_fts) VALUES('rebuild');
```

The Postgres `pg_trgm` path requires the extension; if missing, AutoMigrate logs a warning and `ILIKE` falls back to a sequential scan. To install:

```sql
CREATE EXTENSION pg_trgm;
```

Phase 3b will add Postgres declarative partitioning as an opt-in adapter; at that point the GIN indexes will be created per-partition. There is no migration required to use FTS5 — existing SQLite databases are backfilled the first time the upgraded binary boots.
Online backup (does not block writers):

```bash
sqlite3 otelcontext.db ".backup /backups/otelcontext-$(date +%F).db"
# or
sqlite3 otelcontext.db "VACUUM INTO '/backups/otelcontext-$(date +%F).db'"
```

Restore:

```bash
sqlite3 /backups/otelcontext-YYYY-MM-DD.db "VACUUM INTO './otelcontext.db'"
```

Operator-owned:

```bash
pg_dump -Fc -d otelcontext -f /backups/otelcontext-$(date +%F).dump
```

Restore:

```bash
pg_restore -d otelcontext --clean --if-exists /backups/otelcontext-YYYY-MM-DD.dump
```

- Hourly purge removes rows outside the retention window. If you care about data from within the last hour, back up before the top of the hour.
- Daily is fine for the platform use case (platform state is not the same as user application data).
- Test restore quarterly against a scratch instance.
Diagnostic tree:

- DB unreachable. Check the `otelcontext_db_up` gauge. If 0, the repository lost its connection. Inspect DB logs, network, credentials (especially Entra token refresh).
- GraphRAG wedged. Symptom: `/ready` passes the DB check but latency spikes on MCP tool calls. Restart the process; the graph is rebuilt from the DB on boot.
- DLQ backlog. Compare `otelcontext_dlq_disk_bytes` against `DLQ_MAX_DISK_MB`. If near the cap, downstream replay is failing — check the ingestion target and `otelcontext_dlq_replay_failure_total`.
Check `otelcontext_otlp_payload_rejected_total` (labeled by `reason`: `too_large`, `invalid_content_type`, `decode_error`, etc.). Review recent logs for the specific reject reason.
Alert on both:

- `otelcontext_retention_consecutive_failures > 3`
- `now() - otelcontext_retention_last_success_timestamp > 2h`
Typical causes: DB lock contention, disk full, permissions on the DB file (SQLite).
Grep structured logs for `acquire entra token`. Common causes: expired managed-identity binding, misconfigured `AZURE_CLIENT_ID`, expired `az login` session (dev).
- Prometheus: `/metrics/prometheus` — public by design (no secrets). Scrape from your existing Prometheus / VictoriaMetrics / etc.
- Health probes:
  - `/live` — process is alive (always 200 unless the HTTP server is down).
  - `/ready` — dependencies are healthy (DB reachable, core subsystems running). Use for load-balancer health checks and Kubernetes readiness.
- Key metrics to alert on:
  - `otelcontext_db_up == 0`
  - `otelcontext_dlq_disk_bytes / (DLQ_MAX_DISK_MB * 1024 * 1024) > 0.8`
  - `otelcontext_retention_consecutive_failures > 3`
  - `rate(otelcontext_otlp_payload_rejected_total[5m]) > 0`
  - `rate(otelcontext_dlq_replay_failure_total[5m]) > rate(otelcontext_dlq_replay_success_total[5m])`
  - `rate(otelcontext_graphrag_events_dropped_total[5m]) > 0` — ingestion channel saturated; bump `GRAPHRAG_WORKER_COUNT` or `GRAPHRAG_EVENT_QUEUE_SIZE`
  - `rate(otelcontext_ingest_pipeline_dropped_total{reason="queue_full"}[5m]) > 0` — clients are getting `RESOURCE_EXHAUSTED`; raise `INGEST_PIPELINE_QUEUE_SIZE` or `INGEST_PIPELINE_WORKERS`. Sustained drops mean the DB cannot keep up with the ingest rate.
  - `rate(otelcontext_ingest_pipeline_dropped_total{reason="soft_backpressure"}[5m]) > 0` — pipeline is actively shedding healthy traces; check downstream DB latency or scale workers/queue.
  - `otelcontext_ingest_pipeline_queue_depth / INGEST_PIPELINE_QUEUE_SIZE > 0.7` for >5m — queue trending toward soft drop; capacity is becoming a constraint.
  - `topk(5, sum by (tenant_id) (rate(otelcontext_tsdb_cardinality_overflow_by_tenant_total[5m]))) > 0` — identifies which tenants are exhausting their metric series budget. Combine with `METRIC_MAX_CARDINALITY_PER_TENANT` to enforce fairness.
  - `otelcontext_retention_rows_behind > 1_000_000` — purge is falling behind; tune `RETENTION_BATCH_SIZE` / `RETENTION_BATCH_SLEEP_MS`
  - `otelcontext_db_pool_in_use / otelcontext_db_pool_max_open > 0.9` — pool exhausted; raise `DB_MAX_OPEN_CONNS`
  - `rate(otelcontext_dlq_evicted_total[5m]) > 0` — DLQ is actively dropping entries at cap; replay target is down or slow
  - `rate(otelcontext_dashboard_p99_row_cap_hits_total[1h]) > 0` on SQLite — dataset exceeds the 200k in-memory cap; migrate to Postgres for accurate p99
- Log levels: `LOG_LEVEL=DEBUG` for deep diagnostics, default `INFO`. `WARN` or `ERROR` is too quiet for a running system; avoid in prod.
- Single-instance only. No leader election. Running two replicas against the same DB will double-purge (retention runs on both) and double-snapshot (the GraphRAG snapshot loop runs on both). Use a single replica behind your LB, or shard by tenant.
- Tenant isolation is API-layer. A shared `API_KEY` grants blanket access to every tenant. There is no per-tenant key in the current codebase; isolate tenants at the network/auth layer if that matters.
- No built-in TLS cert rotation beyond `TLS_AUTO_SELFSIGNED` regenerating on expiry. For managed certs, re-mount and restart on rotation.
- GraphRAG is in-memory. The topology is rebuilt from the DB on boot. Very large corpora (millions of services/operations) will extend boot time.
- Cold archive is not part of the current build. Historical data beyond `HOT_RETENTION_DAYS` is deleted, not archived. If you need long-term retention, extend `HOT_RETENTION_DAYS` or export via a downstream pipeline.
The backend is sized for 100–200 services emitting OTLP at commodity rates. A programmatic load simulator ships with the repo to verify this.
```bash
make loadtest-build   # produces bin/loadsim
./bin/loadsim         # 200 producers × 50 spans/sec × 60s against localhost:4317
./bin/loadsim --help  # flags: --endpoint, --services, --rate, --duration, --tenant-id, --warmup
```

The binary is under the `loadtest` build tag — `go build ./...` and `go test ./...` ignore it. `make loadtest` runs a full 60s sweep against localhost:4317.
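Back-of-envelope volume for the default run (the helper function is invented; flag semantics are as documented above):

```python
def expected_spans(services: int, rate_per_service: int, duration_s: int) -> int:
    """Total spans the simulator emits: producers × per-producer rate × run length."""
    return services * rate_per_service * duration_s

# Default run: 200 producers × 50 spans/sec × 60 s.
total = expected_spans(200, 50, 60)
print(total)        # 600000 spans per run
print(total // 60)  # 10000 spans/s aggregate — comfortably under the ~30k spans/s
                    # threshold where a front-side Collector becomes advisable
```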
During a 60s / 200-service run against a warm instance on Postgres:

- Ingestion: no `otlp_payload_rejected_total` samples, no `graphrag_events_dropped_total` samples.
- DB pool: `db_pool_in_use / db_pool_max_open` stays below ~0.8.
- Retention: `retention_rows_behind` stays within one hourly tick of steady state.
- DLQ: zero activity (`dlq_evicted_total`, `dlq_replay_failure_total` unchanged).
- The dashboard p99 gauge updates without hitting the SQLite row cap.
If any of those trip, use the corresponding metric alert from the Observability section above as the entry point.
- Before cutting a release that touches the ingestion path or GraphRAG.
- After tuning any of: `GRAPHRAG_WORKER_COUNT`, `GRPC_MAX_CONCURRENT_STREAMS`, `RETENTION_BATCH_SIZE`, `DB_MAX_OPEN_CONNS`.
- When scaling the deployment past the currently tested envelope (e.g., 500+ services) — expand the simulator's `--services` flag to match.
OtelContext is an OTLP destination, not a collector. Beyond ~150 services emitting unsampled telemetry, put a Collector in front to absorb cardinality, batch efficiently, and drop low-value traces before they hit the DB. SDKs → Collector → OtelContext.
- Aggregate ingest rate exceeds ~30k spans/s — DB writes become the bottleneck before the wire does.
- You need processors OtelContext doesn't run: `tail_sampling`, `batch`, `memory_limiter`, `transform`, `filter`, `attributes`.
- You ingest from non-OTLP sources (Jaeger, Zipkin, Prometheus scrape, Fluent, syslog, Kafka) — OtelContext only speaks OTLP.
- Multi-region: edge Collectors batch + compress before crossing the WAN.
The two highest-impact processors are `tail_sampling` (10–20× volume reduction with full diagnostic value retained) and `batch` (cuts gRPC overhead per span). `memory_limiter` is mandatory in front of any Collector exposed to bursty traffic.
```yaml
# otelcol-edge.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 5000
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-always
        type: latency
        latency: { threshold_ms: 500 }
      - name: probabilistic-healthy
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
  batch:
    send_batch_size: 8192
    send_batch_max_size: 10000
    timeout: 2s

exporters:
  otlp/otelcontext:
    endpoint: otelcontext.internal:4317
    tls:
      insecure: false
    headers:
      authorization: "Bearer ${env:OTELCONTEXT_API_KEY}"
      x-tenant-id: "${env:TENANT_ID}"
    sending_queue:
      enabled: true
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/otelcontext]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/otelcontext]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/otelcontext]
```

- Errors and slow traces are always sampled. OtelContext's internal sampler does the same; keep parity at the edge so error/diagnostic data is never dropped.
- 5% probabilistic sampling on healthy traces is the right default for 150–200 services. Adjust based on the volume you can store within `HOT_RETENTION_DAYS`.
- The `tail_sampling` processor buffers each trace for ~10s (the `decision_wait` setting) so the sampling decision can be made after its spans have arrived. Memory cost: `decision_wait × spans_per_sec × avg_span_size`. Plan for 256 MiB+ on the Collector at 30k spans/s.
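The memory estimate can be computed directly. Note `avg_span_bytes` here is an assumption for illustration — spans commonly run a few hundred bytes to ~1 KiB on the wire; measure your own:

```python
def tail_sampling_buffer_mib(decision_wait_s: float, spans_per_sec: int,
                             avg_span_bytes: int) -> float:
    """Memory held by tail_sampling while traces await a sampling decision:
    decision_wait × spans_per_sec × avg_span_size, converted to MiB."""
    return decision_wait_s * spans_per_sec * avg_span_bytes / (1024 * 1024)

# 10s decision window at 30k spans/s, assuming ~900 bytes per buffered span:
print(round(tail_sampling_buffer_mib(10, 30_000, 900)))  # ≈ 257 MiB — hence "plan for 256 MiB+"
```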
If the edge Collector applies tail-sampling, set `SAMPLING_RATE=1.0` on OtelContext. The SDK → Collector → OtelContext chain should sample exactly once. The default OtelContext config already keeps 100%, so no change is needed unless you previously tuned it.
- Back up the DB (see Backup & Restore above).
- Read the CHANGELOG for breaking changes between your current and target versions.
- SQLite: `DB_AUTOMIGRATE=true` (default) handles schema upgrades in place.
- Postgres in production: keep `DB_AUTOMIGRATE=false` and apply migrations out-of-band before starting the new binary.
- Roll the new binary. Watch `/ready`, `otelcontext_db_up`, and the `retention_*` metrics for the first hour.
If the new version fails to start:
- For SQLite, restore the pre-upgrade backup with `VACUUM INTO`.
- For Postgres, restore via `pg_restore --clean --if-exists`.