Self-hosted observability control plane for multi-tenant teams
Watchdog is built around Grafana, Loki, Tempo, Mimir, and Alertmanager, with application services that add tenancy, access control, alert workflows, and AI-assisted root cause analysis.
If you are new to the project, the simplest way to think about it is this: the Grafana stack is the storage and query layer, and Watchdog is the application layer that makes it practical for teams and enterprises to use that stack together.
In plain terms, this workspace gives you:
- A secure entry point for telemetry ingestion.
- A web UI for logs, traces, dashboards, alert rules, incidents, and RCA.
- A control-plane API that sits in front of the Grafana stack.
- An alerting service that stores channel, rule, silence, and incident state.
- An RCA engine that correlates logs, metrics, and traces to rank possible causes.
This repository is best understood as one product made of several cooperating services.
Watchdog aims to turn the raw LGTM stack into a usable multi-user application.
The base Grafana components already do storage and querying well:
- Loki stores and queries logs.
- Tempo stores and queries traces.
- Mimir stores and evaluates metrics and alert rules.
- Alertmanager handles alert routing and silences.
- Grafana renders dashboards and data sources.
Watchdog adds the pieces those components do not provide as a single opinionated product:
- Authentication and session management.
- User, group, permission, and API key management.
- Tenant-aware OTLP token validation.
- A single UI across observability, alerting, and RCA workflows.
- Shared integrations such as Jira and notification channels.
- Incident lifecycle tracking.
- AI-assisted RCA and anomaly workflows.
| Component | Role |
|---|---|
| `watchdog` | Main FastAPI control plane. Handles auth, users, groups, API keys, Grafana proxy bootstrap, Loki/Tempo/Mimir-facing APIs, system metrics, and secure proxying to Notifier and Resolver. |
| `gatekeeper` | OTLP token validation service for Envoy ext_authz. Validates `x-otlp-token`, applies allowlists and rate limits, and returns `X-Scope-OrgID` for downstream tenancy. |
| `notifier` | Alerting workflow service. Stores and serves alert rules, channels, silences, incidents, and Jira integrations. Consumes Alertmanager webhooks and protects most endpoints with an internal service token. |
| `resolver` | RCA and analysis engine. Reads logs, metrics, and traces from Loki, Mimir, and Tempo; runs anomaly detection and job-based RCA; stores RCA jobs and reports. |
| `ui` | React/Vite frontend. Exposes dashboards, logs, traces, alerts, incidents, integrations, API keys, users/groups, audit views, and RCA pages. |
| `docker-compose.yml` | Local reference deployment for the entire stack. |
| `.env.example` | Environment contract for all services. |
| `tests` | OTEL collector and sample telemetry generators used to feed demo traces and logs into the stack. |
- Watchdog (main control plane): https://github.com/observantio/watchdog
- Ojo (OpenTelemetry agent): https://github.com/observantio/ojo
- Notifier (alerting & incidents): https://github.com/observantio/notifier
- Resolver (RCA / AIops engine): https://github.com/observantio/resolver
This is the main application server.
From the code, it does all of the following:
- Boots the main database schema and auth service.
- Exposes login, logout, registration, OIDC exchange, MFA, user, group, audit, and API key endpoints.
- Stores and resolves the current user context, permissions, and API-key-backed scope.
- Proxies observability operations to Loki, Tempo, Grafana, Alertmanager, and Resolver.
- Exposes `/api/internal/otlp/validate` so Gatekeeper can validate OTLP tokens against Watchdog's auth model.
- Provides `/health` and `/ready` checks and a `/api/system/metrics` endpoint for internal UI metrics.
- Sets security headers, request-size limits, concurrency limits, and CORS.
This service is the telemetry gatekeeper.
It is designed to sit behind Envoy's external authorization hook and does the following:
- Reads `x-otlp-token` from inbound telemetry requests.
- Applies optional IP allowlists.
- Applies request rate limiting.
- Caches token validation results in memory or Redis.
- Calls the Watchdog internal validation API when a cache miss occurs.
- Returns `X-Scope-OrgID` so Loki, Tempo, and Mimir receive the correct tenant scope.
Without this service, the system would still have storage backends, but not a protected multi-tenant OTLP ingestion path.
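The check-cache-then-validate flow described above can be sketched in a few lines. This is an illustrative Python sketch, not Gatekeeper's actual code: the names (`TokenCache`, `authorize`, `validate_remote`, `tenant_id`) and the TTL-based cache are assumptions standing in for the real memory/Redis cache and the call to Watchdog's internal validation API.

```python
import time

# Illustrative sketch of Gatekeeper's cache-then-validate flow.
# All names here are hypothetical, not the real service's API.

class TokenCache:
    """In-memory token cache with a TTL, standing in for memory/Redis."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.entries = {}  # token -> (tenant_id, expires_at)

    def get(self, token):
        entry = self.entries.get(token)
        if entry and entry[1] > time.time():
            return entry[0]
        return None

    def put(self, token, tenant_id):
        self.entries[token] = (tenant_id, time.time() + self.ttl)

def authorize(token, cache, validate_remote):
    """Return tenant headers for a token, calling the control plane on cache miss.

    `validate_remote` stands in for the call to Watchdog's internal
    validation API; it returns a tenant id or None.
    """
    tenant = cache.get(token)
    if tenant is None:
        tenant = validate_remote(token)
        if tenant is None:
            return None  # would translate to a deny from ext_authz
        cache.put(token, tenant)
    # The tenant id becomes the X-Scope-OrgID header downstream.
    return {"X-Scope-OrgID": tenant}

# Usage: the second call is served from the cache.
calls = []
def fake_validate(token):
    calls.append(token)
    return "team-a" if token == "good-token" else None

cache = TokenCache()
print(authorize("good-token", cache, fake_validate))  # {'X-Scope-OrgID': 'team-a'}
print(authorize("good-token", cache, fake_validate))  # cache hit, same result
print(len(calls))  # remote validation happened only once
```

The design point is that a hot token costs one control-plane call per TTL window rather than one per telemetry request.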
This service owns alerting workflows beyond raw Alertmanager delivery.
From the routers and services, it is responsible for:
- CRUD for alert rules.
- Importing rules from YAML, including a dry-run preview flow.
- Syncing rule definitions to Mimir for the target organization.
- CRUD for notification channels such as email, Slack, Teams, webhook, and PagerDuty.
- CRUD for silences.
- Maintaining incidents and enforcing incident lifecycle rules.
- Recording assignment and status changes.
- Sending assignment emails when configured.
- Jira integration management and Jira ticket/comment synchronization.
- Accepting inbound Alertmanager webhooks.
This service is the RCA engine.
It does not replace Loki, Tempo, or Mimir. It reads from them, analyzes their data, and produces reports.
Its responsibilities include:
- Waiting for logs, metrics, and trace backends to become reachable.
- Creating RCA jobs asynchronously.
- Listing and retrieving jobs and saved reports.
- Running anomaly analysis for metrics, logs, and traces.
- Running signal correlation, topology, causal, forecast, and SLO analysis endpoints.
- Storing RCA jobs and reports in its own database.
- Enforcing internal service-to-service auth and tenant-aware permission context.
The frontend is not a demo shell. It is the main operator experience.
The route map shows these primary pages:
- Dashboard: system summary cards and activity widgets.
- Logs: Loki query builder, raw LogQL mode, labels, quick filters, log volume, and saved state.
- Traces: Tempo query and exploration UI with dependency maps.
- Alert Manager: active alerts, alert rules, silences, hidden items, rule import, and rule testing.
- Incidents: incident board with assignment, state changes, notes, Jira actions, and correlation labels.
- Grafana: dashboards, folders, datasources, and a controlled hand-off into Grafana through the auth proxy.
- RCA: job creation, queue view, saved report lookup, root-cause ranking, anomalies, topology, causal analysis, forecast/SLO views, and report deletion.
- Integrations: notification channels and Jira integrations with visibility and sharing controls.
- Users, Groups, API Keys, Audit/Compliance: access-management workflows.
The included docker-compose.yml brings up the full local stack:
- `postgres` for application data.
- `redis` for rate limiting, token cache, and shared ephemeral state.
- `watchdog` as the main API.
- `notifier` for alerts, incidents, and integrations.
- `gateway-auth` for OTLP auth.
- `resolver` for RCA.
- `otlp-gateway` as Envoy on port `4320`.
- `loki`, `tempo`, `mimir`, and `alertmanager` as the storage and routing backends.
- `grafana` plus `grafana-proxy` on port `8080`.
- `ui` on port `5173`.
- `otel-agent` as a local telemetry generator harness, located under the `otel` directory.
| Endpoint | Service | Purpose |
|---|---|---|
| http://localhost:5173 | `ui` | Web UI |
| http://localhost:4319 | `watchdog` | Main API and docs |
| http://localhost:4320 | `otlp-gateway` | OTLP ingress through Envoy |
| http://localhost:4323 | `notifier` | Alerting service |
| http://localhost:8080 | `grafana-proxy` | Browser access to Grafana |
Internal-only services in the default compose layout:
- `gateway-auth` (4321) is reachable on the Docker network, not via host `localhost`.
- `resolver` (4322) is reachable on the Docker network, not via host `localhost`.
If you want Kubernetes deployment instead of Docker Compose, use the chart under charts/observantio.
- Chart path: `charts/observantio`
- Chart docs: `charts/observantio/README.md`
- Installer script: `charts/observantio/installer.sh`

Quick start:

```shell
bash charts/observantio/installer.sh --profile production --foreground
```

Useful installer modes:

- `--profile production` for full production defaults
- `--profile compact` for smaller/constrained clusters
- `--detach` for background port-forwards
- `--no-port-forward` when you only want deployment
- `--remove` to remove the release/namespace (smoke teardown)
Customization points:
- Base values: `charts/observantio/values.yaml`
- Production defaults: `charts/observantio/values-production.yaml`
- Compact overrides: `charts/observantio/values-compact.yaml`
- Image versions: `release/versions.json` and chart values/image tags
The root .env.example is the configuration contract for the whole stack.
It is large because it configures multiple services at once. Read it in these groups:
- Core runtime: host, port, log level, database URLs.
- Auth: JWT signing, bootstrap admin, OIDC, Keycloak, MFA, cookie security.
- Ingestion security: OTLP tokens, gateway allowlists, rate limits, proxy trust settings.
- Service-to-service auth: shared tokens and signing keys for Notifier and Resolver.
- Alerting: channel types, webhook tokens, SMTP settings, Jira support.
- Grafana runtime: admin password, auth proxy config, datasource provisioning.
- Resolver analysis tuning: correlation window, thresholds, timeouts, quality gating.
- Optional Vault and backup settings.
Two practical warnings for new users:
- A few example values are placeholders, not safe defaults. Replace every `replace_with_...` value.
- Some example lines show choices such as `AUTH_PROVIDER=local | oidc | keycloak`. You must replace those with one actual value, for example `AUTH_PROVIDER=local`.
The included installer is meant for evaluation and local testing. If you want to develop the code, the experimental installer is the best starting point: it creates a working `.env` and starts all the required services cleanly for development.
It will:
- Check for required commands.
- Clone missing repos for `resolver` and `notifier`.
- Create or update `.env`.
- Generate secrets and a bootstrap admin account.
- Start the compose stack.
```shell
curl -fsSL https://raw.githubusercontent.com/observantio/watchdog/main/install.py -o install.py && python3 install.py
```

For a manual setup instead:

```shell
git clone https://github.com/observantio/watchdog Observantio
cd Observantio
cp .env.example .env
```

Before you run `docker compose up -d --build`, generate the host-aware observability config files:

```shell
bash scripts/run_optimal_config.sh
```

For local developer tooling, the workspace root and the resolver and notifier service folders each include a `pyproject.toml` with the canonical pytest, coverage, and mypy defaults for that scope.

The root `observantio` package is a meta package for tooling and extras; install it with extras (for example `pip install -e ".[dev]"` or `pip install -e ".[schemathesis]"`) rather than expecting base runtime dependencies.
Then edit .env and set, at minimum:
- Strong Postgres password values.
- `DEFAULT_ADMIN_USERNAME`, `DEFAULT_ADMIN_PASSWORD`, and `DEFAULT_ADMIN_EMAIL`
- `DATA_ENCRYPTION_KEY`
- `DEFAULT_OTLP_TOKEN`
- `GATEWAY_INTERNAL_SERVICE_TOKEN`
- `NOTIFIER_SERVICE_TOKEN` and `NOTIFIER_EXPECTED_SERVICE_TOKEN`
- `RESOLVER_SERVICE_TOKEN` and `RESOLVER_EXPECTED_SERVICE_TOKEN`
- `NOTIFIER_CONTEXT_SIGNING_KEY` and `NOTIFIER_CONTEXT_VERIFY_KEY`
- `RESOLVER_CONTEXT_SIGNING_KEY` and `RESOLVER_CONTEXT_VERIFY_KEY`
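One way to produce strong values for these secrets is with the Python standard library's `secrets` module. The variable names below come from the list above; the generation scheme itself is only a suggestion, and some keys may expect a specific format, so check each variable's comment in `.env.example` before pasting values in.

```python
# Generate strong random values for selected .env secrets using only the
# standard library. The variable names come from .env.example; the
# url-safe-token scheme is a suggestion, not a requirement of the stack.
import secrets

def env_secret(name, nbytes=32):
    """Return a NAME=value line with a ~43-character url-safe random value."""
    return f"{name}={secrets.token_urlsafe(nbytes)}"

for name in (
    "DATA_ENCRYPTION_KEY",
    "DEFAULT_OTLP_TOKEN",
    "GATEWAY_INTERNAL_SERVICE_TOKEN",
    "NOTIFIER_SERVICE_TOKEN",
    "RESOLVER_SERVICE_TOKEN",
):
    print(env_secret(name))
```

Paste the output into `.env`, keeping paired variables (for example `NOTIFIER_SERVICE_TOKEN` and `NOTIFIER_EXPECTED_SERVICE_TOKEN`) consistent with how the services expect them to match.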
Start the stack:
```shell
docker compose up -d --build
```

Check health:

```shell
docker compose ps
curl http://localhost:4319/health
curl http://localhost:4319/ready
curl http://localhost:4323/health
```

For internal services that are not published to host ports (`gateway-auth`, `resolver`), use `docker compose logs` or container-internal checks.
Global quality scripts in `scripts/` support either all services or a single service argument (`resolver`, `gatekeeper`, `notifier`, `watchdog`).
Run all services:
```shell
scripts/run_global_mypy.sh
scripts/run_global_pylint.sh
scripts/run_global_pytests.sh
```

Run one service only:

```shell
scripts/run_global_mypy.sh watchdog
scripts/run_global_pylint.sh watchdog
scripts/run_global_pytests.sh watchdog
```

Use `-h` on each script for the full usage contract and environment options.
- Open `http://localhost:5173`.
- Sign in with the bootstrap admin configured in `.env`.
- Create one or more API keys. These keys are not only UI objects; they drive tenant-scoped access and OTLP token usage.
- Choose which API key should be the default scope in the UI. That choice affects what the frontend queries and where new rules are targeted.
- Use the API Keys page to copy the OTLP token or generate a starter OpenTelemetry Collector YAML file.
- Send telemetry to `http://localhost:4320` with the `x-otlp-token` header.
- Confirm data in Logs and Traces.
- Create or import alert rules, then connect channels and test them.
- Review incident creation and update flows.
- Run an RCA job after data exists.
The included test harness sends example traces and logs through a local OpenTelemetry Collector. If you want to connect your own collector, the important idea is:
- Logs go to `http://localhost:4320/loki`
- Traces go to `http://localhost:4320/tempo`
- Metrics go to `http://localhost:4320/mimir`
- Every request must include `x-otlp-token`
A collector pattern to start from looks like this:
```yaml
exporters:
  otlphttp/logs:
    endpoint: http://localhost:4320/loki
    headers:
      x-otlp-token: YOUR_OTLP_TOKEN
  otlphttp/traces:
    endpoint: http://localhost:4320/tempo
    headers:
      x-otlp-token: YOUR_OTLP_TOKEN
  otlphttp/metrics:
    endpoint: http://localhost:4320/mimir
    headers:
      x-otlp-token: YOUR_OTLP_TOKEN
```

The alerting flow is intentionally opinionated:
- Rules are managed as application objects, not only as raw backend config.
- Rules are synchronized to Mimir for evaluation.
- Active alerts surface in the Watchdog UI.
- Alertmanager webhook events feed Notifier.
- Incidents become first-class objects with assignees, notes, and optional Jira linkage.
If you are new to the rule editor, start from a known-good template, then tune expressions and thresholds for your environment. That approach matches how the stack is built: validate the workflow first, then narrow noise and sensitivity.
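To make "start from a known-good template, then tune" concrete: Mimir evaluates Prometheus-style alerting rules, so a minimal importable rule group looks like the output below. The small helper is illustrative (not part of Watchdog); the rule format itself is the standard Prometheus one, and the example expression and threshold are placeholders to tune for your environment.

```python
# Illustrative helper for the "start from a known-good template" approach:
# emit a minimal Prometheus-style alerting rule group (the format Mimir
# evaluates), with the expression and duration as tunable parameters.

def rule_group_yaml(name, alert, expr, for_duration="5m", severity="warning"):
    return (
        f"groups:\n"
        f"  - name: {name}\n"
        f"    rules:\n"
        f"      - alert: {alert}\n"
        f"        expr: {expr}\n"
        f"        for: {for_duration}\n"
        f"        labels:\n"
        f"          severity: {severity}\n"
    )

# Start broad to validate the end-to-end workflow (rule -> Mimir ->
# Alertmanager -> Notifier), then tighten the expression and threshold.
print(rule_group_yaml(
    name="service-health",
    alert="HighErrorRate",
    expr='sum(rate(http_requests_total{status=~"5.."}[5m])) > 5',
))
```

The YAML import flow with dry-run preview is a good place to paste templates like this before committing them.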
The Dashboard provides a high-level view of platform health, including active alerts, log volume, dashboard count, silence count, datasource count, and overall service status.
If OIDC is enabled, operators are asked to set a backup local password during setup. This supports a fallback to local authentication if the business later decides to change authentication methods.
Dashboard widgets are draggable, so users can reorder components to suit their workflow. The UI also supports easy switching between dark and light themes.
The Logs view provides label discovery, builder-mode filtering, raw LogQL support, log volume visualisation, result browsing, and quick filters.
For most investigations, the quick filters are the fastest way to search text and review log volume over time, making it easier to identify bursts or unusual spikes in activity.
The Traces view provides Tempo-backed trace exploration, direct trace lookup, and a graph view for comparing traces and understanding service relationships.
Operators can filter traces, inspect trace data, and use the dependency map to identify pain points, bottlenecks, and issues in service-to-service data flow.
Alert Manager provides:
- Active alerts
- Alert rules
- Silences
- YAML rule import with preview
- Rule testing
- Hidden and shared object handling
Alerts and silences are fully scoped by tenant and channel configuration. Integrations such as Jira are also scoped appropriately. All related configuration is stored securely and encrypted in PostgreSQL.
The Incidents view provides a board-based operational workflow for managing incidents, including assignment, notes, status updates, and Jira integration.
Operators can create notes, assign incidents to users, and link incidents to Jira so that comments and lifecycle changes remain synchronised across both systems.
The API Keys area provides tenant and product scoping, OTLP token management, key sharing with users and groups, token regeneration, and a downloadable starter OpenTelemetry Collector configuration.
Operators can create a new API key, download a YAML configuration for that key, or use their own collector configuration with the provided token. Once the collector runs with `otelcol-contrib --config otel.yaml`, the platform accepts metrics, logs, and traces, and maps them to the correct organisation or tenant context for retrieval through Mimir, Tempo, Loki, and Resolver.
The Users and Groups section provides user creation, role and permission management, group-based permission inheritance, temporary password reset flows, and membership administration.
Operators can rename users, manage passwords, update permissions and roles, create groups, and assign group permissions that members inherit. A user cannot create a group with permissions higher than their own. The same restriction applies to users with `manage:tenants` capabilities: they can only grant permissions up to their own level.
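The "no higher than your own" rule is essentially a subset check. A minimal sketch, assuming permissions are modeled as a flat set of strings (the real model may be richer, with roles and inheritance):

```python
# Minimal sketch of the permission-ceiling rule: a grantor may only
# grant permissions that are a subset of their own. Flat string sets
# are an assumption; the real permission model may be richer.

def can_grant(grantor_perms: set[str], requested_perms: set[str]) -> bool:
    """True if every requested permission is already held by the grantor."""
    return requested_perms <= grantor_perms

operator = {"read:logs", "read:traces", "manage:alerts"}
print(can_grant(operator, {"read:logs", "manage:alerts"}))  # True
print(can_grant(operator, {"manage:tenants"}))              # False
```

Applying the same check to both group creation and tenant-management grants keeps privilege escalation out of both paths.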
Admins can update the roles of existing members. Only an admin can deactivate another admin, and admins cannot delete other admins.
The Audit and Compliance section provides searchable audit history with filters, detailed inspection, and CSV export for administrative review.
Audit records are not currently designed as immutable at the database level. However, there are no routes or services that allow audit logs to be edited or deleted.
The Grafana section provides controlled management of dashboards, folders, and datasources, along with secure access into the Grafana UI through the auth proxy.
All access is scoped according to the user’s permissions and visibility rights. Folder visibility acts as a container-level boundary for dashboards. If a folder is public, dashboard visibility still depends on the visibility settings of each individual dashboard.
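The folder-then-dashboard rule amounts to a two-level check: the folder must be visible first, and only then is the dashboard's own visibility consulted. A hedged sketch, with field names (`public`, `shared_with`) assumed rather than taken from the actual schema:

```python
# Sketch of the two-level visibility rule described above: folder
# visibility is a container boundary, and a public folder does not
# make its dashboards public. Field names here are assumptions.

def dashboard_visible(folder: dict, dashboard: dict, user: str) -> bool:
    folder_ok = folder.get("public", False) or user in folder.get("shared_with", set())
    if not folder_ok:
        return False  # the folder is the outer boundary
    return dashboard.get("public", False) or user in dashboard.get("shared_with", set())

public_folder = {"public": True}
private_dash = {"public": False, "shared_with": {"alice"}}
print(dashboard_visible(public_folder, private_dash, "alice"))  # True
print(dashboard_visible(public_folder, private_dash, "bob"))    # False
```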
The RCA section provides job creation, queue monitoring, historical report lookup, ranked root causes, anomaly detection, topology views, causal analysis, and forecast/SLO views.
This area is functionally in place, but it still requires real production data for full validation and testing.
There are three different security boundaries in this stack:
1. User-to-application auth. Watchdog handles login, sessions, permissions, API keys, and optional OIDC/Keycloak.
2. Telemetry-ingest auth. Gatekeeper validates `x-otlp-token` before Envoy forwards data to Loki, Tempo, or Mimir.
3. Service-to-service auth. Watchdog talks to Notifier and Resolver using dedicated service tokens and signed context JWTs.
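The signed-context idea in the service-to-service boundary can be illustrated with a stdlib HMAC sketch. This is not Watchdog's actual scheme (the stack uses service tokens plus signed context JWTs); it only shows the shape of the check: the caller signs the tenant/user context, and the receiver verifies it with a corresponding key so tampered context is rejected.

```python
# Illustration of the signed-context idea: the caller signs the tenant/user
# context with a key the receiver can verify. The real services use JWTs;
# this stdlib-only HMAC version just demonstrates the shape of the check.
import hashlib
import hmac
import json

SIGNING_KEY = b"replace-with-context-signing-key"  # placeholder value

def sign_context(context: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the context."""
    payload = json.dumps(context, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_context(context: dict, signature: str, key: bytes) -> bool:
    """Constant-time check that the context matches its signature."""
    return hmac.compare_digest(sign_context(context, key), signature)

ctx = {"tenant": "team-a", "user": "alice", "perms": ["read:alerts"]}
sig = sign_context(ctx, SIGNING_KEY)
print(verify_context(ctx, sig, SIGNING_KEY))           # True
tampered = {**ctx, "tenant": "team-b"}
print(verify_context(tampered, sig, SIGNING_KEY))      # False
```

The practical consequence is the same as in the real stack: a receiver trusts not just that the caller is Watchdog (the service token), but also the specific tenant and user context the call claims to act for.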
- This workspace is well suited for local evaluation, demos, and homelab environments.
- The installer is explicitly experimental.
- The docs in this repository should be treated as the source of truth for this workspace, not older external deployment examples.
- Empty environments will not produce useful RCA. Resolver needs enough logs, metrics, and traces to correlate signals.
- Detailed walkthrough: User Guide
- Environment reference: Example Environment File
- Release deployment and hardening: Deployment Guide
This repository includes Apache 2.0 licensing and notice files in the root and service folders. Review them before redistribution or commercial use.


