diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 192c45f..4ba05d8 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,56 +1,245 @@
 # Contributing to docsiq
-## Prerequisites
+Welcome — and thank you for considering a contribution. This document is
+the concrete guide for getting docsiq building on your machine, making a
+change, and submitting it.
-docsiq requires a **C toolchain at build time** because it uses the
-CGO-backed `github.com/mattn/go-sqlite3` driver (with FTS5) and ships the
-`sqlite-vec` extension as a loadable `.so` / `.dylib`. Pure-Go builds
-(`CGO_ENABLED=0`) are no longer supported.
+## TL;DR
-| OS | Requirement |
-|---------|---------------------------------------------------------|
+```bash
+# 1. Fork + clone
+git clone https://github.com/<your-username>/docsiq && cd docsiq
-| Linux | `build-essential` (gcc, make) — `apt-get install build-essential` |
-| macOS | Xcode Command Line Tools — `xcode-select --install` |
-| Windows | **Not supported.** Do not open issues for Windows. |
+# 2. Install UI deps and build the SPA (needed by the Go embed)
+npm --prefix ui ci
+npm --prefix ui run build
-You also need Go **≥ 1.22** and Node.js 20+ for the UI build.
+# 3. Build and test the Go binary
+CGO_ENABLED=1 go build -tags sqlite_fts5 -o docsiq ./
+CGO_ENABLED=1 go test -tags sqlite_fts5 ./...
-Build locally:
+# 4. Run the UI tests
+npm --prefix ui run typecheck
+npm --prefix ui test -- --run --coverage
-```bash
-make build # CGO_ENABLED=1 go build -tags sqlite_fts5 ./...
-make vet test # same tags, CGO on
 ```
-`go install github.com/RandomCodeSpace/docsiq@latest` continues to
-work for end users provided they have the C toolchain listed above.
+If these all pass, you're ready to make changes.
+
+## Prerequisites
+
+docsiq requires a **C toolchain at build time** because it uses the
+CGO-backed `github.com/mattn/go-sqlite3` driver (with FTS5) and ships the
+`sqlite-vec` extension as a loadable asset. Pure-Go builds
+(`CGO_ENABLED=0`) are not supported. 
+ +| OS | Requirement | +|---------|--------------------------------------------------------------------| +| Linux | `build-essential` (gcc, make) — `apt-get install build-essential` | +| macOS | Xcode Command Line Tools — `xcode-select --install` | +| Windows | **Not supported.** Do not open issues for Windows. | + +You also need: + +- **Go** — version from `go.mod` (`go mod edit -json | jq -r .Go`) or + newer. +- **Node.js** — 20.x or newer, for the Vite-based UI. +- **SQLite FTS5 + sqlite-vec** — both are linked into the binary via the + `sqlite_fts5` build tag and the vendored `sqlite-vec` extension. No + separate install is needed; CGO pulls them in. +- **Git** — 2.30+. ## sqlite-vec prebuilt binaries The `sqlite-vec` loadable extension is embedded into the Go binary via -`internal/sqlitevec/assets/`. Contributors DO NOT need these binaries for -day-to-day development (the runtime gracefully falls back to in-memory -HNSW / brute-force search when the embedded asset is a 0-byte placeholder). -However, release builds must ship the real artefacts — see +`internal/sqlitevec/assets/`. Contributors do **not** need these +binaries for day-to-day development — the runtime gracefully falls back +to in-memory HNSW / brute-force search when the embedded asset is a +0-byte placeholder. Release builds must ship the real artefacts; see `internal/sqlitevec/assets/README.md` for the download / drop-in procedure. -## Committing `ui/dist/` (built UI assets) +## Local dev loop + +docsiq has two surfaces that move at different speeds. + +### Backend (Go) + +```bash +# Build a binary +CGO_ENABLED=1 go build -tags sqlite_fts5 -o docsiq ./ + +# Run all unit tests (fast) +CGO_ENABLED=1 go test -tags sqlite_fts5 ./... + +# Run integration tests with race detector (slow) +CGO_ENABLED=1 go test -tags "sqlite_fts5 integration" -race -timeout 1200s ./... + +# Format + vet +gofmt -s -w . +go vet -tags sqlite_fts5 ./... 
+``` + +A `Makefile` wraps the common targets (`make build`, `make vet test`) +with the correct tags and CGO setting. + +### Frontend (React SPA) + +```bash +cd ui + +# Dev server with HMR against a running `docsiq serve` +npm run dev + +# Typecheck (tsc --noEmit) +npm run typecheck + +# Unit tests with coverage +npm test -- --run --coverage + +# Production build (output to ui/dist/) +npm run build +``` + +The Go binary embeds `ui/dist/` via `//go:embed`, so **any UI change +requires `npm --prefix ui run build` before the change shows up in a +built binary**. For iterative UI work, run `./docsiq serve` in one +terminal and `npm --prefix ui run dev` in another; Vite will proxy API +calls to the backend. + +`ui/dist/` is **not committed** — the repo only ships a tiny +`ui/dist/index.html` placeholder so `//go:embed ui/dist` compiles. CI +rebuilds the UI and passes `ui/dist/` to each Go job as an artifact. + +### End-to-end -docsiq is distributed via `go install`, which cannot run `npm` or `vite`. -The Go binary therefore embeds the pre-built UI from `ui/dist/` via -`ui/embed.go` (`//go:embed dist`), and those build outputs **must be committed -to git**. The root `.gitignore` explicitly un-ignores `ui/dist/` for this -reason. +```bash +# Boot the binary against a fixture data dir +DOCSIQ_DATA_DIR=/tmp/docsiq-dev ./docsiq serve --port 37778 & +cd ui && npm run e2e +``` + +## Pre-commit hooks (optional but recommended) -Whenever you change anything under `ui/src/` (or any file that affects the -Vite build), run the following before committing and include the regenerated -assets in the same commit: +We recommend [pre-commit](https://pre-commit.com/) to keep `gofmt`, +`go vet`, and `prettier` from slipping. 
A minimal config you can drop
+into your fork (not committed to the repo):
+```yaml
+# .pre-commit-config.yaml — keep on your fork, or PR it to the repo
+repos:
+  - repo: https://github.com/dnephin/pre-commit-golang
+    rev: v0.5.1
+    hooks:
+      - id: go-fmt
+      - id: go-vet
+  - repo: https://github.com/pre-commit/mirrors-prettier
+    rev: v3.1.0
+    hooks:
+      - id: prettier
+        files: \.(ts|tsx|js|jsx|css|md)$
 ```
-npm --prefix ui run build && git add ui/dist
+
+Then:
+
+```bash
+pip install pre-commit # or: brew install pre-commit
+pre-commit install
 ```
-CI enforces this via the `ui-dist freshness` workflow
-(`.github/workflows/ui-freshness.yml`), which rebuilds the UI and fails the
-job if the committed `ui/dist/` has drifted from the build output.
+## Commit style
+
+We use **Conventional Commits**. The subject line is:
+
+```
+<type>(<scope>): <subject>
+```
+
+Types we use:
+
+- `feat` — user-facing new capability.
+- `fix` — user-visible bug fix.
+- `refactor` — internal reshuffle, no behaviour change.
+- `perf` — measured speedup or memory reduction.
+- `docs` — documentation only.
+- `test` — tests only, no production code change.
+- `chore` — tooling, CI, dependency bumps.
+- `build` — build system, Makefile, CI pipeline.
+
+Scope (optional) is usually a directory or subsystem —
+`feat(extractor): …`, `fix(ui): …`, `chore(deps): …`.
+
+**Message body** — wrap at 72 chars, explain *why*, not *what*. The diff
+already shows *what*.
+
+**Footer** — when the commit is authored or pair-coded with an AI agent,
+include a `Co-Authored-By:` trailer. Do **not** force-push to shared
+branches (`main`, `release/*`).
+
+## Pull requests
+
+### Before opening
+
+- Branch from `main`. Use a descriptive branch name:
+  `feat/community-summaries`, `fix/mcp-handshake-timeout`, etc.
+- Rebase on top of the latest `main` before opening the PR.
+- Run the full test suite (Go + UI) — `go test`, `npm test`, typecheck,
+  build. CI will run them anyway; running locally is faster feedback.
+- Re-read your own diff. 
`git diff main...HEAD` in the terminal, top to + bottom. Most self-inflicted review comments are caught this way. + +### PR description + +No PR template is configured at this time; use a clear, +conventional-commit-style title and describe: + +1. **Problem / motivation** — one or two sentences on what this PR + changes and why. +2. **Approach** — key design choices you made and anything you + explicitly rejected. +3. **Tests** — which layers are covered (unit / integration / e2e) and + what remains untested. +4. **Screenshots** — for UI changes, include before/after. See + `docs/screenshots/` for capture conventions. +5. **Follow-ups** — known limitations or deferred items (with a link + to an issue where possible). + +### During review + +- Be responsive but not hasty — take time to think about comments. +- If a reviewer is wrong, push back with a clear reason. We prefer + disagreement over silent compliance. + +### After merge + +- Delete your branch. +- If your PR earned a release-worthy mention, add a line to the next + release's draft notes (maintainers can help). + +## Coding conventions + +- **Go** — follow `gofmt`; `go vet` must pass; error wrapping is + `fmt.Errorf("context: %w", err)`; logging via `slog` with the emoji + prefixes used elsewhere (📄 ✅ ⚠️ ❌ 🔗 🧩 💾 🌐 ⏭️ ⚙️); concurrency + uses semaphore channels (`make(chan struct{}, N)`) for bounded + parallelism. +- **TypeScript** — strict mode on; prefer explicit types over `any`; + CSS lives in `globals.css` `@layer components`, JSX uses semantic + class names only — Tailwind utilities stay inside shadcn primitives. +- **Tests** — write tests for new logic, including failure paths, not + just happy path. Flaky tests are broken tests — fix, quarantine, or + delete in the same PR. + +## License and CLA + +By contributing, you agree your contributions will be licensed under the +same MIT license as the project (see [LICENSE](LICENSE)). We do not +require a separate CLA. + +## Code of conduct + +Be kind. 
See [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md). + +## Questions + +Open a GitHub discussion or a draft PR. Small questions are welcome — +nobody was born knowing GraphRAG. diff --git a/README.md b/README.md index da75d98..51c0c79 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,60 @@ # docsiq -[![Security Scan](https://github.com/RandomCodeSpace/docsiq/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/RandomCodeSpace/docsiq/actions/workflows/ci.yml) +[![CI](https://github.com/RandomCodeSpace/docsiq/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/RandomCodeSpace/docsiq/actions/workflows/ci.yml) +[![CodeQL](https://github.com/RandomCodeSpace/docsiq/actions/workflows/codeql.yml/badge.svg?branch=main)](https://github.com/RandomCodeSpace/docsiq/actions/workflows/codeql.yml) [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/12628/badge)](https://www.bestpractices.dev/projects/12628) -[![OpenSSF Score](https://api.scorecard.dev/projects/github.com/RandomCodeSpace/docsiq/badge)](https://scorecard.dev/viewer/?uri=github.com/RandomCodeSpace/docsiq) +[![OpenSSF Scorecard](https://api.scorecard.dev/projects/github.com/RandomCodeSpace/docsiq/badge)](https://scorecard.dev/viewer/?uri=github.com/RandomCodeSpace/docsiq) +[![Go Report Card](https://goreportcard.com/badge/github.com/RandomCodeSpace/docsiq)](https://goreportcard.com/report/github.com/RandomCodeSpace/docsiq) +[![License: MIT](https://img.shields.io/github/license/RandomCodeSpace/docsiq)](LICENSE) [![Release](https://img.shields.io/github/v/release/RandomCodeSpace/docsiq?include_prereleases&sort=semver)](https://github.com/RandomCodeSpace/docsiq/releases) [![Go Version](https://img.shields.io/github/go-mod/go-version/RandomCodeSpace/docsiq)](https://github.com/RandomCodeSpace/docsiq/blob/main/go.mod) -docsiq is a GraphRAG-powered knowledge base that runs as a single Go binary. 
-It ingests unstructured documents, builds a knowledge graph with -community detection, persists wikilinked markdown notes, and exposes the -whole thing over **MCP + an embedded React SPA** on one port. +**A single-binary GraphRAG knowledge base — index documents, extract an +entity graph, ask questions across it, and browse the result in an +embedded React UI over MCP.** + +## Three-minute onboarding + +```bash +# 1. Install (Linux amd64 shown; macOS arm64 is published alongside) +VERSION=$(curl -s https://api.github.com/repos/RandomCodeSpace/docsiq/releases/latest | grep tag_name | cut -d '"' -f4) +curl -LO "https://github.com/RandomCodeSpace/docsiq/releases/latest/download/docsiq-${VERSION}-linux-amd64" +chmod +x "docsiq-${VERSION}-linux-amd64" && sudo mv "docsiq-${VERSION}-linux-amd64" /usr/local/bin/docsiq + +# 2. Index the sample corpus +git clone https://github.com/RandomCodeSpace/docsiq && cd docsiq +docsiq init && docsiq index docs/samples/ + +# 3. Ask a question +docsiq search "What are the main themes in this corpus?" +``` + +For a UI session: + +```bash +docsiq serve +# → http://localhost:8080 +``` + +Full walk-through with expected output: [docs/quickstart.md](docs/quickstart.md). + +## Screenshots + +| Home | Graph | +|---|---| +| ![Home view](docs/screenshots/home.png) | ![Graph view](docs/screenshots/graph.png) | + +More: [Notes](docs/screenshots/notes.png) · +[Documents](docs/screenshots/documents.png) · +[MCP Console](docs/screenshots/mcp.png). + +## What it does + +docsiq is a GraphRAG-powered knowledge base that runs as a single Go +binary. It ingests unstructured documents, builds a knowledge graph +with community detection, persists wikilinked markdown notes, and +exposes the whole thing over **MCP + an embedded React SPA** on one +port. Inspired by [Microsoft GraphRAG](https://github.com/microsoft/graphrag); storage is CGO-backed SQLite (`mattn/go-sqlite3` with FTS5) + the @@ -18,60 +63,43 @@ vector search. 
## Features -- **GraphRAG pipeline** — load → chunk → embed → extract entities/ - relationships/claims → detect communities, all in one `docsiq index` run. +- **GraphRAG pipeline** — load → chunk → embed → extract entities / + relationships / claims → detect communities, all in one + `docsiq index` run. - **Notes subsystem** — markdown on disk with `[[wikilinks]]`, project scopes, cross-project references, and a live note graph view. Works without any LLM configured. -- **Interactive graph** — SVG force-directed viz with d3-zoom (pinch/wheel - pan/zoom 0.1×–40×), hover-to-highlight neighbourhood, degree-scaled nodes. -- **Community detection** — pure-Go Louvain, hierarchical, no external deps. +- **Interactive graph** — SVG force-directed viz with d3-zoom + (pinch/wheel pan/zoom 0.1×–40×), hover-to-highlight neighbourhood, + degree-scaled nodes. +- **Community detection** — pure-Go Louvain, hierarchical, no external + deps. - **Three LLM providers** — Azure OpenAI, OpenAI, Ollama — via [`tmc/langchaingo`](https://github.com/tmc/langchaingo). Set `provider: "none"` to run the server in notes-only mode with no LLM. -- **MCP server** — 12+ tools (local/global search, graph walk, community - reports, note read/write, …) exposed at `/mcp` via Streamable HTTP - transport with session handshake. +- **MCP server** — 12+ tools (local/global search, graph walk, + community reports, note read/write, …) exposed at `/mcp` via + Streamable HTTP transport with session handshake. - **Embedded SPA** — React 19 + Tailwind 4 + shadcn/ui, served from `//go:embed ui/dist`. PWA-installable with manifest + service worker. - **Per-repo projects** — each scope has its own SQLite store + notes directory, addressable by slug. 
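The notes subsystem's files are plain markdown on disk, so the `[[wikilinks]]` format is easy to inspect with ordinary shell tools. A small illustrative sketch (not a docsiq command — just showing the link syntax):

```shell
# Write a throwaway note, then list the wikilink targets it contains.
printf 'See [[Roman Aqueducts]] and [[Louvain]].\n' > /tmp/note.md
grep -o '\[\[[^]]*\]\]' /tmp/note.md | tr -d '[]'
# → Roman Aqueducts
# → Louvain
```

This is roughly the raw material of the live note graph view: each `[[target]]` in a note contributes an edge from that note to its target.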
-## Quickstart - -```bash -# Clone, install UI deps, build UI, build Go binary -git clone https://github.com/RandomCodeSpace/docsiq -cd docsiq -npm --prefix ui ci -npm --prefix ui run build -CGO_ENABLED=1 go build -tags sqlite_fts5 -o docsiq ./ - -# Register the current git repo as a project -./docsiq init - -# Index a docs folder -./docsiq index ~/path/to/docs - -# Start the server (UI + API + MCP) -./docsiq serve --port 37778 -# → http://localhost:37778 -``` - ## UI -- **Stack**: React 19, Vite 6, Tailwind 4, shadcn/ui primitives, Geist typography, Lucide icons +- **Stack**: React 19, Vite 6, Tailwind 4, shadcn/ui primitives, Geist + typography, Lucide icons. - **Architecture**: CSS lives in a single `globals.css` with an `@layer components` section; JSX uses semantic class names only; - shadcn primitives are the only place Tailwind utilities live inline -- **Navigation**: labeled sidebar (Home · Notes · Documents · Graph · MCP) - with ⌘K command palette + shadcn primitives are the only place Tailwind utilities live inline. +- **Navigation**: labelled sidebar (Home · Notes · Documents · Graph · + MCP) with ⌘K command palette. - **Responsiveness**: mobile drawer via shadcn `Sheet`; iOS safe-area - respected; inputs forced to 16px below `sm:` to kill Safari auto-zoom + respected; inputs forced to 16px below `sm:` to kill Safari auto-zoom. - **PWA**: manifest + 192/512 PNG icons + minimal service worker, - installable on Android/iOS + installable on Android/iOS. - **Hard reload**: refresh button in the header purges service worker + - CacheStorage and reloads from network — mobile-friendly `⌘⇧R` substitute + CacheStorage and reloads from network — mobile-friendly `⌘⇧R` substitute. ### Keyboard shortcuts @@ -88,10 +116,10 @@ CGO_ENABLED=1 go build -tags sqlite_fts5 -o docsiq ./ ## MCP -docsiq speaks the MCP Streamable HTTP transport at `POST /mcp`. The UI's -MCP Console (inspector-style) gives you the same tool list with typed -argument forms. 
For external clients (Claude Desktop, Cursor, etc.) -register the server URL directly, or use the hooks helper: +docsiq speaks the MCP Streamable HTTP transport at `POST /mcp`. The +UI's MCP Console (inspector-style) gives you the same tool list with +typed argument forms. For external clients (Claude Desktop, Cursor, +etc.) register the server URL directly, or use the hooks helper: ```bash docsiq hooks install --client claude-desktop @@ -124,7 +152,9 @@ ui/ React 19 + Vite 6 SPA, embedded at compile time ## Configuration Config lives at `~/.docsiq/config.yaml`; every key can be overridden by -an env var with prefix `DOCSIQ_` (dots → underscores, uppercased). +an env var with prefix `DOCSIQ_` (dots → underscores, uppercased). A +fully annotated reference with every option, default, and env var is at +[`configs/docsiq.example.yaml`](configs/docsiq.example.yaml). ```yaml server: @@ -134,38 +164,18 @@ server: llm: provider: ollama # azure | openai | ollama | none - azure: - chat_endpoint: https://chat.openai.azure.com - chat_api_key: ${AZURE_OPENAI_KEY} - chat_model: gpt-4o - chat_api_version: "2024-08-01" - embed_endpoint: https://embed.openai.azure.com - embed_api_key: ${AZURE_OPENAI_KEY} - embed_model: text-embedding-3-small - embed_api_version: "2024-08-01" - openai: - api_key: ${OPENAI_API_KEY} - chat_model: gpt-4o - embed_model: text-embedding-3-small ollama: base_url: http://localhost:11434 chat_model: llama3.2 embed_model: nomic-embed-text - -indexing: - workers: 4 - chunk_size: 512 - batch_size: 32 - -default_project: _default ``` -**No LLM?** Set `provider: none`. The server still runs notes, wikilinks, -graph, tree, and notes-search. Endpoints that need the model -(`POST /api/search`, `POST /api/upload`, `/mcp` tool calls that embed or -extract) return `503 {"code": "llm_disabled"}`. +**No LLM?** Set `provider: none`. The server still runs notes, +wikilinks, graph, tree, and notes-search. 
Endpoints that need the +model (`POST /api/search`, `POST /api/upload`, `/mcp` tool calls that +embed or extract) return `503 {"code": "llm_disabled"}`. -## Build +## Build from source ```bash # First time on a connected machine @@ -195,6 +205,16 @@ npm --prefix ui test -- --run --coverage npm --prefix ui run build ``` +## Community + +- **Contributing** — [CONTRIBUTING.md](CONTRIBUTING.md) for local dev + loop, test commands, and commit style. +- **Security** — [SECURITY.md](SECURITY.md) for the disclosure policy + and fix SLA. +- **Code of conduct** — [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md). +- **Governance** — [GOVERNANCE.md](GOVERNANCE.md). +- **Changelog** — [CHANGELOG.md](CHANGELOG.md). + ## License MIT. See [LICENSE](LICENSE). diff --git a/SECURITY.md b/SECURITY.md index 4e34329..6ae4c50 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -1,41 +1,104 @@ # Security Policy -## Reporting a Vulnerability +Thanks for helping keep docsiq and its users safe. This document +describes how to report a security issue, what you can expect from us, +and which versions receive fixes. -Please report security vulnerabilities via GitHub's -[private vulnerability reporting](https://github.com/RandomCodeSpace/docsiq/security/advisories/new). +## Reporting a vulnerability -Do **not** open a public issue for security reports. +**Please do not open a public issue.** Use one of the following private +channels: -We aim to acknowledge reports within 72 hours and provide a remediation -plan within 7 days of triage. +1. **GitHub private security advisory** (preferred) — + . + This is the fastest path; the maintainers are notified directly and + the report stays private until a fix ships. +2. **Encrypted email** — if you cannot use GitHub advisories, email the + maintainers with the subject prefix `[SECURITY] docsiq:`. Contact + details are on the project's GitHub profile. PGP keys available on + request. + +When reporting, please include: + +- A description of the issue and its impact. 
+- Steps to reproduce, ideally with a minimal proof of concept. +- The affected version, commit SHA, and platform. +- Any suggested mitigation or patch you have in mind. + +We will acknowledge your report within **72 hours** and provide a +remediation plan within **7 days of triage**. + +## Disclosure policy + +docsiq follows **coordinated disclosure**. The default embargo window is +**90 days** from the acknowledgement date, during which we will work +with you on a fix, a CVE request (where applicable), and a public +advisory. We are happy to credit you in the advisory — tell us how you +would like to be named. + +If a fix ships before the 90-day window ends, we will publish the +advisory at release time. If we need more time (e.g. upstream dependency +fix required), we will tell you why and propose a revised date. + +Once a fix ships in a release, we publish a +[GitHub Security Advisory](https://github.com/RandomCodeSpace/docsiq/security/advisories) +crediting the reporter unless they request anonymity. + +## Supported versions + +We issue security fixes for: + +- **The latest released tag** on the `main` branch (see + [Releases](https://github.com/RandomCodeSpace/docsiq/releases)). +- **`main` branch HEAD** — security fixes land here first and are + included in the next tagged release. + +docsiq is pre-1.0; older `v0.0.0-beta.N` prereleases are not patched. +Please upgrade to the latest release. + +## Fix SLA + +| Severity | Target fix window | Notes | +|-----------|-------------------|-----------------------------------------------------------------------| +| Critical | 7 days | Remote code execution, auth bypass, data corruption at rest. | +| High | 30 days | Privilege escalation, unauthenticated read of sensitive data. | +| Medium | 90 days | Authenticated flaws with limited blast radius. | +| Low | Best effort | Hardening improvements, defence-in-depth, theoretical issues. | + +These are targets, not guarantees. 
We will tell you up front if we +cannot meet one and why. ## Scope In scope: -- The `docsiq` binary and all Go packages under `internal/` and `cmd/` -- The embedded React SPA in `ui/` -- The MCP server and REST API exposed by `docsiq serve` -- Build, release, and CI workflows under `.github/` +- The `docsiq` binary and all Go packages under `internal/` and `cmd/`. +- The embedded React SPA in `ui/`. +- The MCP server and REST API exposed by `docsiq serve`. +- Build, release, and CI workflows under `.github/`. +- Default configuration as shipped. +- Vulnerabilities in our direct dependencies that are reachable through + docsiq. Out of scope: - Third-party LLM providers (Azure OpenAI, OpenAI, Ollama) — report - upstream + upstream. +- Upstream vulnerabilities in transitive dependencies that are not + reachable from docsiq. Please report those to the upstream project; + we will track and upgrade when a patched version ships. +- Misconfigurations introduced by a downstream user (e.g. binding a + public port with no API key set). - Vulnerabilities that require a compromised local shell or filesystem - access - -## Supported Versions + access. +- Denial of service via resource exhaustion on a self-hosted instance + the attacker already has network access to. -docsiq is pre-1.0. Only the latest `v0.0.0-beta.N` prerelease receives -security patches. +## Safe harbor -## Disclosure - -We follow coordinated disclosure. Once a fix ships in a release, we -publish a [GitHub Security Advisory](https://github.com/RandomCodeSpace/docsiq/security/advisories) -crediting the reporter unless they request anonymity. +We will not pursue legal action against researchers who act in good +faith, follow this policy, stay within scope, avoid privacy violations, +and do not degrade service for other users. If in doubt, ask first. 
## Report archive diff --git a/config.example.yaml b/config.example.yaml index ef758a7..2309a52 100644 --- a/config.example.yaml +++ b/config.example.yaml @@ -1,44 +1,3 @@ -data_dir: ~/.docsiq/data # per-project DBs live at $data_dir/projects//docsiq.db - -llm: - provider: ollama # azure | ollama - - azure: - # Shared defaults — used when chat/embed-specific values are not set. - endpoint: https://myresource.openai.azure.com - api_key: ${AZURE_OPENAI_API_KEY} - api_version: "2024-08-01" - - chat: - model: gpt-4o - # endpoint: ... # optional: override shared endpoint - # api_key: ... # optional: override shared key - # api_version: ... # optional: override shared version - - embed: - model: text-embedding-3-small - # endpoint: ... # optional: separate deployment for embeddings - # api_key: ... # optional: separate key for embeddings - # api_version: ... # optional: separate version for embeddings - - ollama: - base_url: http://localhost:11434 - chat_model: llama3.2 - embed_model: nomic-embed-text - -indexing: - chunk_size: 512 - chunk_overlap: 50 - batch_size: 20 - workers: 4 - extract_graph: true - extract_claims: true - max_gleanings: 1 # gleaning passes for entity extraction (0=single pass) - -community: - min_community_size: 2 - max_levels: 3 - -server: - host: 127.0.0.1 - port: 8080 +# This file has moved to configs/docsiq.example.yaml. +# See that file for a fully annotated reference of every option, +# default value, and environment-variable override. diff --git a/configs/docsiq.example.yaml b/configs/docsiq.example.yaml new file mode 100644 index 0000000..70d2c67 --- /dev/null +++ b/configs/docsiq.example.yaml @@ -0,0 +1,182 @@ +# docsiq example configuration +# +# Copy this file to one of the locations docsiq checks on startup: +# - ~/.docsiq/config.yaml (global, per-user) +# - ./config.yaml (current working directory) +# +# Every key listed here can be overridden by an environment variable. 
+# The rule is: prefix with DOCSIQ_, replace dots with underscores, +# uppercase everything. For example: +# server.port → DOCSIQ_SERVER_PORT +# llm.azure.chat.endpoint → DOCSIQ_LLM_AZURE_CHAT_ENDPOINT +# indexing.workers → DOCSIQ_INDEXING_WORKERS +# +# Two convenience aliases exist: +# DOCSIQ_API_KEY → server.api_key +# DOCSIQ_SERVER_API_KEY → server.api_key +# +# Fields are documented with: (default), purpose, env var. + +# --------------------------------------------------------------------------- +# Storage +# --------------------------------------------------------------------------- + +# Root directory for the per-project SQLite stores and notes. +# Default: ~/.docsiq/data +# Env: DOCSIQ_DATA_DIR +data_dir: ~/.docsiq/data + +# The project slug used when a request does not specify ?project= or +# X-Project. Must match a slug registered via `docsiq projects register`. +# Default: _default +# Env: DOCSIQ_DEFAULT_PROJECT +default_project: _default + +# --------------------------------------------------------------------------- +# LLM provider +# --------------------------------------------------------------------------- + +llm: + # One of: azure | openai | ollama | none + # "none" disables all LLM-backed endpoints; notes/graph still work. + # Default: ollama + # Env: DOCSIQ_LLM_PROVIDER + provider: ollama + + # Azure OpenAI. Shared endpoint/api_key/api_version apply unless the + # chat.* or embed.* sub-blocks override them. + azure: + # Shared defaults — leave empty if you configure chat/embed separately. 
+ endpoint: "" # Env: DOCSIQ_LLM_AZURE_ENDPOINT + api_key: "" # Env: DOCSIQ_LLM_AZURE_API_KEY + api_version: "2024-08-01" # Env: DOCSIQ_LLM_AZURE_API_VERSION + + chat: + endpoint: "" # Env: DOCSIQ_LLM_AZURE_CHAT_ENDPOINT + api_key: "" # Env: DOCSIQ_LLM_AZURE_CHAT_API_KEY + api_version: "" # Env: DOCSIQ_LLM_AZURE_CHAT_API_VERSION + model: gpt-4o # Env: DOCSIQ_LLM_AZURE_CHAT_MODEL + + embed: + endpoint: "" # Env: DOCSIQ_LLM_AZURE_EMBED_ENDPOINT + api_key: "" # Env: DOCSIQ_LLM_AZURE_EMBED_API_KEY + api_version: "" # Env: DOCSIQ_LLM_AZURE_EMBED_API_VERSION + model: text-embedding-3-small + # Env: DOCSIQ_LLM_AZURE_EMBED_MODEL + + # Direct OpenAI (api.openai.com), distinct from Azure OpenAI. + openai: + # Required when provider is "openai". Use env, not the YAML. + api_key: "" # Env: DOCSIQ_LLM_OPENAI_API_KEY + + # For custom proxies or gateways; leave as default for api.openai.com. + base_url: https://api.openai.com/v1 # Env: DOCSIQ_LLM_OPENAI_BASE_URL + + # Chat completion model. + chat_model: gpt-4o-mini # Env: DOCSIQ_LLM_OPENAI_CHAT_MODEL + + # Embedding model for vector search. + embed_model: text-embedding-3-small # Env: DOCSIQ_LLM_OPENAI_EMBED_MODEL + + # Optional OpenAI-Organization header for billing routing. + organization: "" # Env: DOCSIQ_LLM_OPENAI_ORGANIZATION + + # Ollama (self-hosted). Default when nothing else is configured. 
+ ollama: + base_url: http://localhost:11434 # Env: DOCSIQ_LLM_OLLAMA_BASE_URL + chat_model: llama3.2 # Env: DOCSIQ_LLM_OLLAMA_CHAT_MODEL + embed_model: nomic-embed-text # Env: DOCSIQ_LLM_OLLAMA_EMBED_MODEL + +# --------------------------------------------------------------------------- +# Per-project LLM overrides (YAML only — env vars cannot nest like this) +# --------------------------------------------------------------------------- +# llm_overrides: +# my-project-slug: +# provider: openai +# openai: +# chat_model: gpt-4o +# another-slug: +# provider: ollama +# ollama: +# chat_model: mistral + +# --------------------------------------------------------------------------- +# Indexing pipeline +# --------------------------------------------------------------------------- + +indexing: + # Target size (in characters, not tokens) per chunk. + # Default: 512. Env: DOCSIQ_INDEXING_CHUNK_SIZE + chunk_size: 512 + + # Overlap between successive chunks, in characters. + # Default: 50. Env: DOCSIQ_INDEXING_CHUNK_OVERLAP + chunk_overlap: 50 + + # How many chunks to embed per LLM batch request. + # Default: 20. Env: DOCSIQ_INDEXING_BATCH_SIZE + batch_size: 20 + + # Parallel workers for document-level operations (load → chunk → embed). + # Default: 4. Env: DOCSIQ_INDEXING_WORKERS + workers: 4 + + # Run entity + relationship extraction during indexing. + # Set false to build vector-only index (faster, no graph queries). + # Default: true. Env: DOCSIQ_INDEXING_EXTRACT_GRAPH + extract_graph: true + + # Extract covariates / claims associated with entity mentions. + # Default: true. Env: DOCSIQ_INDEXING_EXTRACT_CLAIMS + extract_claims: true + + # Number of "continue extracting" passes over the same chunk. Higher + # values recover more entities at the cost of LLM tokens. + # Default: 1. 
Env: DOCSIQ_INDEXING_MAX_GLEANINGS
+  max_gleanings: 1
+
+# ---------------------------------------------------------------------------
+# Community detection
+# ---------------------------------------------------------------------------
+
+community:
+  # Minimum number of nodes in a detected community before it's reported.
+  # Default: 2. Env: DOCSIQ_COMMUNITY_MIN_COMMUNITY_SIZE
+  min_community_size: 2
+
+  # Depth of hierarchical Louvain clustering. Higher → more layers of
+  # nested communities, at the cost of longer indexing time.
+  # Default: 3. Env: DOCSIQ_COMMUNITY_MAX_LEVELS
+  max_levels: 3
+
+# ---------------------------------------------------------------------------
+# HTTP server (API + UI + MCP)
+# ---------------------------------------------------------------------------
+
+server:
+  # Bind address. Use 0.0.0.0 for LAN access. 127.0.0.1 binds loopback.
+  # Default: 127.0.0.1. Env: DOCSIQ_SERVER_HOST
+  host: 127.0.0.1
+
+  # Listen port.
+  # Default: 8080. Env: DOCSIQ_SERVER_PORT
+  port: 8080
+
+  # If set, every API + MCP request must carry
+  # "Authorization: Bearer <api_key>". Leave empty to disable.
+  # Default: "". Env: DOCSIQ_SERVER_API_KEY (alias: DOCSIQ_API_KEY)
+  api_key: ""
+
+  # Maximum upload size in bytes for POST /api/upload. 0 or negative
+  # disables the cap (not recommended).
+  # Default: 104857600 (100 MiB). Env: DOCSIQ_SERVER_MAX_UPLOAD_BYTES
+  max_upload_bytes: 104857600
+
+  # Number of background workers servicing the indexing work queue.
+  # 0 → runtime.NumCPU().
+  # Default: 0. Env: DOCSIQ_SERVER_WORKQ_WORKERS
+  workq_workers: 0
+
+  # Maximum queued jobs before /api/upload starts returning 429.
+  # Default: 64. Env: DOCSIQ_SERVER_WORKQ_DEPTH
+  workq_depth: 64
diff --git a/docs/quickstart.md b/docs/quickstart.md
new file mode 100644
index 0000000..678a248
--- /dev/null
+++ b/docs/quickstart.md
@@ -0,0 +1,130 @@
+# Quickstart
+
+Go from zero to a queryable knowledge graph in under three minutes.
+
+## What you'll do
+
+1. 
Install the `docsiq` binary. +2. Register the current directory as a docsiq project. +3. Index a small sample corpus of three markdown documents. +4. Ask a question. +5. Open the UI and see the graph. + +The sample corpus lives at [`docs/samples/`](samples/); it's three +short markdown files about Roman aqueducts, GraphRAG, and Louvain +community detection. Small enough to index in ~30 seconds, dense enough +to produce interesting entities and a multi-community graph. + +## 1. Install + +Download the latest release for your platform. Replace +`docsiq-linux-amd64` with the asset name matching your OS if needed +(macOS arm64, Windows amd64 assets are published alongside). + +```bash +curl -LO https://github.com/RandomCodeSpace/docsiq/releases/latest/download/docsiq-linux-amd64 +chmod +x docsiq-linux-amd64 +mv docsiq-linux-amd64 ~/.local/bin/docsiq # or any directory on your PATH +``` + +Verify: + +```bash +docsiq version +``` + +Building from source is also supported and takes about a minute +end-to-end; see [CONTRIBUTING.md](../CONTRIBUTING.md) for the build +instructions. + +## 2. Register a project + +```bash +cd ~/path/to/any/directory # or stay in the docsiq repo for the demo +docsiq init +``` + +`docsiq init` registers the current directory as a project and creates a +scope-specific SQLite store at `~/.docsiq/data/projects//`. If +you're in a git repo, the slug is derived from the repo's remote origin; +otherwise you'll be prompted for a name. + +## 3. 
Index the sample corpus + +From the repository root (so that `docs/samples/` resolves): + +```bash +docsiq index docs/samples/ +``` + +You will see log lines for each phase: + +``` +⚙️ loaded config file path=/home/you/.docsiq/config.yaml +📄 loading documents count=3 +🧩 chunking chunks=12 +🌐 embedding batches=1 +🔗 extracting entities entities=18 relationships=24 +🧩 detecting communities levels=3 communities=5 +✅ index complete duration=21.4s +``` + +If you are running without an LLM configured +(`DOCSIQ_LLM_PROVIDER=none` or `llm.provider: none` in the config), +entity extraction and embedding steps are skipped; you'll still get a +keyword-searchable corpus and a notes graph. + +## 4. Ask a question + +```bash +docsiq search "Who built the first Roman aqueduct?" +``` + +Expected (with an LLM configured): + +``` +Answer: Appius Claudius Caecus built the first Roman aqueduct, the +Aqua Appia, in 312 BCE in his role as censor. + +Sources: + roman-aqueducts.md (chunk 0) +``` + +For a corpus-scale question, try: + +```bash +docsiq search "What are the main themes in this corpus?" +``` + +This triggers the global search path, which consults community +summaries rather than individual chunks. + +## 5. Open the UI + +```bash +docsiq serve +# → http://localhost:8080 +``` + +Navigate to `http://localhost:8080`. You should see: + +- **Home** — project picker, recent indexing activity. +- **Notes** — wikilinked markdown, even without any LLM configured. +- **Documents** — the three sample files with chunk counts. +- **Graph** — force-directed entity/community visualisation. +- **MCP** — inspector-style console for the 12+ MCP tools docsiq + exposes at `/mcp`. + +Screenshots of each view are in [`docs/screenshots/`](screenshots/). + +## Where to next + +- **Configure an LLM** — see [`configs/docsiq.example.yaml`](../configs/docsiq.example.yaml) + for every option, default, and env-var override. 
+- **Integrate with Claude Desktop / Cursor** — run + `docsiq hooks install --client claude-desktop`. +- **Index a real corpus** — `docsiq index /path/to/your/docs` accepts + PDF, DOCX, TXT, and Markdown. Web pages can be fetched with + `docsiq crawl `. +- **Read the architecture overview** — [README.md](../README.md#architecture). +- **Contribute** — [CONTRIBUTING.md](../CONTRIBUTING.md). diff --git a/docs/samples/README.md b/docs/samples/README.md new file mode 100644 index 0000000..4ff7d7a --- /dev/null +++ b/docs/samples/README.md @@ -0,0 +1,12 @@ +# Sample corpus + +These tiny markdown documents exist to: + +1. Back the [quickstart](../quickstart.md) — a user can index this + directory in under 30 seconds and ask a meaningful question. +2. Populate the screenshots in [../screenshots/](../screenshots/) with + realistic but non-proprietary data. + +The corpus is intentionally tiny. For a real workload, point `docsiq +index` at a folder of your actual documents (PDF, DOCX, TXT, MD, or the +output of `docsiq crawl`). diff --git a/docs/samples/graphrag.md b/docs/samples/graphrag.md new file mode 100644 index 0000000..580c808 --- /dev/null +++ b/docs/samples/graphrag.md @@ -0,0 +1,19 @@ +# GraphRAG + +GraphRAG is a retrieval-augmented generation (RAG) technique introduced +by Microsoft Research in 2024 that augments vector search with an +explicit knowledge graph. Instead of retrieving only the top-k chunks +semantically similar to a query, GraphRAG extracts entities, relations, +and claims from each chunk, builds a graph, runs Louvain community +detection on it, and then serves queries against either local (entity +neighbourhood) or global (community summary) views. + +The key claim is that graph-derived structure recovers global context +that pure vector search cannot: "who are the main actors in this +corpus", "what are the dominant themes", questions that require a view +of the whole rather than a handful of passages. 
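
The extract → graph → communities flow described above can be sketched in a few lines of Go. This is a toy illustration, not docsiq's implementation: the `edge` and `communities` names are invented here, and union-find connected components stand in for the Louvain step.

```go
package main

import "fmt"

// Toy GraphRAG indexing sketch: relationships extracted from chunks
// form a graph; communities are then found on that graph. Union-find
// connected components stand in for the Louvain step.

type edge struct{ a, b string }

// find returns the root of x's set, compressing paths as it walks.
func find(parent map[string]string, x string) string {
	for parent[x] != x {
		parent[x] = parent[parent[x]]
		x = parent[x]
	}
	return x
}

// communities unions every edge's endpoints and counts distinct roots.
func communities(edges []edge) int {
	parent := map[string]string{}
	for _, e := range edges {
		for _, n := range []string{e.a, e.b} {
			if _, ok := parent[n]; !ok {
				parent[n] = n
			}
		}
		if ra, rb := find(parent, e.a), find(parent, e.b); ra != rb {
			parent[ra] = rb
		}
	}
	roots := map[string]bool{}
	for n := range parent {
		roots[find(parent, n)] = true
	}
	return len(roots)
}

func main() {
	// Relationships "extracted" from two unrelated sample chunks.
	edges := []edge{
		{"Aqua Appia", "Appius Claudius Caecus"},
		{"Aqua Appia", "Rome"},
		{"Louvain", "modularity"},
	}
	fmt.Println(communities(edges)) // prints 2: two disjoint neighbourhoods
}
```

A real pipeline replaces the stand-in with hierarchical, modularity-optimising Louvain, which can split a single connected component into many nested communities.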
+
+docsiq is a Go implementation of this technique, shipping as a single
+binary with an embedded React UI. It supports Azure OpenAI, OpenAI, and
+Ollama as LLM providers, storing everything in SQLite with FTS5 and the
+sqlite-vec extension for ANN vector search.
diff --git a/docs/samples/louvain.md b/docs/samples/louvain.md
new file mode 100644
index 0000000..3561882
--- /dev/null
+++ b/docs/samples/louvain.md
@@ -0,0 +1,17 @@
+# Louvain community detection
+
+Louvain is a greedy algorithm for community detection in large networks,
+published by Blondel, Guillaume, Lambiotte, and Lefebvre in 2008. It
+optimises modularity — a scalar that rewards dense within-community
+links and sparse between-community links — through two alternating
+phases: local move (every node is considered for reassignment to the
+community that most increases modularity) and aggregation (each detected
+community becomes a super-node in the next iteration).
+
+The algorithm terminates when no move increases modularity. Runtime is
+roughly linear in the number of edges, which is why Louvain scales to
+graphs of tens of millions of nodes where spectral methods do not.
+
+GraphRAG (see graphrag.md) uses Louvain to partition its entity graph
+into nested communities. Each community gets an LLM-generated summary,
+which is what the "global" search mode retrieves.
diff --git a/docs/samples/roman-aqueducts.md b/docs/samples/roman-aqueducts.md
new file mode 100644
index 0000000..544f331
--- /dev/null
+++ b/docs/samples/roman-aqueducts.md
@@ -0,0 +1,19 @@
+# Roman Aqueducts
+
+Roman aqueducts were a network of structures that supplied water to
+cities across the Roman Empire. The first, the Aqua Appia, was
+constructed in 312 BCE under the censor Appius Claudius Caecus. By the
+first century CE, eleven major aqueducts served Rome, delivering an
+estimated one million cubic metres of water per day.
+
+Gravity, not pumping, moved the water. Engineers held the gradient
+between 1:200 and 1:500 over distances exceeding a hundred kilometres,
+using arcades to bridge valleys and inverted siphons where terrain
+required pressure flow.
+
+The Pont du Gard in modern-day France is the most famous surviving
+example; the Aqua Claudia and Anio Novus, both completed in 52 CE under
+the emperor Claudius, are the longest. Maintenance was the
+responsibility of the curator aquarum — the water commissioner — an
+office held by Sextus Julius Frontinus, whose treatise *De Aquaeductu*
+remains the principal source on the subject.
diff --git a/docs/screenshots/documents.png b/docs/screenshots/documents.png
new file mode 100644
index 0000000..2332a8e
Binary files /dev/null and b/docs/screenshots/documents.png differ
diff --git a/docs/screenshots/graph.png b/docs/screenshots/graph.png
new file mode 100644
index 0000000..38c19e9
Binary files /dev/null and b/docs/screenshots/graph.png differ
diff --git a/docs/screenshots/home.png b/docs/screenshots/home.png
new file mode 100644
index 0000000..585e78e
Binary files /dev/null and b/docs/screenshots/home.png differ
diff --git a/docs/screenshots/mcp.png b/docs/screenshots/mcp.png
new file mode 100644
index 0000000..f76bb9e
Binary files /dev/null and b/docs/screenshots/mcp.png differ
diff --git a/docs/screenshots/notes.png b/docs/screenshots/notes.png
new file mode 100644
index 0000000..290844c
Binary files /dev/null and b/docs/screenshots/notes.png differ
diff --git a/ui/e2e/screenshots.spec.ts b/ui/e2e/screenshots.spec.ts
new file mode 100644
index 0000000..16d92a5
--- /dev/null
+++ b/ui/e2e/screenshots.spec.ts
@@ -0,0 +1,86 @@
+// Capture the five canonical docsiq screenshots used in the README and
+// docs/screenshots/. Runs against a live server on $BASE_URL (default
+// http://localhost:37778) seeded with the docs/samples/ fixture corpus.
+//
+// Usage:
+//   DOCSIQ_DATA_DIR=/tmp/fixture ./docsiq serve --port 37778 &
+//   BASE_URL=http://localhost:37778 \
+//     npx playwright test ui/e2e/screenshots.spec.ts
+
+import { test, type Page } from "@playwright/test";
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+
+const BASE_URL = process.env.BASE_URL ?? "http://localhost:37778";
+const __filename = fileURLToPath(import.meta.url);
+const __dirname = path.dirname(__filename);
+const OUT_DIR = path.resolve(__dirname, "..", "..", "docs", "screenshots");
+
+// Desktop viewport at @2x. Matches the typical reviewer's Retina display;
+// the optimise script re-encodes the captures without downscaling.
+test.use({
+  viewport: { width: 1440, height: 900 },
+  deviceScaleFactor: 2,
+  colorScheme: "dark", // dark theme is the default in the app
+});
+
+async function settle(page: Page) {
+  // Wait for network idle + a small buffer for any d3 transitions.
+  await page.waitForLoadState("networkidle");
+  await page.waitForTimeout(500);
+}
+
+test("home", async ({ page }) => {
+  await page.goto(`${BASE_URL}/`);
+  await settle(page);
+  await page.screenshot({
+    path: path.join(OUT_DIR, "home.png"),
+    fullPage: true,
+  });
+});
+
+test("notes", async ({ page }) => {
+  await page.goto(`${BASE_URL}/notes`);
+  await settle(page);
+  await page.screenshot({
+    path: path.join(OUT_DIR, "notes.png"),
+    fullPage: true,
+  });
+});
+
+test("documents", async ({ page }) => {
+  await page.goto(`${BASE_URL}/docs`);
+  await settle(page);
+  await page.screenshot({
+    path: path.join(OUT_DIR, "documents.png"),
+    fullPage: true,
+  });
+});
+
+test("graph", async ({ page }) => {
+  await page.goto(`${BASE_URL}/graph`);
+  await settle(page);
+  // Graph has an SVG force simulation — give it a bit longer to settle.
+  await page.waitForTimeout(2000);
+  await page.screenshot({
+    path: path.join(OUT_DIR, "graph.png"),
+    fullPage: true,
+  });
+});
+
+test("mcp", async ({ page }) => {
+  // /mcp is claimed by the server (MCP over HTTP). Load the SPA shell first,
+  // then navigate via the client-side router so we render the MCP Console UI
+  // instead of hitting the server's 503/JSON handler directly.
+  await page.goto(`${BASE_URL}/`);
+  await settle(page);
+  await page.evaluate(() => {
+    window.history.pushState({}, "", "/mcp");
+    window.dispatchEvent(new PopStateEvent("popstate"));
+  });
+  await settle(page);
+  await page.screenshot({
+    path: path.join(OUT_DIR, "mcp.png"),
+    fullPage: true,
+  });
+});
diff --git a/ui/scripts/optimize-screenshots.mjs b/ui/scripts/optimize-screenshots.mjs
new file mode 100644
index 0000000..33066ec
--- /dev/null
+++ b/ui/scripts/optimize-screenshots.mjs
@@ -0,0 +1,46 @@
+// Optimise docs/screenshots/*.png in place via sharp.
+//
+// Targets:
+//   - Keep pixel dimensions (don't downscale — we use @2x captures).
+//   - Re-encode PNG with compression level 9 and palette where possible.
+//   - Refuse to exit 0 if any resulting file is > 500 KB — that's our
+//     published budget per screenshot.
+//
+// Usage: node ui/scripts/optimize-screenshots.mjs
+
+import { readdir, stat, writeFile } from "node:fs/promises";
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+import sharp from "sharp";
+
+const DIR = path.resolve(
+  path.dirname(fileURLToPath(import.meta.url)),
+  "..",
+  "..",
+  "docs",
+  "screenshots",
+);
+const BUDGET_BYTES = 500 * 1024;
+
+const entries = (await readdir(DIR)).filter((f) => f.endsWith(".png"));
+if (entries.length === 0) {
+  console.error("no PNGs found in", DIR);
+  process.exit(1);
+}
+
+let failed = false;
+for (const name of entries) {
+  const p = path.join(DIR, name);
+  const before = (await stat(p)).size;
+  const buf = await sharp(p)
+    .png({ compressionLevel: 9, palette: true, effort: 10 })
+    .toBuffer();
+  // Write the optimised bytes directly; round-tripping the buffer
+  // through sharp again would re-encode with default PNG options and
+  // discard the settings above.
+  await writeFile(p, buf);
+  const after = (await stat(p)).size;
+  const flag = after > BUDGET_BYTES ? "OVER BUDGET" : "ok";
+  console.log(
+    `${name.padEnd(18)} ${Math.round(before / 1024)} KB → ${Math.round(after / 1024)} KB [${flag}]`,
+  );
+  if (after > BUDGET_BYTES) failed = true;
+}
+
+process.exit(failed ? 2 : 0);