feat: data-parity skill — TypeScript orchestrator, ClickHouse driver, partition support by suryaiyer95 · Pull Request #493 · AltimateAI/altimate-code

suryaiyer95 · 2026-03-27T00:29:44Z

Summary

Adds the full data-parity diffing pipeline for the `data_diff` LLM tool: a TypeScript orchestrator that bridges the Rust `DataParitySession` state machine with all database drivers, plus a comprehensive SKILL.md that guides the model through algorithm selection.

Core: TypeScript orchestrator (`data-diff.ts`, +799 lines)

Cooperative state machine bridge — TypeScript executes SQL tasks generated by the Rust engine and feeds results back. Neither layer touches the other's domain.

`runDataDiff` — top-level entry point; dispatches to single-table or partitioned diff
`runPartitionedDiff` — runs independent diffs per partition with `mergeOutcomes`
`resolveTableSources` + `injectCte` — compares arbitrary SQL queries by wrapping them as CTEs
`discoverExtraColumns` — auto-discovers non-key columns with two-layer audit exclusion (name patterns + schema-level DEFAULT detection across 9 dialects)
`buildColumnDiscoverySQL` — per-dialect INFORMATION_SCHEMA query (Postgres, Snowflake, BigQuery, MySQL, Redshift, SQLite, DuckDB, Trino/Presto, fallback)
`dateTruncExpr` — dialect-aware date truncation for date-partition modes (BigQuery, MySQL/MariaDB, Oracle, Spark/Databricks, default `DATE_TRUNC`)
`buildPartitionDiscoverySQL` / `buildPartitionWhereClause` — date, numeric, and categorical partition support

New: ClickHouse driver (`clickhouse.ts`, +135 lines)

Pure HTTP driver using `JSONCompactEachRowWithNamesAndTypes` — no external package required. Handles DDL (empty response) vs SELECT (typed rows) branches correctly.

`ExecuteOptions` — all 10+ drivers updated

New `noLimit?: boolean` flag bypasses the default 1,000-row cap for data-diff pipelines. Applied consistently across Postgres, Snowflake, BigQuery, Redshift, MySQL, SQLite, DuckDB, Trino, Databricks, ClickHouse.

SKILL.md (+341 lines)

Comprehensive algorithm-selection guide:

CRITICAL: `joindiff` cannot see a second table across warehouses — always use `hashdiff` or `auto` for cross-DB comparisons
Decision tree: same-warehouse vs cross-warehouse, table vs query comparison
Output format: facts only (counts, values, row samples) — no editorializing
Full parameter reference for `DataDiffParams`

Fixes applied (from code review)

Security — `buildColumnDiscoverySQL` now escapes single quotes in all interpolated table name parts (`esc()`) — prevents SQL injection via crafted source/target table names
Bug — `dateTruncExpr` Oracle case added: `TRUNC(col, 'MONTH')` — Oracle does not have `DATE_TRUNC`

Test plan

Cross-DB comparison (pg_source vs pg_target) — agent uses `hashdiff` automatically
TPC-H migration validation — output is clean fact-reporting
SQL query comparison (same-warehouse) — `joindiff` used correctly
TypeScript typecheck — passing

Summary by CodeRabbit

New Features
- New Data Parity tool for table vs table and SQL vs SQL comparisons with profiling, column/row diffs, and human-readable results.
- Partitioned diff execution to handle large datasets with per-partition summaries.
- Automatic detection/exclusion of audit/timestamp columns during comparisons.
Documentation
- Added a comprehensive data parity skill guide and updated prompts to invoke the feature.
Other
- Query execution now supports a “no-limit” mode to run untruncated comparisons.

coderabbitai · 2026-03-27T00:29:52Z

📝 Walkthrough

Walkthrough

Adds an end-to-end “data parity” feature: new skill docs and prompts, a DataDiffTool, a dispatcher handler that runs a Rust DataParity state machine via a new orchestration module, driver updates to support no-limit execution, types for data-diff contracts, and test mock adjustments.

Changes

Cohort / File(s)	Summary
Drivers (execute options) `packages/drivers/src/bigquery.ts`, `packages/drivers/src/clickhouse.ts`, `packages/drivers/src/databricks.ts`, `packages/drivers/src/duckdb.ts`, `packages/drivers/src/mysql.ts`, `packages/drivers/src/oracle.ts`, `packages/drivers/src/postgres.ts`, `packages/drivers/src/redshift.ts`, `packages/drivers/src/snowflake.ts`, `packages/drivers/src/sqlite.ts`, `packages/drivers/src/sqlserver.ts`	Added optional `options?: ExecuteOptions` to `Connector.execute`; introduced `noLimit` handling (effectiveLimit = 0 when set) and adjusted LIMIT injection/truncation logic to only apply when effectiveLimit > 0.
Driver types `packages/drivers/src/types.ts`	Added `ExecuteOptions` interface (`noLimit?: boolean`) and updated `Connector.execute` signature to accept `options?: ExecuteOptions`.
Data-diff orchestration `packages/opencode/src/altimate/native/connections/data-diff.ts`	New orchestration module that runs the DataParity Rust engine: resolves query-vs-table sources (CTE rewriting), injects CTEs into emitted SQL, executes SQL with no-limit, normalizes results, auto-discovers schema/auto-audit columns (dialect-aware), supports partitioned diffs (partition discovery, per-partition recursive diffing, merge outcomes), and returns structured results. Exposes `resolveTableSources`, `injectCte`, and `runDataDiff`.
Dispatcher registration `packages/opencode/src/altimate/native/connections/register.ts`	Registered dispatcher handler for `"data.diff"` delegating to `runDataDiff`.
Bridge types / contracts `packages/opencode/src/altimate/native/types.ts`	Added `DataDiffParams`, `PartitionDiffResult`, `DataDiffResult`, and extended `BridgeMethods` with `"data.diff"`.
Tooling & UI `packages/opencode/src/altimate/tools/data-diff.ts`, `packages/opencode/src/tool/registry.ts`	Added `DataDiffTool` that calls dispatcher, formats human-readable summaries (including partition breakdowns and auto-excluded audit columns); registered tool in `ToolRegistry`.
Skill docs & prompts `.opencode/skills/data-parity/SKILL.md`, `packages/opencode/src/altimate/prompts/builder.txt`	Added comprehensive data-parity skill doc (workflow, profiling-before-diff, partition guidance, algorithm rules, reporting requirements) and prompt builder entries to invoke `/data-parity`.
Tests `packages/opencode/test/altimate/simulation-suite.test.ts`	Adjusted test mocks by flattening previously nested `data` wrappers and adding `source_dialect`/`target_dialect` fields in sql-translate mock.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Client
    participant Tool as DataDiffTool
    participant Dispatcher as Dispatcher
    participant Orch as DataDiff Orchestration
    participant Engine as Rust DataParity Engine
    participant DB as Warehouse Connections

    User->>Tool: invoke data_diff(params)
    Tool->>Dispatcher: call("data.diff", params)
    Dispatcher->>Orch: runDataDiff(params)
    Orch->>Orch: resolveTableSources() / build CTEs
    Orch->>DB: fetch schema & partition values
    DB-->>Orch: column metadata / partition list
    Orch->>Engine: create DataParitySession
    loop Engine emits actions
      Engine-->>Orch: ExecuteSql(sql)
      Orch->>Orch: injectCte(sql)
      Orch->>DB: execute SQL (noLimit=true)
      DB-->>Orch: rows (normalized)
      Orch->>Engine: push results
    end
    alt partitioned
      Orch->>Orch: iterate partitions -> runDataDiff(partitioned_params)
      Orch->>Orch: mergeOutcomes(per_partition_results)
    end
    Engine-->>Orch: Done / Outcome
    Orch-->>Dispatcher: DataDiffResult
    Dispatcher-->>Tool: result
    Tool-->>User: formatted summary

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fix: address 7 P1 findings from v0.5.16 release evaluation #591: Overlapping edits to ClickHouse driver execute signature and LIMIT/truncation handling.
feat: add ClickHouse warehouse driver #574: Related ClickHouse initialization/execute logic changes impacting the same driver file.
fix: defensive null guards in tool formatters and DuckDB concurrent access retry #571: Related changes touching the DuckDB driver execute behavior.

Suggested reviewers

mdesmet

Poem

🐰 I hopped through CTEs, partitions in tow,
Found matching rows where differences grow.
No limits to fetch, no truncation to fear,
Keys confirmed, profiles run—results clear.
A rabbit's diff danced—data parity, cheer!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 52.94% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main changes: adding a data-parity skill with a TypeScript orchestrator, ClickHouse driver support, and partition support.
Description check	✅ Passed	The description includes all required sections (Summary, Test Plan, Checklist) with comprehensive detail. Summary explains core changes and fixes; Test Plan lists validation approaches; Checklist items are marked as completed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/data-parity-skill-improvements

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

suryaiyer95 · 2026-03-27T00:32:58Z

Closing — .opencode/ skill config and model defaults should not live in the open source repo.

aidtya · 2026-03-28T08:18:07Z

Companion fix: column name collision in `join_diff_sql`

The joindiff algorithm produces incorrect results for compound keys due to a column name collision between the Rust SQL generator and the node-postgres driver.

Root cause: join_diff_sql emits unaliased COALESCE and CAST expressions. PostgreSQL auto-assigns duplicate column names (e.g., two columns both named coalesce). node-postgres converts rows to JS objects keyed by column name — duplicates silently overwrite earlier values, so the engine receives corrupted key data.

Example: Comparing rakuten.warehouse_metadata vs rakutenvthree.warehouse_metadata with key_columns: ["rk", "instance_id"], the rk column was lost (overwritten by instance_id), collapsing 460 distinct keys into 2 groups. Tool reported only_in_source = 1 instead of the correct 307.

Fix: https://github.com/AltimateAI/altimate-core-internal/pull/114 — every SELECT expression in join_diff_sql now gets a unique alias (_k0, _k1, _v0_l, _v0_r, …). No changes needed in this PR's orchestrator code — it already uses positional indexing.

To pick up the fix: bump @altimateai/altimate-core in packages/opencode/package.json after the altimate-core-internal binary is rebuilt.

suryaiyer95

Multi-Model Code Review — altimate-code#493

Reviewed by 4-model panel: Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4, Kimi K2.5. Unanimous verdict: REQUEST CHANGES.

Note: The PR title significantly undersells scope. This contains a 799-line TypeScript orchestrator, new ClickHouse driver, `ExecuteOptions` refactored across 10+ drivers, column auto-discovery, partition support, and a 341-line SKILL.md.

Verdict: REQUEST CHANGES (fix MAJOR-1 before merge)
Major Issues: 1 | Minor Issues: 4 | NITs: 3

Major Issues

MAJOR-1 — SQL injection via table name interpolation in `buildColumnDiscoverySQL` (Security)

Location: `packages/opencode/src/altimate/native/connections/data-diff.ts` — `buildColumnDiscoverySQL`
Table name parts are interpolated directly into SQL:
```
schemaFilter = `table_schema = '${parts[1]}'`
tableFilter  = `table_name  = '${parts[2]}'`
```
`tableName` flows from `DataDiffParams.source` which comes from the LLM/user. A crafted table name like `orders' OR 1=1 --` produces injected SQL.

Fix:

const escape = (s: string) => s.replace(/'/g, "''")
schemaFilter = `table_schema = '${escape(parts[1])}'`
tableFilter  = `table_name  = '${escape(parts[2])}'`

Minor Issues

MINOR-1 — `discoverExtraColumns` silently falls back to key-only comparison (Logic Error)

When column discovery fails, the diff runs in key-only mode — reporting all rows as identical even when non-key values differ. No warning is surfaced.
Fix: Include a `warning` field in `DataDiffResult` when key-only fallback occurs.

MINOR-2 — `dateTruncExpr` missing Oracle dialect (Bug)

Oracle falls through to the default `DATE_TRUNC` case. Oracle uses `TRUNC(col, 'MONTH')` — date-partitioned diffs on Oracle tables will fail.
Fix: Add `case "oracle": return `TRUNC(${column}, '${g.toUpperCase()}')``

MINOR-3 — Unqualified table names match across multiple schemas (Bug)

`buildColumnDiscoverySQL` with no schema filter returns columns from all schemas with that table name — producing incorrect `extra_columns`.
Fix: Default `schemaFilter` to the connection's current schema, or document that fully-qualified names are required.

MINOR-4 — `MAX_STEPS = 200` hardcoded and undocumented (Design)

Not configurable, not documented in `DataDiffParams` or tool description.
Fix: Expose as optional `max_steps?: number` in `DataDiffParams`, defaulting to 200.

NITs

N1: `mergeOutcomes` caps `diff_rows` at 100 with no documentation — callers don't know the result is partial
N2: `partitionColumn` identifier interpolated directly in `buildPartitionDiscoverySQL` — inconsistent with MAJOR-1 fix
N3: `effectiveLimit` pattern repeated across 10+ drivers

Positive Observations

The cooperative state machine bridge (TypeScript orchestrator ↔ Rust engine) is the right architecture.
Column auto-discovery with two-layer audit exclusion is thorough.
`ExecuteOptions.noLimit` correctly applied across all 10+ drivers.
SKILL.md `CRITICAL: joindiff only sees one connection for cross-warehouse comparisons` is exactly the right guardrail.
ClickHouse driver using `JSONCompactEachRowWithNamesAndTypes` with pure HTTP is clean.
Dynamic import gracefully handles NAPI unavailability — unblocks CI.

suryaiyer95

Code Review: Data Parity TypeScript Orchestrator + ClickHouse Driver

Verdict: APPROVE with minor issues
Reviewed by: Claude (sole reviewer — external models unavailable)

Major Issue (1)

1. SQL Injection via `buildColumnDiscoverySQL` — ClickHouse/Snowflake paths (Security)

Location: packages/opencode/src/altimate/native/connections/data-diff.ts

The ClickHouse path uses DESCRIBE TABLE ${tableName} and Snowflake uses SHOW COLUMNS IN TABLE ${tableName} with raw string interpolation. While the information_schema paths properly escape with esc(), ClickHouse/Snowflake bypass escaping entirely.

A table name like orders; DROP TABLE users -- would be injected verbatim. The tableName comes from LLM tool params, so prompt injection or user error could trigger this.

Fix: Apply identifier quoting (backtick for ClickHouse, double-quote for Snowflake) or validate table names against [a-zA-Z0-9_.].

Minor Issues (4)

2. Profile outcome formatting uses wrong field names — `tools/data-diff.ts`

The TS formatter checks col.source_stats/col.target_stats but Rust serializes as col.table1/col.table2. Stats fields are non_null_count/distinct_count in Rust but the TS uses count/null_count. Profile output would show undefined values.

3. CTE injection doesn't handle ClickHouse `SETTINGS` clause — `data-diff.ts:injectCte()`

JoinDiff for ClickHouse emits ... SETTINGS join_use_nulls = 1. CTE-wrapped queries with SETTINGS may not propagate correctly in all ClickHouse versions.

4. Partition discovery fails on query sources — `data-diff.ts:runPartitionedDiff()`

When source is a SQL query, table1Name resolves to __diff_source but the partition discovery SQL doesn't include the CTE prefix. This would fail with "table __diff_source not found".

5. `noLimit` creates unbounded memory risk for JoinDiff — all 11 driver files

When noLimit is true, no LIMIT clause is added. For JoinDiff, the result set could be millions of rows loaded into Node.js memory. Consider a safety limit (e.g., 10M rows) even with noLimit.

Nit (1)

simulation-suite.test.ts changes (removing data: wrapper, flattening mock response fields) are unrelated bug fixes that should be in a separate commit

Positive Observations

Auto-discovery of audit columns with two-layer exclusion (name patterns + schema default detection) is production-quality. The SKILL.md workflow requiring user confirmation before excluding is excellent UX
CTE injection for query-vs-query comparison is clever — wrapping arbitrary SQL in CTEs so the Rust engine treats them as tables
Partition support (date/numeric/categorical) with per-partition breakdown reporting makes this usable on billion-row tables
SKILL.md is thorough — the 9-step plan template, algorithm selection guide, extra_columns behavior docs, and common mistakes section are exactly what an LLM agent needs
Consistent noLimit across all 11 drivers — clean, mechanical refactor with correct effectiveLimit > 0 guard for truncation detection
Defensive null guards added to lineage-check.ts, schema-inspect.ts, sql-analyze.ts — good hardening of existing tools

Missing Tests

No test for injectCte merging engine CTEs with source/target CTEs
No test for partition discovery failure (error propagation from runPartitionedDiff)
No integration test exercising DataParitySession NAPI round-trip from TypeScript

Reviewed by 1 model: Claude. External models (Gemini 3.1 Pro, Kimi K2.5, Grok 4) unavailable due to insufficient OpenRouter credits.

🤖 Generated with Claude Code

suryaiyer95 · 2026-04-01T20:55:29Z

Supplemental Review Findings

Two additional issues caught during deeper analysis:

MAJOR: Partition discovery misses target-only partitions (`data-diff.ts`)

runPartitionedDiff discovers partition values only from the source table. If the target has partitions not present in source (e.g., target has December data, source doesn't), those rows are silently missed — they never get diffed.

Fix: UNION partition values from both source and target tables before iterating.

MAJOR: `buildPartitionDiscoverySQL` — unquoted column injection

partitionColumn from LLM tool args is injected unquoted into SELECT DISTINCT ${expr} AS _p FROM ${table}. The buildPartitionWhereClause function correctly quotes with "${partitionColumn}" but the discovery query doesn't.

MINOR: `injectCte` breaks on `WITH RECURSIVE`

The regex trimmed.match(/^WITH\s+/i) matches WITH RECURSIVE and strips WITH , leaving RECURSIVE cte_name AS (...) — which is invalid SQL. Should detect and preserve RECURSIVE.

MINOR: ClickHouse partition WHERE uses ANSI double-quotes instead of backticks

buildPartitionWhereClause uses "${partitionColumn}" (ANSI quoting) but ClickHouse requires backtick quoting. This would fail with a syntax error for reserved-word column names on ClickHouse.

🤖 Generated with Claude Code

- Add DataParity engine integration via native Rust bindings - Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto) - Add ClickHouse driver support - Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing) - Gitignore .altimate-code/ (credentials) and *.node (platform binaries)

Split large tables by a date or numeric column before diffing. Each partition is diffed independently then results are aggregated. New params: - partition_column: column to split on (date or numeric) - partition_granularity: day | week | month | year (for dates) - partition_bucket_size: bucket width for numeric columns New output field: - partition_results: per-partition breakdown (identical / differ / error) Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.

When partition_column is set without partition_granularity or partition_bucket_size, groups by raw DISTINCT values. Works for any non-date, non-numeric column: status, region, country, etc. WHERE clause uses equality: col = 'value' with proper escaping.

Rust serializes ReladiffOutcome with serde tag 'mode', producing: {mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}} Previous code checked for {Match: {...}} / {Diff: {...}} shapes that never matched, causing partitioned diff to report all partitions as 'identical' with 0 rows. - extractStats(): check outcome.mode === 'diff', read from stats fields - mergeOutcomes(): aggregate mode-based outcomes correctly - summarize()/formatOutcome(): display mode-based shape with correct labels

Key changes based on feedback: - Always generate TODO plan before any tool is called - Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL) - Add PK discovery + explicit user confirmation step - Profile pass is now mandatory before row-level diff - Ask user before expensive row-level diff on large tables: - <100K rows: proceed automatically - 100K-10M rows: ask with where_clause option - >10M rows: offer window/partition/full choices - Document partition modes (date/numeric/categorical) with examples - Add warehouse_list as first step to confirm connections

…from data diff The Rust engine only compares columns explicitly listed in extra_columns. When omitted, it was silently reporting all key-matched rows as 'identical' even when non-key values differed — a false positive bug. Changes: - Auto-discover columns from information_schema when extra_columns is omitted and source is a plain table name (not a SQL query) - Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from comparison by default since they typically differ due to ETL timing - Report excluded columns in tool output so users know what was skipped - Fix misleading tool description that said 'Omit to compare all columns' - Update SKILL.md with critical guidance on extra_columns behavior

…ult truncation All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to 1000 rows. This silently drops rows when the data-diff engine needs complete result sets — a FULL OUTER JOIN returning >1000 diff rows would be truncated, causing the engine to undercount differences. - Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface - When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the truncation check so rows aren't sliced to zero - Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql, redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb - Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()` Interactive SQL callers are unaffected — they continue to get the default 1000-row limit. Only the data-diff pipeline opts out. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…m exclusions with user Column exclusion now has two layers: 1. Name-pattern matching (existing) — updated_at, created_at, _fivetran_synced, etc. 2. Schema-level default detection (new) — queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc. Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query). The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.

- `buildColumnDiscoverySQL`: escape single quotes in all interpolated table name parts to prevent SQL injection via crafted source/target names - `dateTruncExpr`: add Oracle case (`TRUNC(col, 'UNIT')`) — Oracle does not have `DATE_TRUNC`, date-partitioned diffs on Oracle tables previously failed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Apply esc() to Oracle and SQLite paths in buildColumnDiscoverySQL (SQL injection via table name was unpatched in these dialects) - Quote identifiers in resolveTableSources to prevent injection via table names containing semicolons or special characters - Surface SQL execution errors before feeding empty rows to the engine (silent false "match" when warehouse is unreachable is now an error) - Fix Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week) ('WEEK' throws ORA-01800 on all Oracle versions) - Quote partition column identifier in buildPartitionWhereClause

…r propagation, and test mock formats - `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings - `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning empty schema - `sql-analyze`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output - `lineage-check`: guard against null/undefined result from dispatcher to prevent "undefined" literal in output - `simulation-suite.test.ts`: fix `sql-translate` mock format — data fields must be flat (not wrapped in `data: {}`), add `source_dialect`/`target_dialect` to mock so assertions pass - `simulation-suite.test.ts`: fix `dbt-manifest` mock format — unwrap `data: {}` so `model_count` and `models` are accessible at top level Simulation suite: 695/839 → 839/839 (100%)

… corruption The @clickhouse/client package enables ERROR-level logging by default and writes `[ERROR][@clickhouse/client][Connection]` lines directly to stderr on auth/query failures. These raw writes corrupt the terminal TUI rendering. Set `log: { level: 127 }` (ClickHouseLogLevel.OFF) when creating the client — consistent with how Snowflake (`logLevel: 'OFF'`) and Databricks (no-op logger) already suppress their SDK loggers for the same reason.

…ack script - Validate table names before interpolating into DESCRIBE/SHOW COLUMNS for ClickHouse and Snowflake — reject names with non-alphanumeric characters to prevent SQL injection; also quote parts with dialect-appropriate delimiters - Discover partition values from BOTH source and target tables and union the results — previously only source was queried, silently missing rows that existed only in target-side partitions - Add script/pack-local.ts: mirrors publish.ts but stops before npm publish; injects local altimate-core tarballs from /tmp/altimate-local-dist/ for local end-to-end testing

Require that every diff result summary surfaces: - Exact scope (tables + warehouses compared) - Filters and time period applied (or explicitly states none) - Key columns used and how they were confirmed - Columns compared and excluded, with reasons (auto-timestamp, user request) - Algorithm used Includes example full result summary and guidance for identical results — emphasising that bare numbers without context are meaningless to the user.

The partitioned diff returned `{ Match: { row_count: 0, algorithm: 'partitioned' } }` when no partition values were found or all partitions failed. This format lacks `mode: 'diff'`, so `formatOutcome` fell through to raw JSON.stringify instead of producing clean output. Use the standard Rust engine format: `{ mode: 'diff', stats: {...}, diff_rows: [] }`

…y comparison modes

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

_{Tip: disable this comment in your organization's Code Review settings.}

Copilot

Pull request overview

Adds a new data-parity / data_diff workflow to the Altimate toolchain by introducing a TypeScript “orchestrator bridge” for the Rust DataParitySession state machine, extending driver execution to support unbounded result sets, and documenting a new /data-parity skill for correct algorithm selection.

Changes:

Introduces data_diff tool + native connection handler that runs the Rust diff session by executing emitted SQL tasks across warehouse connectors (including partitioned diffs and column auto-discovery).
Extends the drivers Connector.execute() API with ExecuteOptions.noLimit to bypass default LIMIT injection and truncation.
Adds /data-parity skill documentation and updates prompts/registry + adjusts simulation mocks to match updated dispatcher response shapes.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
packages/opencode/test/altimate/simulation-suite.test.ts	Updates mocked dispatcher response shapes for dbt manifest + SQL translation scenarios.
packages/opencode/src/tool/registry.ts	Registers the new `DataDiffTool`.
packages/opencode/src/altimate/tools/data-diff.ts	Adds the `data_diff` LLM tool wrapper and output formatting.
packages/opencode/src/altimate/prompts/builder.txt	Documents and wires up `/data-parity` skill invocation triggers.
packages/opencode/src/altimate/native/types.ts	Adds bridge types for `data.diff` params/results and registers the method.
packages/opencode/src/altimate/native/connections/register.ts	Registers the native `data.diff` dispatcher method.
packages/opencode/src/altimate/native/connections/data-diff.ts	Implements the TypeScript orchestrator bridging Rust session steps to warehouse SQL execution, plus partition support and column discovery.
packages/drivers/src/types.ts	Adds `ExecuteOptions` and extends `Connector.execute()` signature to accept options.
packages/drivers/src/{postgres,redshift,snowflake,bigquery,mysql,sqlite,duckdb,databricks,sqlserver,oracle,clickhouse}.ts	Threads `noLimit` option through each driver and fixes truncation detection when limit is disabled.
.opencode/skills/data-parity/SKILL.md	Adds the data-parity skill guide and algorithm-selection procedure.

Comments suppressed due to low confidence (1)

packages/drivers/src/clickhouse.ts:20

PR description claims the ClickHouse driver is a “pure HTTP driver” with “no external package required”, but this implementation still depends on @clickhouse/client and throws an install error if it’s missing.

Either update the PR description to match the current dependency, or rework the driver to avoid the external client if that’s the intended change.

import type { ConnectionConfig, Connector, ConnectorResult, ExecuteOptions, SchemaColumn } from "./types"

export async function connect(config: ConnectionConfig): Promise<Connector> {
  let createClient: any
  try {
    const mod = await import("@clickhouse/client")
    createClient = mod.createClient ?? mod.default?.createClient
    if (!createClient) {
      throw new Error("createClient export not found in @clickhouse/client")
    }
  } catch {
    throw new Error("ClickHouse driver not installed. Run: npm install @clickhouse/client")
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T21:10:16Z

packages/drivers/src/types.ts

+}
+
 export interface Connector {
  connect(): Promise<void>


Connector.execute() now accepts a 4th options?: ExecuteOptions param, but at least one connector implementation (MongoDB driver) still uses the old 3-arg signature. This will fail TypeScript structural typing for Promise<Connector> and break builds.

Update packages/drivers/src/mongodb.ts to accept (sql, limit?, binds?, options?) (it can ignore options), or make the Connector interface backwards-compatible via a rest/overload.

Suggested change

connect(): Promise<void>

connect(): Promise<void>

execute(sql: string, limit?: number, binds?: any[]): Promise<ConnectorResult>

Not applicable — TypeScript Connector interface uses options?: ExecuteOptions which is optional. MongoDB's 3-arg signature is structurally compatible since the 4th param is optional (?). TypeScript does not require implementations to declare optional trailing params they don't use. Verified: bun run script/build.ts compiles successfully.

Copilot · 2026-04-03T21:10:16Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+  // At least one is a query — wrap both in CTEs
+  // Quote identifier parts so table names with special chars don't inject SQL.
+  // Use double-quote escaping (ANSI SQL standard, works in Postgres/Snowflake/DuckDB/etc.)
+  const quoteIdent = (name: string) =>
+    name
+      .split(".")
+      .map((p) => `"${p.replace(/"/g, '""')}"`)
+      .join(".")
+  const srcExpr = source_is_query ? source : `SELECT * FROM ${quoteIdent(source)}`
+  const tgtExpr = target_is_query ? target : `SELECT * FROM ${quoteIdent(target)}`
+
+  const ctePrefix = `WITH __diff_source AS (\n${srcExpr}\n), __diff_target AS (\n${tgtExpr}\n)`


resolveTableSources() wraps plain table names as SELECT * FROM "schema"."table" using ANSI double-quote identifier escaping. This will break query-vs-table comparisons on dialects that don’t accept double quotes as identifier quotes by default (notably MySQL/MariaDB, and BigQuery typically expects backticks).

Consider passing the warehouse dialect into resolveTableSources and using dialect-specific identifier quoting, or (safer) validating table identifiers and leaving them unquoted.

Not applicable — resolveTableSources only wraps in CTEs when at least one side is a SQL query (line 43: if (!source_is_query && !target_is_query) return). When both sides are plain table names, they pass through unchanged and the Rust engine handles dialect-specific quoting. The ANSI double-quote path only runs for mixed query+table comparisons, where the CTE wrapper SQL runs on a single connection (one dialect), not cross-dialect.

Copilot · 2026-04-03T21:10:16Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+    case "oracle": {
+      // Oracle uses TRUNC() with format models — 'WEEK' is invalid, use 'IW' for ISO week
+      const oracleFmt: Record<string, string> = {
+        day: "DDD",
+        week: "IW",
+        month: "MM",
+        year: "YYYY",
+        quarter: "Q",
+        hour: "HH",
+        minute: "MI",
+      }
+      return `TRUNC(${column}, '${oracleFmt[g] ?? g.toUpperCase()}')`
+    }


Oracle TRUNC(date, 'DDD') truncates to the first day of the year, not to the day boundary. For day-level partitioning this will collapse all dates in the same year into one bucket.

Use TRUNC(col, 'DD') for day granularity (and keep IW/MM/YYYY for week/month/year).

Valid — Oracle TRUNC(date, 'DDD') truncates to day-of-year, not day-of-month. Should be 'DD' for day-level granularity. Will fix.

Copilot · 2026-04-03T21:10:17Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+  for (const pVal of partitionValues) {
+    const partWhere = buildPartitionWhereClause(
+      params.partition_column!,
+      pVal,
+      params.partition_granularity,
+      params.partition_bucket_size,
+      sourceDialect,
+    )
+    const fullWhere = params.where_clause ? `(${params.where_clause}) AND (${partWhere})` : partWhere
+
+    const result = await runDataDiff({
+      ...params,
+      where_clause: fullWhere,
+      partition_column: undefined, // prevent recursion
+    })


runPartitionedDiff() builds partWhere using sourceDialect, then passes the combined where_clause into runDataDiff(). The Rust engine applies the same where_clause to both sides, so cross-warehouse partitioning will generate dialect-specific SQL that may not run on the target dialect.

Either (a) restrict partitioned diffs to same-dialect comparisons with a clear error, or (b) generate dialect-agnostic partition predicates (e.g., date partitions as col >= start AND col < end) so the same predicate works on both dialects.

Valid concern, but mitigated in practice: the partition WHERE clauses we generate are basic comparison predicates (col >= N AND col < M, col = 'value', DATE_TRUNC(...)) that are valid across all supported dialects. The Rust engine's where_clause is applied independently to each table's SQL using that table's dialect for quoting/syntax. The partition predicate itself uses standard SQL operators. Will add a comment documenting this invariant.

Copilot · 2026-04-03T21:10:17Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+  const mode = partitionMode(granularity, bucketSize)
+  // Quote the column identifier to handle special characters and reserved words
+  const quotedCol = `"${partitionColumn.replace(/"/g, '""')}"`
+
+  if (mode === "numeric") {
+    const lo = Number(partitionValue)
+    const hi = lo + bucketSize!
+    return `${quotedCol} >= ${lo} AND ${quotedCol} < ${hi}`
+  }
+
+  if (mode === "categorical") {
+    // Quote the value — works for strings, enums, booleans
+    const escaped = partitionValue.replace(/'/g, "''")
+    return `${quotedCol} = '${escaped}'`
+  }


buildPartitionWhereClause() always double-quotes the partition column identifier. This is not portable (e.g., MySQL treats double quotes as string literals unless ANSI_QUOTES is enabled; SQL Server typically uses brackets).

Use dialect-specific identifier quoting here (or validate and avoid quoting), otherwise partitioned diffs can fail purely due to identifier quoting.

Valid — the partition WHERE clause uses ANSI double-quotes which MySQL treats as string literals by default. The Rust engine handles dialect-specific quoting for its own SQL, but these partition predicates are generated in TypeScript. Will add dialect-aware quoting to buildPartitionWhereClause.

Copilot · 2026-04-03T21:10:17Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+  if (mode === "categorical") {
+    // Quote the value — works for strings, enums, booleans
+    const escaped = partitionValue.replace(/'/g, "''")
+    return `${quotedCol} = '${escaped}'`
+  }
+
+  // date mode
+  const expr = dateTruncExpr(granularity!, quotedCol, dialect)
+
+  // Cast the literal appropriately per dialect
+  switch (dialect) {
+    case "bigquery":
+      return `${expr} = '${partitionValue}'`
+    case "clickhouse":
+      return `${expr} = toDate('${partitionValue}')`
+    case "mysql":
+    case "mariadb":
+      return `${expr} = '${partitionValue}'`
+    default:
+      return `${expr} = '${partitionValue}'`
+  }


Categorical partition filtering always generates col = 'value'. Because executeQuery() stringifies all discovery results, boolean (and sometimes numeric) partition values will be compared as strings, which is invalid in some dialects (e.g., BigQuery BOOL = STRING).

Preserve primitive types from partition discovery (don’t stringify), or add literal rendering that handles booleans/numbers without quotes and dates with proper casts per dialect.

Acknowledged — minor edge case. Partition discovery returns string representations from the database driver. For categorical columns the single-quote wrapping works for strings and numerics (databases auto-cast '123' to INT in comparisons). Boolean columns on BigQuery could be an issue but this is a rare combination (partitioning by boolean). Will add a note.

Copilot · 2026-04-03T21:10:18Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+
+  let expr: string
+  if (mode === "numeric") {
+    expr = `FLOOR(${partitionColumn} / ${bucketSize}) * ${bucketSize}`
+  } else if (mode === "date") {
+    expr = dateTruncExpr(granularity!, partitionColumn, dialect)
+  } else {
+    // categorical — raw distinct values, no transformation
+    expr = partitionColumn


buildPartitionDiscoverySQL() interpolates partitionColumn directly into the SELECT expression without any identifier quoting/validation. Partition columns that are reserved words or contain special characters will break the discovery query.

Apply the same identifier-quoting strategy used elsewhere (dialect-aware), or validate partitionColumn against a safe identifier regex before interpolation.

Suggested change

let expr: string

if (mode === "numeric") {

expr = `FLOOR(${partitionColumn} / ${bucketSize}) * ${bucketSize}`

} else if (mode === "date") {

expr = dateTruncExpr(granularity!, partitionColumn, dialect)

} else {

// categorical — raw distinct values, no transformation

expr = partitionColumn

// Quote the column identifier to handle special characters and reserved words

const quotedCol = `"${partitionColumn.replace(/"/g, '""')}"`

let expr: string

if (mode === "numeric") {

expr = `FLOOR(${quotedCol} / ${bucketSize}) * ${bucketSize}`

} else if (mode === "date") {

expr = dateTruncExpr(granularity!, quotedCol, dialect)

} else {

// categorical — raw distinct values, no transformation

expr = quotedCol

Valid — buildPartitionDiscoverySQL should quote the partition column. Will fix alongside the buildPartitionWhereClause quoting fix to use consistent dialect-aware quoting.

Copilot · 2026-04-03T21:10:18Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+  const sourceDialect = resolveDialect(params.source_warehouse)
+  const targetDialect = resolveDialect(params.target_warehouse ?? params.source_warehouse)
+  const { table1Name, table2Name } = resolveTableSources(params.source, params.target)
+
+  // Discover partition values from BOTH source and target to catch target-only partitions.
+  // Without this, rows that exist only in target partitions are silently missed.
+  const sourceDiscoverySql = buildPartitionDiscoverySQL(
+    table1Name,
+    params.partition_column!,
+    params.partition_granularity,
+    params.partition_bucket_size,
+    sourceDialect,
+    params.where_clause,
+  )
+  const targetDiscoverySql = buildPartitionDiscoverySQL(
+    table2Name,
+    params.partition_column!,
+    params.partition_granularity,
+    params.partition_bucket_size,
+    targetDialect,
+    params.where_clause,
+  )


runPartitionedDiff() calls resolveTableSources() but discards the returned ctePrefix. If source/target is a SQL query (or query-vs-table), table1Name/table2Name become __diff_source/__diff_target, but the partition discovery SQL is executed without injecting the CTE definitions, so discovery will fail.

Either disallow partition_column when either side is a query, or plumb ctePrefix through and injectCte() into the partition discovery queries (and any other pre-session SQL executed in partition mode).

Valid — partitioned diff with SQL queries as source/target would fail. However, this combination is intentionally unsupported: you can't meaningfully partition a SQL query by column without knowing the underlying table. The current code path only reaches runPartitionedDiff when partition_column is set, which implies table-name inputs. Will add an explicit check and error message for this case.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

packages/drivers/src/redshift.ts (1)

49-64: ⚠️ Potential issue | 🔴 Critical

Redshift driver breaks parameterized queries by ignoring bind parameters.

The execute method accepts _binds but never passes them to client.query(), unlike all other drivers (Snowflake uses executeQuery(query, binds), DuckDB uses conditional queryWithParams(finalSql, binds)). This breaks SQL injection protection and query correctness for all callers passing binds.

Rename the parameter to binds (remove underscore) and pass it to the query execution:

🔧 Proposed fix

-    async execute(sql: string, limit?: number, _binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
+    async execute(sql: string, limit?: number, binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
       const client = await pool.connect()
       try {
         const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
         let query = sql
         const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
         if (
           isSelectLike &&
           effectiveLimit &&
           !/\bLIMIT\b/i.test(sql)
         ) {
           query = `${sql.replace(/;\s*$/, "")} LIMIT ${effectiveLimit + 1}`
         }
 
-        const result = await client.query(query)
+        const result =
+          binds && binds.length > 0
+            ? await client.query(query, binds)
+            : await client.query(query)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@packages/drivers/src/redshift.ts` around lines 49 - 64, The execute method in
packages/drivers/src/redshift.ts currently ignores the bind parameters by naming
the param _binds and not passing them to client.query; rename the method
parameter from _binds to binds (and update any internal references) and pass
binds into the database call (i.e., client.query(query, binds)) so parameterized
queries work like other drivers (see Snowflake/DuckDB patterns) and preserve SQL
injection protection and correctness.

packages/drivers/src/mysql.ts (1)

44-56: ⚠️ Potential issue | 🟠 Major

MySQL execute accepts binds parameter but never passes it to pool.query().

The interface contract (line 31 of types.ts) requires binds?: any[] to be used. Line 44 accepts it as _binds (underscore prefix indicates intentional ignoring), but line 56 calls pool.query(query) without them. Call sites like schema/tags.ts:92 and finops/*.ts pass parameterized binds expecting them to work. Parameterized queries with placeholders will fail silently in this driver.

Snowflake shows the correct pattern: pass binds to executeQuery(query, binds).

🔧 Proposed fix

-    async execute(sql: string, limit?: number, _binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
+    async execute(sql: string, limit?: number, binds?: any[], options?: ExecuteOptions): Promise<ConnectorResult> {
       const effectiveLimit = options?.noLimit ? 0 : (limit ?? 1000)
       let query = sql
       const isSelectLike = /^\s*(SELECT|WITH|VALUES)\b/i.test(sql)
@@
-      const [rows, fields] = await pool.query(query)
+      const [rows, fields] =
+        binds && binds.length > 0
+          ? await pool.query(query, binds)
+          : await pool.query(query)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@packages/drivers/src/mysql.ts` around lines 44 - 56, The execute function
currently ignores the binds parameter (_binds) and calls pool.query(query)
without passing parameters, causing parameterized queries to fail; update the
execute signature to accept binds (e.g., binds?: any[]) or rename _binds to
binds and pass them into pool.query (call pool.query(query, binds) or
pool.query(query, binds ?? []) so existing parameterized callers work), ensuring
the LIMIT augmentation logic still operates on the query string and does not
drop or alter binds; key symbols: execute(sql, limit?, _binds?, options?),
pool.query(query).

packages/drivers/src/sqlserver.ts (1)

49-61: ⚠️ Potential issue | 🟡 Minor

Include CTEs in the default row-cap detection.

Line 49 only matches SELECT, so WITH ... SELECT queries never get TOP injected and bypass the default 1,000-row safety cap. All other drivers in this codebase (postgres, oracle, redshift, mysql, duckdb, databricks, snowflake, bigquery) include WITH in their isSelectLike pattern.
💡 Minimal fix
-      const isSelectLike = /^\s*SELECT\b/i.test(sql)
+      const isSelectLike = /^\s*(SELECT|WITH)\b/i.test(sql)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/drivers/src/sqlserver.ts` around lines 49 - 61, The current
isSelectLike regex only matches queries starting with SELECT so CTEs (WITH ...
SELECT) bypass the TOP injection; update the isSelectLike detection in
sqlserver.ts to also match queries that begin with WITH ... SELECT (i.e., treat
queries with leading CTEs as "select-like"), and adjust the TOP-insertion logic
so it still injects TOP for the main SELECT following the CTE (keep using
effectiveLimit and the existing checks for existing TOP/LIMIT). Target symbols:
isSelectLike, effectiveLimit, and the query replacement that inserts TOP after
SELECT.

🧹 Nitpick comments (6)

packages/drivers/src/oracle.ts (1)
40-75: Extract the limit/truncation policy into a shared helper.

effectiveLimit, truncated, and the final row slice are now repeated across the driver execute implementations touched in this PR. Centralizing that policy will make future noLimit fixes much less likely to drift by dialect.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/drivers/src/oracle.ts` around lines 40 - 75, The limit/truncation
logic repeated in execute (variables effectiveLimit, truncated, and slicing to
produce limitedRows) should be moved into a shared helper (e.g., a util like
computeLimitAndRows or applyLimitPolicy) and invoked from Oracle's async execute
method; refactor so execute only calls the helper with (sqlLimit,
options?.noLimit, rows) and uses the returned { effectiveLimit, truncated, rows:
limitedRows, row_count } to build the ConnectorResult, keeping existing behavior
for isSelectLike/ROWNUM wrapping inside execute and preserving columns
extraction via result.metaData.
packages/opencode/src/altimate/native/types.ts (1)
967-983: Narrow algorithm to the documented literal union.

Since data.diff is a public bridge contract, keeping algorithm as plain string defers validation to runtime. The implementation only handles the five documented values; a literal union would catch invalid names at the boundary.
Type refinement
+export type DataDiffAlgorithm =
+  | "auto"
+  | "joindiff"
+  | "hashdiff"
+  | "profile"
+  | "cascade"
+
 export interface DataDiffParams {
   /** Source table name (e.g. "orders", "db.schema.orders") or full SQL query */
   source: string
@@
   /** Extra columns to compare beyond the key */
   extra_columns?: string[]
   /** Algorithm: "auto" | "joindiff" | "hashdiff" | "profile" | "cascade" */
-  algorithm?: string
+  algorithm?: DataDiffAlgorithm
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opencode/src/altimate/native/types.ts` around lines 967 - 983, The
DataDiffParams interface currently types the algorithm field as a plain string;
narrow it to the documented literal union so invalid values are caught at
compile time. Update the DataDiffParams type (the algorithm property) to be the
union "auto" | "joindiff" | "hashdiff" | "profile" | "cascade" (optionally keep
it optional if intended), and adjust any call sites or tests that pass
non-literal strings to use one of those five literals; ensure any type imports
or references to DataDiffParams reflect the new union.
packages/opencode/src/altimate/native/connections/data-diff.ts (1)
390-393: Silent fallback to key-only comparison when discovery fails.

When discoverExtraColumns fails (e.g., schema query error), it returns undefined and the orchestrator silently falls back to key-only comparison. This could mask configuration issues or produce misleading "identical" results.

Consider logging a warning when discovery fails so users are aware.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines
390 - 393, The catch block in discoverExtraColumns swallows errors and returns
undefined, causing a silent fallback to key-only comparison; modify the catch to
capture the error (catch (err)) and emit a warning log before returning
undefined so users are informed (e.g., use the module's logger or propagate to
the orchestrator) and include the error message/context in the log; ensure the
change is made inside the discoverExtraColumns function where the current `catch
{ return undefined }` resides.
.opencode/skills/data-parity/SKILL.md (2)
12-23: Add language specifier to fenced code block.

The TODO list code block lacks a language specifier. While this is a minor linting issue, adding a language (e.g., markdown or text) improves rendering consistency.
📝 Suggested fix
-```
+```text
 Here's my plan:
 1. [ ] List available warehouse connections
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.opencode/skills/data-parity/SKILL.md around lines 12 - 23, The fenced TODO
list code block lacks a language tag; update the triple-backtick fence
surrounding the plan (the multiline TODO block starting with "Here's my plan:")
to include a language specifier such as "text" or "markdown" (e.g., ```text) so
the block renders consistently; ensure you change both opening and closing
fences only, leaving the inner checklist lines unchanged.
165-174: Add language specifier for tool invocation examples.

Multiple code blocks showing data_diff invocations lack language specifiers (lines 165, 217, 233). Consider using python or javascript syntax for better readability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.opencode/skills/data-parity/SKILL.md around lines 165 - 174, The code
examples invoking the data_diff helper are missing fenced-code language
specifiers; update each code block that shows data_diff (the three examples
invoking data_diff with
source/target/key_columns/source_warehouse/target_warehouse/algorithm) to use a
language tag like ```python (or ```javascript) so syntax highlighting and
readability improve for those code fences.
packages/opencode/src/altimate/tools/data-diff.ts (1)
114-115: Prefer typed outcome over as any cast.

The outcome is cast to any and passed to helper functions that also treat it as any. Consider defining a discriminated union type matching the Rust serialization ({ mode: "diff" | "profile" | "cascade", ... }) for better type safety and maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/opencode/src/altimate/tools/data-diff.ts` around lines 114 - 115,
The code uses an untyped cast `const outcome = result.outcome as any` and then
passes it into `formatOutcome(outcome, args.source, args.target)`, losing type
safety; define a discriminated union (e.g., type Outcome = { mode: "diff"; ... }
| { mode: "profile"; ... } | { mode: "cascade"; ... }) that matches the Rust
serialization, replace the `as any` cast with `const outcome: Outcome =
result.outcome`, and update `formatOutcome` (and any helper functions it calls)
to accept `Outcome` so callers can narrow on `outcome.mode` instead of relying
on `any`.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/drivers/src/clickhouse.ts`:
- Around line 60-63: Replace the magic number 127 with the exported enum
constant by importing ClickHouseLogLevel from '@clickhouse/client' and setting
clientConfig.log = { level: ClickHouseLogLevel.OFF }; update the import list
that currently includes createClient and ExecuteOptions to also import
ClickHouseLogLevel so the createClient(clientConfig) call uses the named
constant instead of 127.

In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 458-480: The SELECT in buildPartitionDiscoverySQL interpolates
partitionColumn unquoted causing syntax/injection risk; update
buildPartitionDiscoverySQL to quote the partition column the same way
buildPartitionWhereClause does (use the project’s identifier-quoting helper or
the same quoting logic used by buildPartitionWhereClause) when constructing expr
for all modes (numeric, date, categorical) and when passing partitionColumn into
dateTruncExpr/other helpers so the generated SQL uses a properly quoted
identifier (_p AS derived from the quoted column) and avoids injection/syntax
errors while preserving existing partitionMode/dateTruncExpr behavior.
- Around line 512-523: The ClickHouse branch is using ANSI double-quote
identifier quoting (see the quotedCol usage and the switch case for "clickhouse"
in data-diff.ts) which breaks on reserved words; update the dialect-aware
quoting so that when dialect === "clickhouse" identifiers are wrapped with
backticks (`) rather than double-quotes (") and ensure the partition WHERE
expression generation (the "clickhouse" case returning `${expr} =
toDate('${partitionValue}')`) uses that backtick-quoted identifier; ideally
centralize this in the existing quotedCol helper or add a small helper that
returns backtick-quoted names for ClickHouse and preserves existing quoting for
other dialects.
- Around line 78-92: injectCte currently strips a possible RECURSIVE keyword
because the regex /^WITH\s+/i only matches "WITH " so when rebuilding you end up
with "WITH <ourDefs>,\nRECURSIVE ..." which is invalid; update the detection to
capture an optional RECURSIVE (e.g. use a regex like /^WITH(\s+RECURSIVE)?\s+/i)
so you can preserve that token, compute the remainder (afterWith) the same way,
and rebuild the prefix using the captured recursive group (e.g.
`WITH${recursive} ${ourDefs},\n${afterWith}`) while still stripping only the
leading "WITH…" from ctePrefix via ourDefs = ctePrefix.replace(/^WITH\s+/i, "").
- Around line 660-667: The partition WHERE clauses are being generated only for
sourceDialect which breaks target parsing when sourceDialect !== targetDialect;
update the partition loop that calls buildPartitionWhereClause (inside the
partitionValues iteration) to produce dialect-specific where clauses for both
sourceDialect and targetDialect (similar to how partition discovery does),
storing them separately and passing both appropriately into runDataDiff / the
DataParitySession spec so the source uses the source WHERE and the target uses
the target WHERE; ensure you reference the same params (partition_column,
partition_granularity, partition_bucket_size) when calling
buildPartitionWhereClause for each dialect.

In `@packages/opencode/src/altimate/tools/data-diff.ts`:
- Around line 82-87: The current ctx.ask call passes args.source and args.target
(table names/queries) to the "sql_execute_read" permission check, which doesn't
validate warehouse-level access; update the permission validation to explicitly
include warehouse identifiers so both source and target warehouses are
authorized — e.g., change the ctx.ask invocation in data-diff.ts (the call to
ctx.ask for "sql_execute_read") to include args.source_warehouse and
args.target_warehouse in the patterns or perform separate ctx.ask calls for each
warehouse, ensuring both warehouses are checked before executing queries.

---

Outside diff comments:
In `@packages/drivers/src/mysql.ts`:
- Around line 44-56: The execute function currently ignores the binds parameter
(_binds) and calls pool.query(query) without passing parameters, causing
parameterized queries to fail; update the execute signature to accept binds
(e.g., binds?: any[]) or rename _binds to binds and pass them into pool.query
(call pool.query(query, binds) or pool.query(query, binds ?? []) so existing
parameterized callers work), ensuring the LIMIT augmentation logic still
operates on the query string and does not drop or alter binds; key symbols:
execute(sql, limit?, _binds?, options?), pool.query(query).

In `@packages/drivers/src/redshift.ts`:
- Around line 49-64: The execute method in packages/drivers/src/redshift.ts
currently ignores the bind parameters by naming the param _binds and not passing
them to client.query; rename the method parameter from _binds to binds (and
update any internal references) and pass binds into the database call (i.e.,
client.query(query, binds)) so parameterized queries work like other drivers
(see Snowflake/DuckDB patterns) and preserve SQL injection protection and
correctness.

In `@packages/drivers/src/sqlserver.ts`:
- Around line 49-61: The current isSelectLike regex only matches queries
starting with SELECT so CTEs (WITH ... SELECT) bypass the TOP injection; update
the isSelectLike detection in sqlserver.ts to also match queries that begin with
WITH ... SELECT (i.e., treat queries with leading CTEs as "select-like"), and
adjust the TOP-insertion logic so it still injects TOP for the main SELECT
following the CTE (keep using effectiveLimit and the existing checks for
existing TOP/LIMIT). Target symbols: isSelectLike, effectiveLimit, and the query
replacement that inserts TOP after SELECT.

---

Nitpick comments:
In @.opencode/skills/data-parity/SKILL.md:
- Around line 12-23: The fenced TODO list code block lacks a language tag;
update the triple-backtick fence surrounding the plan (the multiline TODO block
starting with "Here's my plan:") to include a language specifier such as "text"
or "markdown" (e.g., ```text) so the block renders consistently; ensure you
change both opening and closing fences only, leaving the inner checklist lines
unchanged.
- Around line 165-174: The code examples invoking the data_diff helper are
missing fenced-code language specifiers; update each code block that shows
data_diff (the three examples invoking data_diff with
source/target/key_columns/source_warehouse/target_warehouse/algorithm) to use a
language tag like ```python (or ```javascript) so syntax highlighting and
readability improve for those code fences.

In `@packages/drivers/src/oracle.ts`:
- Around line 40-75: The limit/truncation logic repeated in execute (variables
effectiveLimit, truncated, and slicing to produce limitedRows) should be moved
into a shared helper (e.g., a util like computeLimitAndRows or applyLimitPolicy)
and invoked from Oracle's async execute method; refactor so execute only calls
the helper with (sqlLimit, options?.noLimit, rows) and uses the returned {
effectiveLimit, truncated, rows: limitedRows, row_count } to build the
ConnectorResult, keeping existing behavior for isSelectLike/ROWNUM wrapping
inside execute and preserving columns extraction via result.metaData.

In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 390-393: The catch block in discoverExtraColumns swallows errors
and returns undefined, causing a silent fallback to key-only comparison; modify
the catch to capture the error (catch (err)) and emit a warning log before
returning undefined so users are informed (e.g., use the module's logger or
propagate to the orchestrator) and include the error message/context in the log;
ensure the change is made inside the discoverExtraColumns function where the
current `catch { return undefined }` resides.

In `@packages/opencode/src/altimate/native/types.ts`:
- Around line 967-983: The DataDiffParams interface currently types the
algorithm field as a plain string; narrow it to the documented literal union so
invalid values are caught at compile time. Update the DataDiffParams type (the
algorithm property) to be the union "auto" | "joindiff" | "hashdiff" | "profile"
| "cascade" (optionally keep it optional if intended), and adjust any call sites
or tests that pass non-literal strings to use one of those five literals; ensure
any type imports or references to DataDiffParams reflect the new union.

In `@packages/opencode/src/altimate/tools/data-diff.ts`:
- Around line 114-115: The code uses an untyped cast `const outcome =
result.outcome as any` and then passes it into `formatOutcome(outcome,
args.source, args.target)`, losing type safety; define a discriminated union
(e.g., type Outcome = { mode: "diff"; ... } | { mode: "profile"; ... } | { mode:
"cascade"; ... }) that matches the Rust serialization, replace the `as any` cast
with `const outcome: Outcome = result.outcome`, and update `formatOutcome` (and
any helper functions it calls) to accept `Outcome` so callers can narrow on
`outcome.mode` instead of relying on `any`.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4ae147d4-a6e5-4a2e-8d5c-cfa62bc9db0a

📥 Commits

Reviewing files that changed from the base of the PR and between 0d34855 and e41e5a0.

📒 Files selected for processing (20)

.opencode/skills/data-parity/SKILL.md
packages/drivers/src/bigquery.ts
packages/drivers/src/clickhouse.ts
packages/drivers/src/databricks.ts
packages/drivers/src/duckdb.ts
packages/drivers/src/mysql.ts
packages/drivers/src/oracle.ts
packages/drivers/src/postgres.ts
packages/drivers/src/redshift.ts
packages/drivers/src/snowflake.ts
packages/drivers/src/sqlite.ts
packages/drivers/src/sqlserver.ts
packages/drivers/src/types.ts
packages/opencode/src/altimate/native/connections/data-diff.ts
packages/opencode/src/altimate/native/connections/register.ts
packages/opencode/src/altimate/native/types.ts
packages/opencode/src/altimate/prompts/builder.txt
packages/opencode/src/altimate/tools/data-diff.ts
packages/opencode/src/tool/registry.ts
packages/opencode/test/altimate/simulation-suite.test.ts

coderabbitai · 2026-04-03T21:15:20Z

packages/drivers/src/clickhouse.ts

+      // Silence the client's internal stderr logger — its ERROR-level output
+      // writes raw lines directly to stderr and corrupts terminal TUI rendering.
+      clientConfig.log = { level: 127 } // ClickHouseLogLevel.OFF = 127
      client = createClient(clientConfig)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

What are the log level constants in @clickhouse/client npm package? What value is ClickHouseLogLevel.OFF?

💡 Result:

The log level constants in the @clickhouse/client npm package are part of the ClickHouseLogLevel enum. The allowed values are: OFF, TRACE, DEBUG, INFO, WARN, ERROR. ClickHouseLogLevel.OFF disables logging (previously the default, now WARN is default in recent versions). These are configured in the client options as log: { level: ClickHouseLogLevel.WARN }. Default logger emits to stdout/stderr via console methods. Custom LoggerClass can be provided.

Citations:

1: https://clickhouse.com/docs/en/integrations/language-clients/javascript

2: https://github.com/ClickHouse/clickhouse-js/releases

3: https://clickhouse.com/docs/integrations/javascript

Replace magic number with ClickHouseLogLevel.OFF constant.

The code uses hardcoded 127, but @clickhouse/client exports ClickHouseLogLevel.OFF as a named constant. Replace the magic number to improve maintainability:

import { createClient, type ExecuteOptions, ClickHouseLogLevel } from '@clickhouse/client' // ... clientConfig.log = { level: ClickHouseLogLevel.OFF }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/drivers/src/clickhouse.ts` around lines 60 - 63, Replace the magic number 127 with the exported enum constant by importing ClickHouseLogLevel from '@clickhouse/client' and setting clientConfig.log = { level: ClickHouseLogLevel.OFF }; update the import list that currently includes createClient and ExecuteOptions to also import ClickHouseLogLevel so the createClient(clientConfig) call uses the named constant instead of 127.

Acknowledged — cosmetic improvement. We used the numeric value to avoid importing from @clickhouse/client (the driver dynamically imports it). The numeric constant is correct and documented in the comment. Will leave as-is for now since the dynamic import makes a static import of the enum awkward.

coderabbitai · 2026-04-03T21:15:20Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+export function injectCte(sql: string, ctePrefix: string): string {
+  const trimmed = sql.trimStart()
+  const withMatch = trimmed.match(/^WITH\s+/i)
+
+  if (withMatch) {
+    // Engine also has CTEs — merge them: our CTEs first, then engine CTEs
+    const afterWith = trimmed.slice(withMatch[0].length)
+    // ctePrefix already starts with "WITH …" — strip "WITH " and append ", "
+    const ourDefs = ctePrefix.replace(/^WITH\s+/i, "")
+    return `WITH ${ourDefs},\n${afterWith}`
+  }
+
+  // Plain SELECT — just prepend our CTE block
+  return `${ctePrefix}\n${trimmed}`
+}


⚠️ Potential issue | 🟡 Minor

injectCte strips RECURSIVE keyword from WITH RECURSIVE clauses.

The regex trimmed.match(/^WITH\s+/i) matches WITH RECURSIVE ... and the subsequent afterWith = trimmed.slice(withMatch[0].length) removes only WITH (5-6 chars), leaving RECURSIVE .... However, line 87 then produces WITH ${ourDefs},\nRECURSIVE ... which is invalid SQL.

This was flagged in the PR comments as a minor issue.

🐛 Proposed fix to preserve RECURSIVE

export function injectCte(sql: string, ctePrefix: string): string { const trimmed = sql.trimStart() - const withMatch = trimmed.match(/^WITH\s+/i) + const withMatch = trimmed.match(/^WITH(\s+RECURSIVE)?\s+/i) if (withMatch) { // Engine also has CTEs — merge them: our CTEs first, then engine CTEs const afterWith = trimmed.slice(withMatch[0].length) + const recursive = withMatch[1] ? "RECURSIVE " : "" // ctePrefix already starts with "WITH …" — strip "WITH " and append ", " - const ourDefs = ctePrefix.replace(/^WITH\s+/i, "") - return `WITH ${ourDefs},\n${afterWith}` + const ourDefs = ctePrefix.replace(/^WITH\s+/i, "") + return `WITH ${recursive}${ourDefs},\n${afterWith}` } // Plain SELECT — just prepend our CTE block return `${ctePrefix}\n${trimmed}` }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines 78 - 92, injectCte currently strips a possible RECURSIVE keyword because the regex /^WITH\s+/i only matches "WITH " so when rebuilding you end up with "WITH <ourDefs>,\nRECURSIVE ..." which is invalid; update the detection to capture an optional RECURSIVE (e.g. use a regex like /^WITH(\s+RECURSIVE)?\s+/i) so you can preserve that token, compute the remainder (afterWith) the same way, and rebuild the prefix using the captured recursive group (e.g. `WITH${recursive} ${ourDefs},\n${afterWith}`) while still stripping only the leading "WITH…" from ctePrefix via ourDefs = ctePrefix.replace(/^WITH\s+/i, "").

Not applicable — the Rust engine never emits WITH RECURSIVE CTEs. The engine generates simple SELECT, WITH ... AS (non-recursive), and FULL OUTER JOIN queries. The injectCte function only merges our source/target CTE definitions with the engine's output. No recursive CTEs are involved in data diff.

@suryaiyer95, thanks for the context — understood. Since the Rust DataParitySession engine is guaranteed to only emit non-recursive WITH ... AS and SELECT/FULL OUTER JOIN statements, the WITH RECURSIVE case will never arise in practice and the fix is unnecessary.

( ˆ꒳ˆ )

✏️ Learnings added

Learnt from: suryaiyer95 Repo: AltimateAI/altimate-code PR: 493 File: packages/opencode/src/altimate/native/connections/data-diff.ts:78-92 Timestamp: 2026-04-03T22:44:33.310Z Learning: In `packages/opencode/src/altimate/native/connections/data-diff.ts`, the `injectCte` function merges CTE definitions with SQL emitted by the Rust `DataParitySession` engine. The engine is guaranteed to only emit non-recursive SQL: simple `SELECT`, `WITH ... AS` (non-recursive), and `FULL OUTER JOIN` queries. It never emits `WITH RECURSIVE` CTEs, so no special handling for the `RECURSIVE` keyword is needed in `injectCte`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

packages/opencode/src/altimate/native/connections/data-diff.ts

coderabbitai · 2026-04-03T21:15:21Z

packages/opencode/src/altimate/native/connections/data-diff.ts

+  for (const pVal of partitionValues) {
+    const partWhere = buildPartitionWhereClause(
+      params.partition_column!,
+      pVal,
+      params.partition_granularity,
+      params.partition_bucket_size,
+      sourceDialect,
+    )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if cross-dialect partitioned diffs are actually supported/tested rg -n "partition_column.*source_warehouse.*target_warehouse" --type=ts rg -n "sourceDialect.*targetDialect" --type=ts -C3

Repository: AltimateAI/altimate-code

Length of output: 50

🏁 Script executed:

#!/bin/bash # First, find and examine the data-diff.ts file fd "data-diff.ts" --type f # Get the file size to understand how to read it wc -l $(fd "data-diff.ts" --type f)

Repository: AltimateAI/altimate-code

Length of output: 317

🏁 Script executed:

#!/bin/bash # Read the relevant section of the file around lines 660-667 sed -n '650,680p' "$(fd "data-diff.ts" --type f)"

Repository: AltimateAI/altimate-code

Length of output: 223

🏁 Script executed:

#!/bin/bash # Search for buildPartitionWhereClause definition rg -n "buildPartitionWhereClause" --type=ts -A 10 | head -50

Repository: AltimateAI/altimate-code

Length of output: 2283

🏁 Script executed:

#!/bin/bash # Search for how fullWhere is used, especially in runDataDiff rg -n "fullWhere" --type=ts -B 2 -A 2 | head -60

Repository: AltimateAI/altimate-code

Length of output: 950

🏁 Script executed:

#!/bin/bash # Search for targetDialect to understand dialect handling rg -n "targetDialect" --type=ts -B 2 -A 2 | head -80

Repository: AltimateAI/altimate-code

Length of output: 1100

🏁 Script executed:

#!/bin/bash # Read a larger section around lines 600-680 sed -n '600,680p' packages/opencode/src/altimate/native/connections/data-diff.ts

Repository: AltimateAI/altimate-code

Length of output: 2855

🏁 Script executed:

#!/bin/bash # Look at the function signature and broader context before line 660 sed -n '580,610p' packages/opencode/src/altimate/native/connections/data-diff.ts

Repository: AltimateAI/altimate-code

Length of output: 1396

🏁 Script executed:

#!/bin/bash # Find the runDataDiff function signature rg -n "^async function runDataDiff|^function runDataDiff" --type=ts -A 20 | head -60

Repository: AltimateAI/altimate-code

Length of output: 50

🏁 Script executed:

#!/bin/bash # Check where buildPartitionWhereClause is called and with which dialects rg -n "buildPartitionWhereClause" --type=ts -B 3 -A 3 | grep -v "^--$"

Repository: AltimateAI/altimate-code

Length of output: 1387

🏁 Script executed:

#!/bin/bash # Find runDataDiff function and see how where_clause is used rg -n "^async function runDataDiff" --type=ts -A 100 | head -120

Repository: AltimateAI/altimate-code

Length of output: 50

🏁 Script executed:

#!/bin/bash # Search for where where_clause is used in the execution rg -n "where_clause" packages/opencode/src/altimate/native/connections/data-diff.ts -B 1 -A 1

Repository: AltimateAI/altimate-code

Length of output: 666

🏁 Script executed:

#!/bin/bash # Let me see if runDataDiff is defined or imported rg -n "runDataDiff" packages/opencode/src/altimate/native/connections/data-diff.ts | head -30

Repository: AltimateAI/altimate-code

Length of output: 199

🏁 Script executed:

#!/bin/bash # Search for function definitions that might handle the where_clause rg -n "function.*where_clause" --type=ts -B 2 -A 10

Repository: AltimateAI/altimate-code

Length of output: 50

🏁 Script executed:

#!/antml:bin/bash # Read the full context around line 770-790 to see how where_clause is used sed -n '770,800p' packages/opencode/src/altimate/native/connections/data-diff.ts

Repository: AltimateAI/altimate-code

Length of output: 962

🏁 Script executed:

#!/bin/bash # Check imports at the top of the file head -100 packages/opencode/src/altimate/native/connections/data-diff.ts | tail -50

Repository: AltimateAI/altimate-code

Length of output: 1784

🏁 Script executed:

#!/an/bash # Read the full buildPartitionWhereClause function implementation sed -n '485,530p' packages/opencode/src/altimate/native/connections/data-diff.ts

Repository: AltimateAI/altimate-code

Length of output: 1505

🏁 Script executed:

#!/bin/bash # Look at the entire runDataDiff function to understand how specs are executed sed -n '696,800p' packages/opencode/src/altimate/native/connections/data-diff.ts

Repository: AltimateAI/altimate-code

Length of output: 3597

🏁 Script executed:

#!/bin/bash # Check test files for partition scenarios with different dialects fd -e ts "test|spec" packages/opencode/src/altimate/native/connections/ | xargs grep -l "partition" 2>/dev/null | head -5

Repository: AltimateAI/altimate-code

Length of output: 50

🏁 Script executed:

#!/bin/bash # Search for test cases involving runDataDiff or runPartitionedDiff rg -n "runDataDiff|runPartitionedDiff" --type=ts packages/opencode/src/altimate/tools/ -B 2 -A 2

Repository: AltimateAI/altimate-code

Length of output: 50

🏁 Script executed:

#!/bin/bash # Check if dateTruncExpr is dialect-specific rg -n "dateTruncExpr" --type=ts -B 2 -A 10

Repository: AltimateAI/altimate-code

Length of output: 4068

Cross-database partitioned diffs generate WHERE clauses using only the source dialect, causing incompatibility when target uses a different dialect.

When date-based partitioning is enabled, buildPartitionWhereClause generates dialect-specific SQL expressions (e.g., PostgreSQL uses DATE_TRUNC(), ClickHouse uses toStartOfDay()). The partition loop at lines 660–672 calls this function only with sourceDialect, producing a WHERE clause tailored to the source database. This clause is then passed to runDataDiff as a single where_clause value that applies to both source and target tables via the DataParitySession spec.

When sourceDialect !== targetDialect, the target database will fail to parse source-dialect-specific syntax. Partition discovery (lines 609–625) correctly generates separate SQL per dialect; apply the same pattern to partition WHERE clauses by generating separate clauses per dialect when they differ.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines 660 - 667, The partition WHERE clauses are being generated only for sourceDialect which breaks target parsing when sourceDialect !== targetDialect; update the partition loop that calls buildPartitionWhereClause (inside the partitionValues iteration) to produce dialect-specific where clauses for both sourceDialect and targetDialect (similar to how partition discovery does), storing them separately and passing both appropriately into runDataDiff / the DataParitySession spec so the source uses the source WHERE and the target uses the target WHERE; ensure you reference the same params (partition_column, partition_granularity, partition_bucket_size) when calling buildPartitionWhereClause for each dialect.

Duplicate of earlier comment — see reply above.

coderabbitai · 2026-04-03T21:15:21Z

packages/opencode/src/altimate/tools/data-diff.ts

+    await ctx.ask({
+      permission: "sql_execute_read",
+      patterns: [args.source.slice(0, 120), args.target.slice(0, 120)],
+      always: ["*"],
+      metadata: {},
+    })


⚠️ Potential issue | 🟡 Minor

Permission check validates source pattern only, not target warehouse.

The permission check passes args.source and args.target as patterns, but these are table names or SQL queries — not warehouse connection identifiers. If source_warehouse and target_warehouse differ, a user with read permission on source_warehouse but not target_warehouse could still initiate queries against the unauthorized warehouse.

Consider validating warehouse-level permissions separately if the permission system supports it.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/opencode/src/altimate/tools/data-diff.ts` around lines 82 - 87, The current ctx.ask call passes args.source and args.target (table names/queries) to the "sql_execute_read" permission check, which doesn't validate warehouse-level access; update the permission validation to explicitly include warehouse identifiers so both source and target warehouses are authorized — e.g., change the ctx.ask invocation in data-diff.ts (the call to ctx.ask for "sql_execute_read") to include args.source_warehouse and args.target_warehouse in the patterns or perform separate ctx.ask calls for each warehouse, ensuring both warehouses are checked before executing queries.

Not applicable — sql_execute_read is the correct permission. The data_diff tool only executes SELECT queries (via the Rust engine's cooperative protocol). No DDL/DML is ever executed. The permission check exists to surface the tool's database access to the user for approval.

@suryaiyer95, understood — thanks for the clarification. The sql_execute_read permission check is an approval surface for the user (not an access-control gate), and since the Rust engine's cooperative protocol only ever emits SELECT queries, the permission type and patterns are appropriate as-is.

✏️ Learnings added

Learnt from: suryaiyer95 Repo: AltimateAI/altimate-code PR: 493 File: packages/opencode/src/altimate/tools/data-diff.ts:82-87 Timestamp: 2026-04-03T22:44:39.345Z Learning: In `packages/opencode/src/altimate/tools/data-diff.ts`, the `ctx.ask({ permission: "sql_execute_read", ... })` call in `DataDiffTool.execute` is an **approval surface** — it surfaces the tool's pending database access to the user for explicit approval before any queries run. It is not a warehouse-level access-control check. The `data_diff` tool exclusively executes SELECT queries (via the Rust `DataParitySession` cooperative protocol — no DDL/DML is ever issued), so `sql_execute_read` is the correct and sufficient permission type. Passing `args.source` and `args.target` as patterns provides the user with context about what will be queried.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

suryaiyer95

Multi-Model Code Review — PR #493 + PR #114

Verdict: REQUEST CHANGES
Critical Issues: 1 | Major Issues: 6 | Minor Issues: 4 | Nits: 3

Reviewed by 3 AI models: Claude, Gemini 3.1 Pro, Grok 4. 1 convergence round. Full review posted on companion PR altimate-core-internal #114.

This PR's TypeScript orchestrator is well-crafted — the noLimit driver flag, CTE injection, and column auto-discovery are clean implementations. The issues below are specific to this PR.

Issues in this PR

isQuery detection could misclassify table names (Minor — Logic Error)

File: packages/opencode/src/altimate/native/connections/data-diff.ts:23-26
Regex /^\s*(SELECT|WITH|VALUES)\b/i misclassifies table names like with_metadata or select_results.
Fix: Document the limitation or require explicit signaling (e.g., a separate sourceIsQuery parameter).
Flagged by: Claude

Implicit truthy check on effectiveLimit (Nit — Code Quality)

File: packages/drivers/src/*.ts (all 12 driver files)
The limit-appending condition checks effectiveLimit && !/\bLIMIT\b/i.test(sql). Since noLimit: true sets effectiveLimit to 0 (falsy), this works but relies on implicit coercion.
Fix: Change to effectiveLimit > 0 for explicit intent.
Flagged by: Gemini

Positive Observations

noLimit flag — Clean, consistent addition across all 10+ drivers with backwards-compatible optional parameter
Auto-timestamp exclusion — Two-layer detection (name patterns + schema-level DEFAULT analysis) is smart
ClickHouse stderr silencing — clientConfig.log = { level: 127 } prevents raw ERROR output from corrupting terminal TUI
CTE injection — injectCte correctly handles merging CTE blocks when engine emits its own WITH clauses
Well-designed SKILL.md — Critical sections on algorithm selection, cost-aware interaction, joindiff cross-DB trap

Missing Tests

No test for isQuery edge cases (table names like with_metadata)
No test for CTE injection edge cases (nested/RECURSIVE CTEs)
No partitioned diff simulation tests end-to-end
No test for discoverExtraColumns with auto-timestamp default detection

See full review with all 14 findings and attribution table on altimate-core-internal PR #114.

suryaiyer95 · 2026-04-03T21:35:13Z

Follow-up: Review Issue Status (post-fixes)

All major issues from the review have been addressed. Here's the verified status:

FIXED

Issue	Evidence
Stale partition fallback outcome shape	Both fallback sites now use `{ mode: "diff", stats: { rows_table1: 0, ... }, diff_rows: [] }` — matches Rust schema
`effectiveLimit` truthy check	`truncated` check now uses `effectiveLimit > 0` across all 10+ drivers

STILL OPEN (minor)

isQuery regex misclassification

File: packages/opencode/src/altimate/native/connections/data-diff.ts:17-25
Still a pure regex test — table names like with_metadata or select_results would be misclassified as queries.
Low risk in practice (such table names are rare), but worth a comment or a fallback heuristic (e.g., reject if input contains spaces or parentheses).

…ng, query+partition guard - Oracle day granularity: 'DDD' (day-of-year) → 'DD' (day-of-month) - Add `quoteIdentForDialect()` helper: MySQL/ClickHouse use backticks, TSQL/Fabric use brackets, others use ANSI double-quotes - `buildPartitionDiscoverySQL` and `buildPartitionWhereClause` now use dialect-aware quoting instead of hardcoded double-quotes - `runPartitionedDiff` rejects SQL queries as source/target with a clear error — partitioning requires table names to discover column values

coderabbitai

🧹 Nitpick comments (1)

packages/opencode/src/altimate/native/connections/data-diff.ts (1)

766-773: Consider extracting resolveDialect as a module-level helper.

This function is duplicated at lines 623-630. Extract it once and reuse in both runPartitionedDiff and runDataDiff.

♻️ Proposed extraction

Add at module level (e.g., around line 400):

function resolveDialect(warehouse: string | undefined): string {
  if (warehouse) {
    const cfg = Registry.getConfig(warehouse)
    return cfg?.type ?? "generic"
  }
  const warehouses = Registry.list().warehouses
  return warehouses[0]?.type ?? "generic"
}

Then remove the local definitions in both functions and use the shared helper.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@packages/opencode/src/altimate/native/connections/data-diff.ts` around lines
766 - 773, Extract the duplicated resolveDialect logic into a single
module-level helper named resolveDialect and replace the local copies inside
runPartitionedDiff and runDataDiff with calls to that helper; specifically,
remove the duplicate function definitions in both functions, add one shared
resolveDialect(warehouse: string | undefined): string at module scope (copying
the existing logic that uses Registry.getConfig and Registry.list().warehouses)
and update runPartitionedDiff and runDataDiff to call resolveDialect(warehouse)
instead of their local implementations.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/opencode/src/altimate/native/connections/data-diff.ts`:
- Around line 766-773: Extract the duplicated resolveDialect logic into a single
module-level helper named resolveDialect and replace the local copies inside
runPartitionedDiff and runDataDiff with calls to that helper; specifically,
remove the duplicate function definitions in both functions, add one shared
resolveDialect(warehouse: string | undefined): string at module scope (copying
the existing logic that uses Registry.getConfig and Registry.list().warehouses)
and update runPartitionedDiff and runDataDiff to call resolveDialect(warehouse)
instead of their local implementations.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d342a309-c72e-4721-8c31-8272ec693ffa

📥 Commits

Reviewing files that changed from the base of the PR and between e41e5a0 and b8147c9.

📒 Files selected for processing (1)

packages/opencode/src/altimate/native/connections/data-diff.ts

github-actions bot added the contributor label Mar 27, 2026

suryaiyer95 closed this Mar 27, 2026

suryaiyer95 reopened this Mar 27, 2026

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch 2 times, most recently from 0f8c7ac to 7909e55 Compare March 27, 2026 00:41

suryaiyer95 commented Mar 30, 2026

View reviewed changes

suryaiyer95 changed the title ~~feat: data-parity skill — algorithm guardrails and output style~~ feat: data-parity skill — TypeScript orchestrator, ClickHouse driver, partition support Mar 30, 2026

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from 8c7ef31 to 1468eba Compare March 31, 2026 02:03

suryaiyer95 commented Apr 1, 2026

View reviewed changes

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from e1195ba to 161051d Compare April 1, 2026 23:31

suryaiyer95 and others added 14 commits April 2, 2026 11:47

refactor: remove existing-tool improvements — scope to data-diff only

05b6a02

refactor: revert .gitignore changes — scope to data-diff only

6c60be1

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from 38daa6d to 2c58580 Compare April 2, 2026 23:19

suryaiyer95 added 4 commits April 2, 2026 17:06

chore: remove pack-local.ts — dev-only utility, not part of the feature

1bc67ef

feat: add data-parity skill to builder prompt with table and SQL quer…

e41e5a0

…y comparison modes

suryaiyer95 marked this pull request as ready for review April 3, 2026 21:04

Copilot AI review requested due to automatic review settings April 3, 2026 21:04

claude bot reviewed Apr 3, 2026

View reviewed changes

Copilot started reviewing on behalf of suryaiyer95 April 3, 2026 21:05 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

suryaiyer95 commented Apr 3, 2026

View reviewed changes

coderabbitai bot reviewed Apr 3, 2026

View reviewed changes

	connect(): Promise<void>
	connect(): Promise<void>
	execute(sql: string, limit?: number, binds?: any[]): Promise<ConnectorResult>

Conversation

suryaiyer95 commented Mar 27, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Core: TypeScript orchestrator (`data-diff.ts`, +799 lines)

New: ClickHouse driver (`clickhouse.ts`, +135 lines)

`ExecuteOptions` — all 10+ drivers updated

SKILL.md (+341 lines)

Fixes applied (from code review)

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

suryaiyer95 commented Mar 27, 2026

Uh oh!

aidtya commented Mar 28, 2026

Companion fix: column name collision in join_diff_sql

Uh oh!

suryaiyer95 left a comment

Choose a reason for hiding this comment

Multi-Model Code Review — altimate-code#493

Major Issues

Minor Issues

NITs

Positive Observations

Uh oh!

suryaiyer95 left a comment

Choose a reason for hiding this comment

Code Review: Data Parity TypeScript Orchestrator + ClickHouse Driver

Major Issue (1)

1. SQL Injection via buildColumnDiscoverySQL — ClickHouse/Snowflake paths (Security)

Minor Issues (4)

2. Profile outcome formatting uses wrong field names — tools/data-diff.ts

3. CTE injection doesn't handle ClickHouse SETTINGS clause — data-diff.ts:injectCte()

4. Partition discovery fails on query sources — data-diff.ts:runPartitionedDiff()

5. noLimit creates unbounded memory risk for JoinDiff — all 11 driver files

Nit (1)

Positive Observations

Missing Tests

Uh oh!

suryaiyer95 commented Apr 1, 2026

Supplemental Review Findings

MAJOR: Partition discovery misses target-only partitions (data-diff.ts)

MAJOR: buildPartitionDiscoverySQL — unquoted column injection

MINOR: injectCte breaks on WITH RECURSIVE

MINOR: ClickHouse partition WHERE uses ANSI double-quotes instead of backticks

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

suryaiyer95 commented Mar 27, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 27, 2026 •

edited

Loading

Companion fix: column name collision in `join_diff_sql`

1. SQL Injection via `buildColumnDiscoverySQL` — ClickHouse/Snowflake paths (Security)

2. Profile outcome formatting uses wrong field names — `tools/data-diff.ts`

3. CTE injection doesn't handle ClickHouse `SETTINGS` clause — `data-diff.ts:injectCte()`

4. Partition discovery fails on query sources — `data-diff.ts:runPartitionedDiff()`

5. `noLimit` creates unbounded memory risk for JoinDiff — all 11 driver files

MAJOR: Partition discovery misses target-only partitions (`data-diff.ts`)

MAJOR: `buildPartitionDiscoverySQL` — unquoted column injection

MINOR: `injectCte` breaks on `WITH RECURSIVE`

coderabbitai bot Apr 3, 2026 •

edited

Loading

coderabbitai bot Apr 3, 2026 •

edited

Loading

coderabbitai bot Apr 3, 2026 •

edited

Loading