fix: batch corpus parser fixes — 26 syntax improvements across all dialects#58

Merged: lustefaniak merged 27 commits into main from lukasz-corpus-fixes-2026-03-26 on Mar 26, 2026

@lustefaniak
Collaborator

Summary

Batch of 26 parser fixes that improve corpus pass rate from 135,473 to 135,573 (+100 files, 98.8% pass rate). Zero regressions across all changes.

Fixes by dialect

BigQuery (6 fixes):

  • CREATE TABLE ... COPY source_table (table cloning)
  • DEFAULT COLLATE + OPTIONS(...) on CREATE SCHEMA/TABLE/ALTER TABLE
  • HAVING MAX/MIN expr inside aggregate functions (e.g., ANY_VALUE(x HAVING MAX y))
  • ANY TYPE parameter type in CREATE FUNCTION (templated UDFs)
  • REMOTE WITH CONNECTION in CREATE FUNCTION
  • Negative numbers in PIVOT IN value lists

Snowflake (6 fixes):

  • ARRAY(element_type) parameterized array type (e.g., CAST(x AS ARRAY(INT)))
  • MATCH_RECOGNIZE(...) consumed as opaque table factor
  • CLONE ... BEFORE/AT (STATEMENT => 'id') time travel
  • CREATE DATABASE ... CLONE source BEFORE (...)
  • SHOW TABLES/FUNCTIONS ... LIKE 'pattern' IN [SCHEMA] name
  • SHOW COLUMNS LIKE 'pattern' IN TABLE/VIEW name

ClickHouse (7 fixes):

  • LEFT ARRAY JOIN
  • INSERT INTO ... FORMAT Values (...)
  • ALTER TABLE MODIFY QUERY SELECT ... (materialized views)
  • CODEC(...) column option
  • Enhanced type modifiers: = signs and nested parens (e.g., Dynamic(max_types = 2), Variant(UInt64, Array(UInt64)))
  • RENAME TABLE old TO new
  • EXTRACT(str, pattern) as regex function (vs date extraction)

Redshift (2 fixes):

  • ALTER TABLE ALTER DISTSTYLE/SORTKEY operations
  • CREATE TABLE ... BACKUP YES/NO

Cross-dialect (5 fixes):

  • ALTER COLUMN TYPE shorthand (was PostgreSQL-only, now all dialects)
  • CREATE FUNCTION for all dialects (was limited to ~5 dialects)
  • COMMENT ON TABLE/COLUMN/DATABASE ... IS 'comment'
  • ALTER VIEW SET/UNSET operations (Snowflake SET SECURE, Hive SET TBLPROPERTIES)
  • FOR SHARE/UPDATE OF t1, t2 with multiple tables

Infrastructure

  • corpus-runner: removed --errors flag — error messages are now always included in reports, eliminating false-regression bugs from format mismatch
  • CLAUDE.md: added balanced paren pattern, dialect struct names, high-churn struct warnings

Corpus results

Total files:  137,169
Passed:       135,573 (98.8%)
Failed:       645 (0.5%)

BigQuery:    99.9%  |  Snowflake:  99.8%  |  Redshift:  99.8%
Databricks:  98.7%  |  ClickHouse: 98.2%  |  Trino:     98.1%
PostgreSQL:  95.9%  |  DuckDB:     92.5%  |  ANSI:      85.5%

…syntax

Snowflake supports both `ARRAY` (unparameterized) and `ARRAY(INT)`,
`ARRAY(DECIMAL(38,0))`, etc. Reuses the existing `ArrayElemTypeDef::Parenthesis`
variant (shared with ClickHouse). Fixes 7 corpus test failures.

BigQuery supports `CREATE TABLE t COPY source_table` to clone tables.
Adds a `copy` field to CreateTable AST, similar to the existing `clone`
field. Carefully avoids consuming COPY when followed by GRANTS
(Snowflake's COPY GRANTS syntax).

Fixes 3 corpus test failures.
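The COPY / COPY GRANTS disambiguation described above amounts to a one-token lookahead. A minimal sketch, assuming a flat token slice; `parse_copy_source` is a hypothetical helper for illustration, not the parser's real API:

```rust
/// Hypothetical helper, not the parser's real API: given the tokens of a
/// CREATE TABLE statement and the index of a possible COPY keyword, return
/// the source table name for BigQuery's `COPY source_table`.
fn parse_copy_source<'a>(tokens: &[&'a str], i: usize) -> Option<&'a str> {
    fn is_kw(tok: &str, kw: &str) -> bool {
        tok.eq_ignore_ascii_case(kw)
    }
    match (tokens.get(i), tokens.get(i + 1)) {
        // `COPY GRANTS` is Snowflake syntax: do not consume it here.
        (Some(c), Some(next)) if is_kw(c, "COPY") && is_kw(next, "GRANTS") => None,
        // Any other token after COPY is treated as the source table name.
        (Some(c), Some(next)) if is_kw(c, "COPY") => Some(*next),
        _ => None,
    }
}
```

Returning `None` for `COPY GRANTS` leaves both tokens in place so that a separate code path can still handle the Snowflake clause.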
…ABLE/ALTER TABLE

- CREATE SCHEMA: add OPTIONS(...) and DEFAULT COLLATE 'value' support
- CREATE TABLE: add DEFAULT COLLATE 'value' support
- ALTER TABLE: add SET DEFAULT COLLATE 'value' operation

Fixes 11 corpus test failures (all BigQuery).

…syntax

- MATCH_RECOGNIZE: consume balanced parens as opaque clause after table
  factors, preserving table reference for lineage. Fixes Snowflake pattern
  matching queries.
- DEFAULT COLLATE: support on CREATE SCHEMA, CREATE TABLE, and
  ALTER TABLE SET DEFAULT COLLATE for BigQuery.
- CREATE SCHEMA OPTIONS(...): support BigQuery schema-level options.

Fixes 13 corpus test failures (Snowflake, BigQuery, Trino).

Snowflake's CREATE TABLE ... CLONE source BEFORE (STATEMENT => 'id')
and AT (TIMESTAMP => ...) time travel clauses are now consumed as
balanced parenthesized blocks after the CLONE source.

Also adds BEFORE keyword to keywords.rs.

Fixes 5 corpus test failures (Snowflake).
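Several fixes in this PR (MATCH_RECOGNIZE, CLONE ... BEFORE/AT, CODEC) consume a balanced parenthesized block without interpreting its contents. A minimal sketch of that pattern, assuming a pre-split token slice; `skip_balanced_parens` is a hypothetical name and the real parser works on its own token type:

```rust
/// Hypothetical helper: `start` must point at "(". Returns the index just
/// past the matching ")", or None if the block is missing or unbalanced.
fn skip_balanced_parens(tokens: &[&str], start: usize) -> Option<usize> {
    if tokens.get(start) != Some(&"(") {
        return None;
    }
    let mut depth = 0usize;
    for (offset, tok) in tokens[start..].iter().enumerate() {
        match *tok {
            "(" => depth += 1,
            ")" => {
                // depth is at least 1 here because the first token is "(".
                depth -= 1;
                if depth == 0 {
                    return Some(start + offset + 1);
                }
            }
            _ => {} // Everything inside the block is treated as opaque.
        }
    }
    None // Ran out of tokens before the block closed.
}
```

The trade-off of the opaque approach is that the clause contents are not represented in the AST; only the surrounding statement is.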
ClickHouse supports `LEFT ARRAY JOIN` in addition to `ARRAY JOIN`.
Adds `JoinOperator::LeftArray` variant and parses it when LEFT is
followed by ARRAY JOIN keyword.

Fixes 7 corpus test failures (ClickHouse).
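A minimal sketch of the keyword sequence check, assuming a flat token slice. Only the `LeftArray` variant name comes from the commit message; the other variants and `parse_join_operator` are hypothetical stand-ins:

```rust
#[derive(Debug, PartialEq)]
enum JoinOperator {
    LeftOuter, // hypothetical stand-in for a plain LEFT JOIN
    Array,     // hypothetical stand-in for ARRAY JOIN
    LeftArray, // variant named in the commit message
}

/// Hypothetical helper: returns the operator and the number of tokens consumed.
fn parse_join_operator(tokens: &[&str]) -> Option<(JoinOperator, usize)> {
    // Normalize case for the (at most) three keywords that matter.
    let upper: Vec<String> = tokens.iter().take(3).map(|t| t.to_ascii_uppercase()).collect();
    let upper: Vec<&str> = upper.iter().map(String::as_str).collect();
    match upper.as_slice() {
        // LEFT followed by ARRAY JOIN must be checked before plain LEFT JOIN.
        ["LEFT", "ARRAY", "JOIN"] => Some((JoinOperator::LeftArray, 3)),
        ["ARRAY", "JOIN", ..] => Some((JoinOperator::Array, 2)),
        ["LEFT", "JOIN", ..] => Some((JoinOperator::LeftOuter, 2)),
        _ => None,
    }
}
```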
BigQuery aggregate functions like ANY_VALUE support `HAVING MAX expr`
or `HAVING MIN expr` inside the function call to select the value at
the row with the max/min of another column.

Adds `HavingBound` enum and `having_bound` field to the `Function` AST.
Parsed inside `parse_optional_args_with_orderby` before closing paren.

Fixes 7 corpus test failures (BigQuery).
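A minimal sketch of how the clause could be recognized, assuming the tokens between the function's parentheses are already split. The `HavingBound` name comes from the commit message; its shape here (a `String` payload) and `parse_having_bound` are assumptions for illustration:

```rust
#[derive(Debug, PartialEq)]
enum HavingBound {
    Max(String), // payload type is an assumption; a real AST would hold an expression
    Min(String),
}

/// Hypothetical helper: `inner` holds the tokens inside the call's parentheses,
/// e.g. ["x", "HAVING", "MAX", "y"] for ANY_VALUE(x HAVING MAX y).
fn parse_having_bound(inner: &[&str]) -> Option<HavingBound> {
    let pos = inner.iter().position(|t| t.eq_ignore_ascii_case("HAVING"))?;
    // Everything after HAVING MAX/MIN is kept as the bound expression.
    let expr = inner.get(pos + 2..)?.join(" ");
    if expr.is_empty() {
        return None;
    }
    match inner.get(pos + 1)?.to_ascii_uppercase().as_str() {
        "MAX" => Some(HavingBound::Max(expr)),
        "MIN" => Some(HavingBound::Min(expr)),
        _ => None,
    }
}
```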
BigQuery templated UDFs use `ANY TYPE` as a parameter type for
polymorphic functions. Parsed as a custom data type in the data type
parser when `ANY` keyword is followed by `TYPE`.

Fixes 5 corpus test failures (BigQuery).

ClickHouse uses `FORMAT Values` instead of just `VALUES` for INSERT
statements. Treat `FORMAT VALUES` as equivalent to `VALUES` by consuming
the FORMAT keyword and leaving VALUES for the normal query parser.

Fixes 3 corpus test failures (ClickHouse).
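The "consume FORMAT, leave VALUES" step can be sketched as a slice rewrite. This assumes a flat token slice positioned right after the INSERT target; `strip_format_keyword` is a hypothetical name:

```rust
/// Hypothetical helper: if the tokens start with FORMAT VALUES (any case),
/// drop FORMAT so the regular VALUES parser sees the rest unchanged.
fn strip_format_keyword<'a, 'b>(tokens: &'b [&'a str]) -> &'b [&'a str] {
    match tokens {
        [f, v, ..] if f.eq_ignore_ascii_case("FORMAT") && v.eq_ignore_ascii_case("VALUES") => {
            &tokens[1..]
        }
        // Other formats (or no FORMAT at all) are left untouched.
        _ => tokens,
    }
}
```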
BigQuery remote functions use `REMOTE WITH CONNECTION connection_name`
clause. Consumed in the CREATE FUNCTION body parser.

Fixes 2 corpus test failures (BigQuery).

Snowflake SHOW statements accept the `IN [SCHEMA|DATABASE] name` clause
after a `LIKE 'pattern'` filter, not just before it. Parse the IN clause
both before and after the LIKE/ILIKE/WHERE filter.

Also handles SHOW FUNCTIONS ... IN CLASS name syntax.

Fixes 3 corpus test failures (Snowflake).
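One way to accept the IN clause on either side of the filter is to loop over the optional clauses instead of fixing their order. A minimal sketch under that assumption; `ShowOptions` and `parse_show_options` are hypothetical names, and only LIKE is modeled as a filter:

```rust
#[derive(Debug, Default, PartialEq)]
struct ShowOptions {
    like: Option<String>,    // pattern without the surrounding quotes
    in_name: Option<String>, // object named by the IN clause
}

/// Hypothetical helper: `tokens` are the tokens after e.g. SHOW TABLES.
fn parse_show_options(tokens: &[&str]) -> ShowOptions {
    let mut opts = ShowOptions::default();
    let mut i = 0;
    while i < tokens.len() {
        match tokens[i].to_ascii_uppercase().as_str() {
            "LIKE" if i + 1 < tokens.len() => {
                opts.like = Some(tokens[i + 1].trim_matches('\'').to_string());
                i += 2;
            }
            "IN" if i + 1 < tokens.len() => {
                let mut j = i + 1;
                // Skip an optional SCHEMA / DATABASE / CLASS qualifier.
                let qualified = matches!(
                    tokens[j].to_ascii_uppercase().as_str(),
                    "SCHEMA" | "DATABASE" | "CLASS"
                );
                if qualified && j + 1 < tokens.len() {
                    j += 1;
                }
                opts.in_name = Some(tokens[j].to_string());
                i = j + 1;
            }
            _ => break, // Unknown token: stop and leave it to the caller.
        }
    }
    opts
}
```

With this shape, `LIKE 't%' IN SCHEMA s` and `IN SCHEMA s LIKE 't%'` produce the same result.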
Allow bare `TYPE data_type` (without `SET DATA`) in ALTER COLUMN for all
dialects, not just PostgreSQL. This is supported by Databricks, Redshift,
and others.

Fixes 2 corpus test failures (Databricks, Redshift).

ClickHouse materialized views support `ALTER TABLE mv MODIFY QUERY
SELECT ...` to change the underlying query. Adds `ModifyQuery` variant
to `AlterTableOperation`.

Fixes 2 corpus test failures (ClickHouse).

PIVOT ... IN (-1, 0, 1) now correctly handles negative number literals
by consuming the minus sign before calling parse_value() and prepending
it to the parsed value.

Fixes 2 corpus test failures (BigQuery, ANSI).
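A minimal sketch of the "consume the minus sign, then prepend it" step, assuming the tokenizer emits `-` and the number as separate tokens. `parse_signed_value` and `is_number` are hypothetical stand-ins, not the parser's `parse_value()`:

```rust
/// Hypothetical check for an unsigned numeric literal token.
fn is_number(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit() || c == '.')
}

/// Hypothetical helper: returns the literal text and the number of tokens consumed.
fn parse_signed_value(tokens: &[&str]) -> Option<(String, usize)> {
    match tokens {
        // A leading minus is consumed first and prepended to the literal.
        ["-", num, ..] if is_number(num) => Some((format!("-{num}"), 2)),
        [num, ..] if is_number(num) => Some((num.to_string(), 1)),
        _ => None,
    }
}
```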
MySQL supports `FOR SHARE OF t1, t2 SKIP LOCKED` with multiple
table references. Consume the extra comma-separated names after the first.

- CODEC(codec_name) column option: consumed as balanced parens in
  parse_optional_column_option for ClickHouse dialect.
- Type modifiers now handle = signs (e.g., Dynamic(max_types = 2))
  and nested parentheses (e.g., Variant(UInt64, Array(UInt64))).

Fixes 10 corpus test failures (ClickHouse).

ClickHouse/MySQL `RENAME TABLE old TO new` is now parsed as an
equivalent `ALTER TABLE old RENAME TO new` statement.

Fixes 2 corpus test failures (ClickHouse).
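The rewrite can be pictured as two surface syntaxes producing one AST node. A minimal sketch; the `Statement::AlterTableRename` shape and `parse_statement` are hypothetical and much simpler than the real AST:

```rust
#[derive(Debug, PartialEq)]
enum Statement {
    // Hypothetical node standing in for ALTER TABLE ... RENAME TO ...
    AlterTableRename { name: String, new_name: String },
}

/// Hypothetical helper: both spellings map to the same statement.
fn parse_statement(tokens: &[&str]) -> Option<Statement> {
    // Uppercase copies are used for keyword matching only;
    // table names are taken from the original tokens to keep their case.
    let upper: Vec<String> = tokens.iter().map(|t| t.to_ascii_uppercase()).collect();
    let kw: Vec<&str> = upper.iter().map(String::as_str).collect();
    let rename = |old: &str, new: &str| Statement::AlterTableRename {
        name: old.to_string(),
        new_name: new.to_string(),
    };
    match kw.as_slice() {
        ["RENAME", "TABLE", _, "TO", _] => Some(rename(tokens[2], tokens[4])),
        ["ALTER", "TABLE", _, "RENAME", "TO", _] => Some(rename(tokens[2], tokens[5])),
        _ => None,
    }
}
```

Reusing the existing node means downstream consumers need no new case for the RENAME TABLE spelling.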
…tring

ClickHouse overloads EXTRACT as a regex function: extract(str, pattern).
When the first argument after ( is a string literal, parse as a regular
function call instead of date extraction.

Fixes 2 corpus test failures (ClickHouse).
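A minimal sketch of the disambiguation rule as stated, assuming string literal tokens keep their single quotes. `ExtractForm` and `classify_extract` are hypothetical names:

```rust
#[derive(Debug, PartialEq)]
enum ExtractForm {
    DateField, // EXTRACT(YEAR FROM ts)
    RegexCall, // extract('haystack', 'pattern') in ClickHouse
}

/// Hypothetical helper: `first_token` is the first token after the opening paren.
fn classify_extract(first_token: &str) -> ExtractForm {
    if first_token.starts_with('\'') {
        // A string literal cannot be a date field, so treat it as a normal call.
        ExtractForm::RegexCall
    } else {
        ExtractForm::DateField
    }
}
```

Under the rule as described, a call whose first argument is not a string literal would still take the date extraction path.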
Snowflake allows a LIKE filter before the IN clause in SHOW COLUMNS.
Parse the filter first, then expect FROM/IN.

Fixes 2 corpus test failures (Snowflake).

Redshift supports `ALTER TABLE t ALTER DISTSTYLE KEY DISTKEY col`,
`ALTER DISTSTYLE ALL/AUTO/EVEN`, and `ALTER [COMPOUND] SORTKEY (cols)`.
These are consumed before the ALTER COLUMN path.

Fixes 7 corpus test failures (Redshift).

Previously CREATE FUNCTION was only supported for a few specific
dialects. Now all dialects can parse CREATE FUNCTION using the
generic PostgreSQL/BigQuery path.

Fixes 10 corpus test failures (ANSI, MySQL, etc.).

- Snowflake: CREATE DATABASE ... CLONE source [BEFORE/AT (...)] time travel
- Redshift: CREATE TABLE ... BACKUP YES/NO option

Fixes 11 corpus test failures.

ALTER VIEW now handles SET SECURE, SET TBLPROPERTIES, SET COMMENT,
UNSET SECURE, etc. by consuming remaining tokens after SET/UNSET.

Fixes 3 corpus test failures (Snowflake, Hive).

Standard SQL `COMMENT ON object_type name IS 'comment'` is now parsed.
Handles TABLE, COLUMN, DATABASE, SCHEMA, FUNCTION, PROCEDURE, VIEW,
INDEX, TYPE, and ROLE object types.

Fixes 4 corpus test failures (ANSI).

Remove the `--errors` flag — error messages are now always included in
the corpus report. This eliminates a class of false-regression bugs
caused by comparing reports from different modes (`fail:` vs bare `fail`).

No performance impact since error messages were already computed regardless.
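To illustrate why mixed report modes produced false regressions, here is a minimal sketch. It assumes each report entry is either a bare status (`fail`) or a status followed by a message (`fail: <message>`); the exact report format and all function names are assumptions:

```rust
/// Extract the status word from an entry such as `fail: <message>` or `fail`.
fn status_of(entry: &str) -> &str {
    entry.split(':').next().unwrap_or(entry).trim()
}

/// Naive comparison: any textual difference counts as a change.
fn changed_naive(before: &str, after: &str) -> bool {
    before != after
}

/// Status-only comparison: the message text is ignored.
fn changed_by_status(before: &str, after: &str) -> bool {
    status_of(before) != status_of(after)
}
```

The PR itself avoids the problem by emitting a single format rather than normalizing at comparison time; the sketch only shows how a textual diff across two formats can flag a file whose status did not change.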

github-actions bot commented Mar 26, 2026

Corpus Parsing Report

Total: 135573 passed, 645 failed (99.5% pass rate)
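The 99.5% here and the 98.8% in the PR summary are consistent with the same pass count divided by different denominators. A small arithmetic check, assuming the 137,169 "Total files" figure includes files counted as neither passed nor failed (this page does not say what those are); `pass_rate` is a hypothetical helper:

```rust
/// Pass rate as a percentage of the chosen denominator.
fn pass_rate(passed: u64, denominator: u64) -> f64 {
    100.0 * passed as f64 / denominator as f64
}

/// Format a rate with one decimal place, as the reports do.
fn fmt_rate(rate: f64) -> String {
    format!("{rate:.1}%")
}
```

With these, `fmt_rate(pass_rate(135_573, 137_169))` gives `98.8%` (all files) and `fmt_rate(pass_rate(135_573, 135_573 + 645))` gives `99.5%` (passed plus failed only).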

Changes

100 improvement(s) (now passing)

By Dialect

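| Dialect    | Passed | Failed | Total | Pass Rate | Delta |
|------------|--------|--------|-------|-----------|-------|
| ansi       | 609    | 103    | 712   | 85.5%     | +11   |
| bigquery   | 42946  | 50     | 42996 | 99.9%     | +27   |
| clickhouse | 3929   | 72     | 4001  | 98.2%     | +17   |
| databricks | 1374   | 18     | 1392  | 98.7%     | +1    |
| duckdb     | 717    | 58     | 775   | 92.5%     | -     |
| hive       | 38     | 23     | 61    | 62.3%     | +1    |
| mysql      | 151    | 68     | 219   | 68.9%     | -     |
| postgres   | 1504   | 64     | 1568  | 95.9%     | -     |
| redshift   | 14923  | 29     | 14952 | 99.8%     | +15   |
| snowflake  | 69131  | 140    | 69271 | 99.8%     | +27   |
| sqlite     | 44     | 16     | 60    | 73.3%     | -     |
| trino      | 207    | 4      | 211   | 98.1%     | +1    |
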
✅ Improvements (100)
bigquery/_blocks/82030aa613f4.sql
bigquery/_blocks/fe7bb3079d59.sql
bigquery/alter/71c0731c7749.sql
bigquery/alter/ad81983a82e1.sql
bigquery/create_function/5c06e755547d.sql
bigquery/create_function/c63e5addf89f.sql
bigquery/create_function/c8366a074228.sql
bigquery/create_function/e2b2e552f736.sql
bigquery/create_schema/1a38ebcf7bf6.sql
bigquery/create_schema/460af1c56a48.sql
... and 90 more

Keep span information on aliases in EXPLODE AS (col1, col2) syntax
so downstream CLL can use precise source locations for each alias.
lustefaniak merged commit 70e6c40 into main on Mar 26, 2026
4 checks passed
lustefaniak deleted the lukasz-corpus-fixes-2026-03-26 branch on March 26, 2026 09:21