fix: batch corpus parser fixes — 26 syntax improvements across all dialects #58
Merged
lustefaniak merged 27 commits into main on Mar 26, 2026
…syntax Snowflake supports both `ARRAY` (unparameterized) and `ARRAY(INT)`, `ARRAY(DECIMAL(38,0))`, etc. Reuses the existing `ArrayElemTypeDef::Parenthesis` variant (shared with ClickHouse). Fixes 7 corpus test failures.
BigQuery supports `CREATE TABLE t COPY source_table` to clone tables. Adds a `copy` field to CreateTable AST, similar to the existing `clone` field. Carefully avoids consuming COPY when followed by GRANTS (Snowflake's COPY GRANTS syntax). Fixes 3 corpus test failures.
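The COPY versus COPY GRANTS distinction comes down to one token of lookahead. A minimal sketch, assuming a pre-split token slice; `AfterName` and `classify_copy` are hypothetical names for illustration, not the parser's actual API:

```rust
/// What follows the table name in a simplified CREATE TABLE (hypothetical type).
#[derive(Debug, PartialEq)]
enum AfterName<'a> {
    /// BigQuery `COPY source_table`.
    Copy(&'a str),
    /// Snowflake `COPY GRANTS`; no copy source is recorded.
    CopyGrants,
    /// Anything else is left for the rest of the parser.
    Other,
}

/// Classify the tokens after the table name using one token of lookahead.
fn classify_copy<'a>(tokens: &[&'a str]) -> AfterName<'a> {
    match tokens {
        [first, second, ..] if first.eq_ignore_ascii_case("COPY") => {
            if second.eq_ignore_ascii_case("GRANTS") {
                // Do not consume COPY as a clone source here.
                AfterName::CopyGrants
            } else {
                AfterName::Copy(*second)
            }
        }
        _ => AfterName::Other,
    }
}
```

With this shape, `classify_copy(&["COPY", "GRANTS"])` yields `CopyGrants`, while `classify_copy(&["COPY", "source_table"])` carries the source name.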
…ABLE/ALTER TABLE
- CREATE SCHEMA: add OPTIONS(...) and DEFAULT COLLATE 'value' support
- CREATE TABLE: add DEFAULT COLLATE 'value' support
- ALTER TABLE: add SET DEFAULT COLLATE 'value' operation
Fixes 11 corpus test failures (all BigQuery).
…syntax
- MATCH_RECOGNIZE: consume balanced parens as opaque clause after table factors, preserving table reference for lineage. Fixes Snowflake pattern matching queries.
- DEFAULT COLLATE: support on CREATE SCHEMA, CREATE TABLE, and ALTER TABLE SET DEFAULT COLLATE for BigQuery.
- CREATE SCHEMA OPTIONS(...): support BigQuery schema-level options.
Fixes 13 corpus test failures (Snowflake, BigQuery, Trino).
Snowflake's CREATE TABLE ... CLONE source BEFORE (STATEMENT => 'id') and AT (TIMESTAMP => ...) time travel clauses are now consumed as balanced parenthesized blocks after the CLONE source. Also adds BEFORE keyword to keywords.rs. Fixes 5 corpus test failures (Snowflake).
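Consuming a clause as a balanced parenthesized block can be sketched as a depth counter over tokens. This is an illustrative sketch only: `skip_balanced_parens` is a hypothetical helper working on a plain string slice, not the parser's real token stream.

```rust
/// Return the index just past the `)` that closes the `(` at `tokens[start]`,
/// or `None` if `tokens[start]` is not `(` or the parentheses never balance.
fn skip_balanced_parens(tokens: &[&str], start: usize) -> Option<usize> {
    if tokens.get(start) != Some(&"(") {
        return None;
    }
    let mut depth = 0usize;
    for (offset, tok) in tokens[start..].iter().enumerate() {
        match *tok {
            "(" => depth += 1,
            ")" => {
                // depth is at least 1 here because the first token is `(`.
                depth -= 1;
                if depth == 0 {
                    return Some(start + offset + 1);
                }
            }
            _ => {} // contents are opaque: skipped without interpretation
        }
    }
    None
}
```

The same idea applies wherever the commits describe a clause as "consumed as balanced parens": the contents are skipped, and parsing resumes at the returned index.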
ClickHouse supports `LEFT ARRAY JOIN` in addition to `ARRAY JOIN`. Adds `JoinOperator::LeftArray` variant and parses it when LEFT is followed by ARRAY JOIN keyword. Fixes 7 corpus test failures (ClickHouse).
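The keyword lookahead for `LEFT ARRAY JOIN` might look roughly like the following. `JoinKind` is a simplified stand-in for the real `JoinOperator` enum, and `parse_join_kind` is a hypothetical function operating on pre-split tokens.

```rust
/// Simplified stand-in for the join operator AST node (hypothetical).
#[derive(Debug, PartialEq)]
enum JoinKind {
    Array,
    LeftArray,
    LeftOuter,
}

/// Recognize the join keywords at the front of `tokens`.
/// Returns the join kind and how many tokens were consumed.
fn parse_join_kind(tokens: &[&str]) -> Option<(JoinKind, usize)> {
    // Uppercase at most three tokens so keyword matching is case-insensitive.
    let upper: Vec<String> = tokens.iter().take(3).map(|t| t.to_ascii_uppercase()).collect();
    let words: Vec<&str> = upper.iter().map(String::as_str).collect();
    match words.as_slice() {
        // The longer LEFT ARRAY JOIN form must be checked before LEFT JOIN.
        ["LEFT", "ARRAY", "JOIN", ..] => Some((JoinKind::LeftArray, 3)),
        ["ARRAY", "JOIN", ..] => Some((JoinKind::Array, 2)),
        ["LEFT", "JOIN", ..] => Some((JoinKind::LeftOuter, 2)),
        _ => None,
    }
}
```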
BigQuery aggregate functions like ANY_VALUE support `HAVING MAX expr` or `HAVING MIN expr` inside the function call to select the value at the row with the max/min of another column. Adds `HavingBound` enum and `having_bound` field to the `Function` AST. Parsed inside `parse_optional_args_with_orderby` before closing paren. Fixes 7 corpus test failures (BigQuery).
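As a rough illustration of what a `HAVING MAX expr` bound carries, the sketch below splits the argument tokens of a call such as `ANY_VALUE(x HAVING MAX y)`. The names `BoundKind`, `HavingBoundSketch`, and `split_having_bound` are hypothetical, and the real `HavingBound` AST type may be shaped differently.

```rust
/// Which extreme of the bounding expression selects the row (hypothetical).
#[derive(Debug, PartialEq)]
enum BoundKind {
    Min,
    Max,
}

/// Hypothetical shape for the parsed `HAVING MAX|MIN expr` clause.
#[derive(Debug, PartialEq)]
struct HavingBoundSketch {
    kind: BoundKind,
    expr: String,
}

/// Split the tokens between `(` and `)` into the argument text and an
/// optional HAVING bound.
fn split_having_bound(args: &[&str]) -> (String, Option<HavingBoundSketch>) {
    let pos = args.iter().position(|t| t.eq_ignore_ascii_case("HAVING"));
    if let Some(i) = pos {
        let kind = match args.get(i + 1).map(|t| t.to_ascii_uppercase()).as_deref() {
            Some("MAX") => Some(BoundKind::Max),
            Some("MIN") => Some(BoundKind::Min),
            _ => None,
        };
        if let Some(kind) = kind {
            // Everything after MAX/MIN is kept as the bounding expression text.
            let bound = HavingBoundSketch { kind, expr: args[i + 2..].join(" ") };
            return (args[..i].join(" "), Some(bound));
        }
    }
    (args.join(" "), None)
}
```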
BigQuery templated UDFs use `ANY TYPE` as a parameter type for polymorphic functions. Parsed as a custom data type in the data type parser when `ANY` keyword is followed by `TYPE`. Fixes 5 corpus test failures (BigQuery).
ClickHouse uses `FORMAT Values` instead of just `VALUES` for INSERT statements. Treat `FORMAT VALUES` as equivalent to `VALUES` by consuming the FORMAT keyword and leaving VALUES for the normal query parser. Fixes 3 corpus test failures (ClickHouse).
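The "consume FORMAT, leave VALUES" step can be pictured as dropping a single token. A minimal sketch, assuming a token slice and a hypothetical helper name:

```rust
/// If the tokens start with `FORMAT VALUES` (any case), drop `FORMAT` and
/// return the rest so the normal VALUES path can take over.
/// Any other input, including other FORMAT names, is returned unchanged.
fn strip_format_before_values<'a, 'b>(tokens: &'a [&'b str]) -> &'a [&'b str] {
    match tokens {
        [first, second, ..]
            if first.eq_ignore_ascii_case("FORMAT")
                && second.eq_ignore_ascii_case("VALUES") =>
        {
            &tokens[1..]
        }
        _ => tokens,
    }
}
```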
BigQuery remote functions use `REMOTE WITH CONNECTION connection_name` clause. Consumed in the CREATE FUNCTION body parser. Fixes 2 corpus test failures (BigQuery).
Snowflake SHOW statements support `LIKE 'pattern' IN [SCHEMA|DATABASE] name` after the filter clause, not just before it. Parse the IN clause both before and after the LIKE/ILIKE/WHERE filter. Also handles SHOW FUNCTIONS ... IN CLASS name syntax. Fixes 3 corpus test failures (Snowflake).
Allow bare `TYPE data_type` (without `SET DATA`) in ALTER COLUMN for all dialects, not just PostgreSQL. This is supported by Databricks, Redshift, and others. Fixes 2 corpus test failures (Databricks, Redshift).
ClickHouse materialized views support `ALTER TABLE mv MODIFY QUERY SELECT ...` to change the underlying query. Adds `ModifyQuery` variant to `AlterTableOperation`. Fixes 2 corpus test failures (ClickHouse).
PIVOT ... IN (-1, 0, 1) now correctly handles negative number literals by consuming the minus sign before parse_value() and prepending it. Fixes 2 corpus test failures (BigQuery, ANSI).
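The minus-sign handling can be sketched as follows. `parse_signed_number` is a hypothetical helper standing in for the "consume the sign, call `parse_value()`, prepend the sign" sequence; it only accepts plain digit-and-dot literals.

```rust
/// Parse one numeric literal from the front of `tokens`, accepting an
/// optional leading `-` token. Returns the literal text and the number of
/// tokens consumed.
fn parse_signed_number(tokens: &[&str]) -> Option<(String, usize)> {
    // Consume the minus sign first, if present.
    let (negative, rest) = match tokens {
        ["-", rest @ ..] => (true, rest),
        _ => (false, tokens),
    };
    let number = rest.first()?;
    if number.is_empty() || !number.chars().all(|c| c.is_ascii_digit() || c == '.') {
        return None;
    }
    // Prepend the sign to the parsed value.
    let text = if negative { format!("-{number}") } else { number.to_string() };
    Some((text, if negative { 2 } else { 1 }))
}
```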
MySQL supports `FOR SHARE OF t1, t2 SKIP LOCKED` with multiple table references. Consume extra comma-separated names after first.
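Consuming the extra comma-separated names is a small loop. A sketch under the assumption that names and commas arrive as separate tokens; the function name is hypothetical:

```rust
/// Parse `name (, name)*` from the front of `tokens`.
/// Returns the names and the number of tokens consumed.
fn parse_comma_separated_names(tokens: &[&str]) -> (Vec<String>, usize) {
    let mut names = Vec::new();
    let mut pos = 0;
    while let Some(name) = tokens.get(pos) {
        names.push(name.to_string());
        pos += 1;
        // Continue only when a comma is followed by another token.
        if tokens.get(pos) == Some(&",") && tokens.get(pos + 1).is_some() {
            pos += 1;
        } else {
            break;
        }
    }
    (names, pos)
}
```

Applied to the tokens after `OF` in `FOR SHARE OF t1, t2 SKIP LOCKED`, this stops in front of `SKIP`, leaving the lock options for the caller.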
- CODEC(codec_name) column option: consumed as balanced parens in parse_optional_column_option for ClickHouse dialect.
- Type modifiers now handle = signs (e.g., Dynamic(max_types = 2)) and nested parentheses (e.g., Variant(UInt64, Array(UInt64))).
Fixes 10 corpus test failures (ClickHouse).
ClickHouse/MySQL `RENAME TABLE old TO new` is now parsed as an equivalent `ALTER TABLE old RENAME TO new` statement. Fixes 2 corpus test failures (ClickHouse).
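Mapping `RENAME TABLE` onto the existing ALTER TABLE representation could look like this. The `Statement` enum here is a one-variant stand-in invented for the example, and only the single-pair form `RENAME TABLE old TO new` is handled.

```rust
/// One-variant stand-in for the statement AST (hypothetical).
#[derive(Debug, PartialEq)]
enum Statement {
    AlterTableRenameTo { table: String, new_name: String },
}

/// Parse `RENAME TABLE old TO new` into the equivalent
/// `ALTER TABLE old RENAME TO new` node.
fn parse_rename_table(tokens: &[&str]) -> Option<Statement> {
    match tokens {
        [rename, table_kw, old, to, new]
            if rename.eq_ignore_ascii_case("RENAME")
                && table_kw.eq_ignore_ascii_case("TABLE")
                && to.eq_ignore_ascii_case("TO") =>
        {
            Some(Statement::AlterTableRenameTo {
                table: old.to_string(),
                new_name: new.to_string(),
            })
        }
        _ => None,
    }
}
```

Reusing the ALTER TABLE node means downstream consumers need no new case for the rename form.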
…tring ClickHouse overloads EXTRACT as a regex function: extract(str, pattern). When the first argument after ( is a string literal, parse as a regular function call instead of date extraction. Fixes 2 corpus test failures (ClickHouse).
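The disambiguation rule itself is a one-token check. A sketch, assuming string literals are tokens that begin with a single quote; `ExtractForm` and `classify_extract` are hypothetical names:

```rust
/// How an `EXTRACT(` call should be parsed (hypothetical type).
#[derive(Debug, PartialEq)]
enum ExtractForm {
    /// ClickHouse `extract(str, pattern)`: an ordinary function call.
    RegexFunction,
    /// Standard `EXTRACT(field FROM expr)` date extraction.
    DatePart,
}

/// Decide the form from the first token after the opening parenthesis.
fn classify_extract(first_arg: &str) -> ExtractForm {
    if first_arg.starts_with('\'') {
        ExtractForm::RegexFunction
    } else {
        ExtractForm::DatePart
    }
}
```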
Snowflake allows LIKE filter before the IN clause in SHOW COLUMNS. Parse the filter first, then expect FROM/IN. Fixes 2 corpus test failures (Snowflake).
Redshift supports `ALTER TABLE t ALTER DISTSTYLE KEY DISTKEY col`, `ALTER DISTSTYLE ALL/AUTO/EVEN`, and `ALTER [COMPOUND] SORTKEY (cols)`. These are consumed before the ALTER COLUMN path. Fixes 7 corpus test failures (Redshift).
Previously CREATE FUNCTION was only supported for a few specific dialects. Now all dialects can parse CREATE FUNCTION using the generic PostgreSQL/BigQuery path. Fixes 10 corpus test failures (ANSI, MySQL, etc).
- Snowflake: CREATE DATABASE ... CLONE source [BEFORE/AT (...)] time travel
- Redshift: CREATE TABLE ... BACKUP YES/NO option
Fixes 11 corpus test failures.
ALTER VIEW now handles SET SECURE, SET TBLPROPERTIES, SET COMMENT, UNSET SECURE, etc. by consuming remaining tokens after SET/UNSET. Fixes 3 corpus test failures (Snowflake, Hive).
Standard SQL `COMMENT ON object_type name IS 'comment'` is now parsed. Handles TABLE, COLUMN, DATABASE, SCHEMA, FUNCTION, PROCEDURE, VIEW, INDEX, TYPE, and ROLE object types. Fixes 4 corpus test failures (ANSI).
Remove the `--errors` flag — error messages are now always included in the corpus report. This eliminates a class of false-regression bugs caused by comparing reports from different modes (`fail:` vs bare `fail`). No performance impact since error messages were already computed regardless.
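To see why mixing the two report modes could produce false regressions, consider a comparison that treats statuses as raw strings. The sketch below assumes, hypothetically, that a status is `pass`, bare `fail`, or `fail: <message>`; the actual report format may differ.

```rust
/// True for both the bare `fail` status and the `fail: <message>` form.
fn is_failure(status: &str) -> bool {
    status == "fail" || status.starts_with("fail:")
}

/// A regression is a file that passed before and fails now.
fn is_regression(before: &str, after: &str) -> bool {
    !is_failure(before) && is_failure(after)
}

/// Naive comparison: any textual difference counts as a change, so
/// `fail` versus `fail: <message>` is wrongly reported.
fn naive_changed(before: &str, after: &str) -> bool {
    before != after
}
```

Under these assumptions, `naive_changed("fail", "fail: unexpected token")` is true even though nothing regressed, while `is_regression` returns false. Emitting a single format removes the need for this normalization.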
Corpus Parsing Report
Total: 135573 passed, 645 failed (99.5% pass rate)
Changes
✅ 100 improvement(s) (now passing)
By Dialect
✅ Improvements (100)
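The pass rate in the report follows directly from the two totals. A small sketch of the arithmetic, with a hypothetical function name:

```rust
/// Pass rate as a percentage of all files attempted.
fn pass_rate_percent(passed: u64, failed: u64) -> f64 {
    let total = passed + failed;
    if total == 0 {
        return 0.0; // avoid dividing by zero on an empty corpus
    }
    100.0 * passed as f64 / total as f64
}
```

For the totals above, `format!("{:.1}", pass_rate_percent(135_573, 645))` gives `99.5`.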
Keep span information on aliases in EXPLODE AS (col1, col2) syntax so downstream CLL can use precise source locations for each alias.
LiHRaM approved these changes on Mar 26, 2026
Summary
Batch of 26 parser fixes that raise the number of passing corpus files from 135,473 to 135,573 (+100 files, 99.5% pass rate). Zero regressions across all changes.
Fixes by dialect
BigQuery (8 fixes):
- `CREATE TABLE ... COPY source_table` (table cloning)
- `DEFAULT COLLATE` + `OPTIONS(...)` on CREATE SCHEMA/TABLE/ALTER TABLE
- `HAVING MAX/MIN expr` inside aggregate functions (e.g., `ANY_VALUE(x HAVING MAX y)`)
- `ANY TYPE` parameter type in CREATE FUNCTION (templated UDFs)
- `REMOTE WITH CONNECTION` in CREATE FUNCTION

Snowflake (7 fixes):
- `ARRAY(element_type)` parameterized array type (e.g., `CAST(x AS ARRAY(INT))`)
- `MATCH_RECOGNIZE(...)` consumed as opaque table factor
- `CLONE ... BEFORE/AT (STATEMENT => 'id')` time travel
- `CREATE DATABASE ... CLONE source BEFORE (...)`
- `SHOW TABLES/FUNCTIONS ... LIKE 'pattern' IN [SCHEMA] name`
- `SHOW COLUMNS LIKE 'pattern' IN TABLE/VIEW name`

ClickHouse (7 fixes):
- `LEFT ARRAY JOIN`
- `INSERT INTO ... FORMAT Values (...)`
- `ALTER TABLE MODIFY QUERY SELECT ...` (materialized views)
- `CODEC(...)` column option
- Type modifiers with `=` signs and nested parens (e.g., `Dynamic(max_types = 2)`, `Variant(UInt64, Array(UInt64))`)
- `RENAME TABLE old TO new`
- `EXTRACT(str, pattern)` as regex function (vs date extraction)

Redshift (2 fixes):
- `ALTER TABLE ALTER DISTSTYLE/SORTKEY` operations
- `CREATE TABLE ... BACKUP YES/NO`

Cross-dialect (4 fixes):
- `ALTER COLUMN TYPE` shorthand (was PostgreSQL-only, now all dialects)
- `CREATE FUNCTION` for all dialects (was limited to ~5 dialects)
- `COMMENT ON TABLE/COLUMN/DATABASE ... IS 'comment'`
- `ALTER VIEW SET/UNSET` operations (Snowflake SET SECURE, Hive SET TBLPROPERTIES)
- `FOR SHARE/UPDATE OF t1, t2` with multiple tables

Infrastructure
- `--errors` flag removed: error messages are now always included in reports, eliminating false-regression bugs from format mismatch

Corpus results