fix: multi-dialect SQL parsing improvements from corpus testing by lustefaniak · Pull Request #55 · getsynq/sqlparser-rs

lustefaniak · 2026-03-20T08:18:47Z

Summary

21 parsing fixes across BigQuery, Snowflake, Redshift, Databricks, ClickHouse, and generic SQL dialects
Driven by corpus testing against ~137K real-world SQL files (98.7% pass rate, up from ~96%)
Key improvements: CONNECT BY hierarchical queries, expr.* wildcard expansion, PIVOT subqueries, compound array subscripts, EXECUTE IMMEDIATE, deferred JOIN ON clauses, and many more dialect-specific syntax fixes

Changes by dialect

Snowflake: CONNECT BY/PRIOR/CONNECT_BY_ROOT, SELECT * REPLACE, PIVOT IN subquery, function result subscript, COPY GRANTS COMMENT, SET TAG
BigQuery: expr.* struct wildcard, deferred ON in nested joins, NOT NULL on STRUCT fields, FULL UNION BY NAME, MERGE NOT MATCHED BY SOURCE/TARGET
Redshift: ARRAY(expr) as function call, compound subscript (col[tbl.idx]), table function column type defs, custom EXTRACT date parts + CAST field
Databricks: FILTER in expressions, ADD COLUMN AFTER
Cross-dialect: GROUPING SETS without inner parens, TABLESAMPLE PERCENT, GRANT/REVOKE wildcards, UNPIVOT column aliases, EXECUTE IMMEDIATE (Trino)
Tooling: corpus-runner log escape sequence handling

Fixes 7+ corpus test failures where BigQuery FULL UNION ALL BY NAME and FULL OUTER UNION ALL BY NAME set operations were misidentified as FULL OUTER JOIN attempts.

- Add FullByName and FullAllByName variants to SetQuantifier enum
- Update SetExpr::SetOperation Display to emit FULL prefix when needed
- In parse_query_body, detect FULL [OUTER] UNION via lookahead and consume the prefix before parsing as a normal UNION set operation
- In parse_table_and_joins, break out of join loop when FULL is followed by UNION (not JOIN) to avoid the "Expected JOIN, found: UNION" error
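A minimal sketch of the FULL [OUTER] UNION lookahead, assuming a simplified token model. The `Kw` enum and `full_union_prefix_len` are hypothetical names for illustration and are not the parser's actual types or functions.

```rust
// Hypothetical, simplified token kinds; the real parser uses its own Token type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kw {
    Full,
    Outer,
    Union,
    Join,
    Other,
}

/// Returns how many tokens make up a `FULL [OUTER]` prefix that is followed
/// by `UNION`, or `None` when the upcoming tokens do not form that pattern
/// (for example when `FULL` introduces a join).
fn full_union_prefix_len(tokens: &[Kw]) -> Option<usize> {
    match tokens {
        [Kw::Full, Kw::Union, ..] => Some(1),
        [Kw::Full, Kw::Outer, Kw::Union, ..] => Some(2),
        _ => None,
    }
}
```

Under this model, a caller in the join loop would stop looking for JOIN when the function returns `Some(_)`, and the set-operation parser would skip that many tokens before handling UNION as usual.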

GROUPING SETS (a, b) without inner parens is valid SQL, treated as two single-column grouping sets. Uses parse_tuple(true, true) (lift_singleton) same as ROLLUP and CUBE already do. Fixes 994 corpus failures in unparsed_snowflake/select.
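The singleton-lifting idea can be sketched as follows. This assumes the items inside GROUPING SETS have already been split at top-level commas; `Item` and `lift_singletons` are hypothetical names and do not reflect how parse_tuple is implemented.

```rust
/// One item inside `GROUPING SETS ( ... )` (hypothetical model).
#[derive(Debug, PartialEq)]
enum Item {
    /// A bare expression such as `a`.
    Bare(String),
    /// A parenthesized list such as `(a, b)`.
    Parenthesized(Vec<String>),
}

/// Turns every bare item into a one-element grouping set and keeps
/// parenthesized items as they are.
fn lift_singletons(items: Vec<Item>) -> Vec<Vec<String>> {
    items
        .into_iter()
        .map(|item| match item {
            Item::Bare(col) => vec![col],
            Item::Parenthesized(cols) => cols,
        })
        .collect()
}
```

With this model, `GROUPING SETS (a, b)` yields the two sets `[a]` and `[b]`, while `GROUPING SETS ((a, b), c)` yields `[a, b]` and `[c]`.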

Fixes 328 corpus test failures for BigQuery MERGE statements using the BigQuery-specific WHEN NOT MATCHED BY SOURCE THEN UPDATE/DELETE and WHEN NOT MATCHED BY TARGET THEN INSERT syntax. Adds two new MergeClause variants:

- NotMatchedBySourceUpdate { predicate, assignments }
- NotMatchedBySourceDelete(Option<Expr>)

Snowflake allows COPY GRANTS COMMENT = '...' order in CREATE VIEW. Previously the parser only accepted COMMENT before COPY GRANTS. Also updates Display to emit COPY GRANTS before COMMENT for consistency. Fixes 193 corpus test failures in unparsed_snowflake/create_view.

…AFTER, and SET TAG

- Handle FILTER (WHERE ...) as an infix operator in parse_infix so aggregate expressions with FILTER can participate in arithmetic (e.g., SUM(x) FILTER(WHERE cond) / COUNT(*))
- Support ADD COLUMN ... AFTER col_name in Databricks (and MySQL/Generic) column definitions
- Support ALTER TABLE ... SET TAG qualified.tag.name = 'value' in Snowflake by skipping the optional TAG keyword before parsing SET options

Fixes 13 customer corpus failures (8 databricks, 5 snowflake).

Adds parsing of NOT NULL constraint inside STRUCT<> type definitions, e.g. STRUCT<a INT64 NOT NULL, b BOOL NOT NULL>. Fixes 6 corpus test failures in customer_bigquery and bigquery dialects.

Adds SnowflakeDialect to the dialect check for wildcard REPLACE options, enabling SELECT * REPLACE (expr AS col_name, ...) and SELECT t.* REPLACE (...) syntax. Fixes 126 Snowflake corpus failures.

Extends parse_column_identifier_with_optional_comment to optionally skip a data type after a column identifier. This enables parsing of PostgreSQL/Redshift table function syntax where column aliases include type definitions: `FROM func() alias(col_name type, ...)`, e.g. `FROM PG_GET_LATE_BINDING_VIEW_COLS() cols(view_schema name, col_num int)`. Fixes 702 corpus test failures in unparsed_redshift/declare/.

Allow UNPIVOT IN entries to specify an alias for single-column entries, e.g. `IN (jan AS january, feb AS february)`. Previously only multi-column tuple entries supported aliases (e.g. `(Q1, Q2) AS 'semester_1'`). The alias type was changed from `Option<Value>` to `Option<Ident>` so that identifiers, quoted strings, and numeric literals are all accepted uniformly. The Display was updated to omit parentheses for single-column entries regardless of alias presence. Fixes 4 corpus test failures across snowflake, redshift, and databricks.
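The Display rule described above (no parentheses for single-column entries, with or without an alias) might look roughly like this. `fmt_unpivot_entry` is a hypothetical helper working on plain strings, not the crate's actual Display impl.

```rust
/// Formats one UNPIVOT IN entry (hypothetical helper).
/// Single-column entries are printed without parentheses regardless of alias;
/// multi-column entries are printed as a parenthesized tuple.
fn fmt_unpivot_entry(cols: &[&str], alias: Option<&str>) -> String {
    let body = if cols.len() == 1 {
        cols[0].to_string()
    } else {
        format!("({})", cols.join(", "))
    };
    match alias {
        Some(a) => format!("{} AS {}", body, a),
        None => body,
    }
}
```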

…BY_ROOT and PRIOR operators

Fixes 35 corpus test failures for Oracle/Snowflake hierarchical queries.

- Add CONNECT_BY_ROOT keyword and PRIOR/ConnectByRoot unary operators
- Add NOCYCLE keyword for CONNECT BY NOCYCLE clause
- Add start_with and connect_by fields to Select struct
- Parse START WITH and CONNECT BY clauses in parse_select
- Handle PRIOR and CONNECT_BY_ROOT as prefix operators in parse_prefix
- Reserve CONNECT and START as column alias keywords to prevent implicit alias capture
- Add test_snowflake_connect_by test covering all hierarchical query patterns

Adds support for `EXECUTE IMMEDIATE 'sql' [USING p1, p2]` syntax used by Trino, Oracle, DB2, and other databases for dynamic SQL execution. Fixes 126 corpus test failures for unparsed_trino/execute queries.

Fixes 133 corpus test failures for SQL with the pattern `FROM A INNER JOIN B LEFT JOIN C ON c_cond ON a_b_cond`. The last ON applies to the most recent join with JoinConstraint::None. This is valid BigQuery (and Standard SQL) nested join syntax where the ON clause for an outer join comes after the inner join's full specification.
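A rough sketch of attaching a deferred ON condition, assuming a flattened list of joins where a missing constraint is modeled as `None`. The `Join` struct and `attach_deferred_on` are hypothetical and far simpler than the real AST.

```rust
/// Hypothetical, flattened join model: `on == None` plays the role of
/// JoinConstraint::None.
#[derive(Debug, PartialEq)]
struct Join {
    table: String,
    on: Option<String>,
}

/// Attaches a trailing ON condition to the most recent join that has no
/// constraint yet. Returns false and changes nothing when every join is
/// already constrained, so the caller can leave the ON keyword unconsumed.
fn attach_deferred_on(joins: &mut [Join], cond: &str) -> bool {
    match joins.iter_mut().rev().find(|j| j.on.is_none()) {
        Some(join) => {
            join.on = Some(cond.to_string());
            true
        }
        None => false,
    }
}
```

For the pattern above, `C` already carries `c_cond`, so the trailing `a_b_cond` lands on `B`. Returning false when nothing is unconstrained is one way to avoid swallowing an unrelated ON that belongs to a later clause.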

Only parse ARRAY(...) as an array subquery when the content starts with a query keyword (SELECT, VALUES, WITH, TABLE). Otherwise, fall through to regular function call parsing. This fixes Redshift's ARRAY(JSON_PARSE(...)) and ARRAY(expr, ...) patterns where ARRAY is used as a function to construct arrays from expressions. Fixes 127 corpus test failures (119 redshift, 7 ansi, 1 hive).
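The decision can be illustrated on raw text, although the parser itself inspects tokens rather than strings. `starts_with_query_keyword` is a hypothetical helper that checks the first word of the content inside `ARRAY(...)`.

```rust
/// Returns true when the text inside `ARRAY(...)` begins with a keyword that
/// starts a query (SELECT, VALUES, WITH, TABLE), compared case-insensitively.
/// Hypothetical string-based stand-in for the parser's token lookahead.
fn starts_with_query_keyword(inner: &str) -> bool {
    let first_word = inner
        .trim_start()
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .next()
        .unwrap_or("");
    ["SELECT", "VALUES", "WITH", "TABLE"]
        .iter()
        .any(|kw| first_word.eq_ignore_ascii_case(kw))
}
```

Anything else, such as `JSON_PARSE(col)` or `a, b`, would fall through to regular function-call parsing.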

…arsing

Some corpus files (notably unparsed_redshift) come from query logs where actual newlines were escaped as literal \n / \r sequences. Unescape them before parsing so the parser sees proper whitespace tokens. Fixes 563 corpus test failures (365 + 170 + 28 in unparsed_redshift).
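A minimal sketch of the unescaping step, assuming a plain textual replacement. `unescape_log_newlines` is a hypothetical name; the corpus-runner's actual implementation may differ, for example in how it treats backslashes inside string literals.

```rust
/// Replaces the two-character sequences `\n` and `\r` (a backslash followed
/// by a letter) with real newline and carriage-return characters.
/// Note: this naive version also rewrites such sequences inside SQL string
/// literals, which may or may not be desirable.
fn unescape_log_newlines(sql: &str) -> String {
    sql.replace("\\n", "\n").replace("\\r", "\r")
}
```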

Parse each part of a GRANT/REVOKE object name allowing `*` as a wildcard in any position: `*.*`, `*`, `db.*`, `db.schema.*`. This is used in ClickHouse (GRANT SELECT ON *.* TO user) and other dialects that support database/table wildcard grants. Fixes 4 clickhouse corpus test failures.
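As an illustration of per-part wildcard handling, the sketch below splits an unquoted object name on dots. `Part` and `parse_grant_object` are hypothetical, and quoted identifiers are deliberately out of scope.

```rust
/// One dot-separated part of a GRANT/REVOKE object name (hypothetical model).
#[derive(Debug, PartialEq)]
enum Part {
    Wildcard,
    Name(String),
}

/// Parses names such as `*`, `*.*`, `db.*`, or `db.schema.*`.
/// Returns None when any part is empty (e.g. `db..t`).
fn parse_grant_object(name: &str) -> Option<Vec<Part>> {
    name.split('.')
        .map(|part| match part {
            "" => None,
            "*" => Some(Part::Wildcard),
            other => Some(Part::Name(other.to_string())),
        })
        .collect()
}
```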

TABLESAMPLE SYSTEM (10 PERCENT) is valid syntax in BigQuery and other dialects. The PERCENT keyword after the fraction value was causing a parse error ("Expected ), found: PERCENT"). Fixes 24 corpus test failures (23 bigquery, 1 ansi).

Adds SelectItem::ExprWildcard to handle the `CAST(...).*` syntax used in BigQuery to expand all fields of a struct expression as separate columns. The parse_cast_expr field-access loop was updated to stop before `.*` so parse_select_item can detect and handle the wildcard expansion. Fixes 25 corpus test failures (24 tier-1 customer/unparsed BigQuery).

…dx])

In Redshift, bracket-delimited identifiers like [col_name] were being mistakenly tokenized from expressions like col[tbl.idx], treating [tbl.idx] as a bracket identifier rather than an array subscript. Two fixes:

1. redshift.rs: is_proper_identifier_inside_quotes now scans only up to the closing ] and rejects content containing dots, so [tbl.col] is treated as an array subscript rather than a bracket identifier.
2. parse_map_key: if a no-keyword word is followed by '.', fall back to parse_expr() to handle compound identifiers as subscript indices.

Fixes 13 corpus test failures (all tier-1 unparsed_redshift).

…l results

Fixes parse_map_key to use parse_expr() for function-call keys, enabling expressions like ARRAY_SIZE(SPLIT(col, '/')) - 1 as subscript indices. Fixes 21 corpus test failures for "Expected ), found: ]" error pattern.

Adds PivotValueSource::Subquery variant to allow subqueries inside PIVOT's IN clause (e.g. PIVOT(SUM(x) FOR col IN (SELECT DISTINCT col FROM t))). Fixes 12 corpus test failures in customer_snowflake and unparsed_snowflake.

Fixes 14 corpus test failures for Redshift EXTRACT expressions:

- Allow abbreviated date/time parts (d, h, m, qtr, etc.) as custom identifiers in EXTRACT for RedshiftSqlDialect (previously only SnowflakeDialect and GenericDialect were supported)
- Allow CAST('field' AS type) as the date/time field in EXTRACT for Redshift/Generic dialects; the inner string literal is mapped to the canonical DateTimeField variant (e.g. CAST('epoch' AS VARCHAR) → EPOCH)
- Extract date_time_field_from_str() helper to share string-to-field mapping between the SingleQuotedString arm and the new CAST arm
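A string-to-field helper of this kind could be shaped roughly as below. The enum here is a cut-down, hypothetical stand-in for the crate's DateTimeField, and the signature, the set of accepted names, and the case-insensitive matching are all assumptions rather than the helper's real behavior.

```rust
/// Cut-down, hypothetical stand-in for the real DateTimeField enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DateTimeField {
    Year,
    Month,
    Day,
    Hour,
    Minute,
    Second,
    Epoch,
}

/// Maps the text of a string literal to a canonical field, ignoring case and
/// surrounding whitespace. Unknown names yield None so the caller can fall
/// back to another strategy (assumed behavior).
fn date_time_field_from_str(s: &str) -> Option<DateTimeField> {
    match s.trim().to_ascii_lowercase().as_str() {
        "year" => Some(DateTimeField::Year),
        "month" => Some(DateTimeField::Month),
        "day" => Some(DateTimeField::Day),
        "hour" => Some(DateTimeField::Hour),
        "minute" => Some(DateTimeField::Minute),
        "second" => Some(DateTimeField::Second),
        "epoch" => Some(DateTimeField::Epoch),
        _ => None,
    }
}
```

Sharing one function between the plain string-literal arm and the CAST arm keeps the two code paths from drifting apart.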

github-actions · 2026-03-20T08:20:10Z

Corpus Parsing Report

Total: 135445 passed, 773 failed (99.4% pass rate)

Changes

✅ 99 improvement(s) (now passing)
➕ 31132 new test(s)

By Dialect

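| Dialect | Passed | Failed | Total | Pass Rate | Delta |
|---|---|---|---|---|---|
| ansi | 598 | 114 | 712 | 84.0% | +15 |
| bigquery | 42917 | 79 | 42996 | 99.8% | +7724 |
| clickhouse | 3912 | 89 | 4001 | 97.8% | +4 |
| databricks | 1371 | 21 | 1392 | 98.5% | +10 |
| duckdb | 713 | 62 | 775 | 92.0% | +1 |
| hive | 37 | 24 | 61 | 60.7% | +5 |
| mysql | 151 | 68 | 219 | 68.9% | +1 |
| postgres | 1503 | 65 | 1568 | 95.9% | +1 |
| redshift | 14904 | 48 | 14952 | 99.7% | +1455 |
| snowflake | 69089 | 182 | 69271 | 99.7% | +21857 |
| sqlite | 44 | 16 | 60 | 73.3% | - |
| trino | 206 | 5 | 211 | 97.6% | +126 |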

✅ Improvements (99)

bigquery/alter/150bfff3a15e.sql
bigquery/create_table/b8802f365ec2.sql
bigquery/merge/323521517d8d.sql
bigquery/select/dadc9ca11b95.sql
clickhouse/_blocks/dba1317d8a02.sql
clickhouse/dcl/1ccdb967d5f9.sql
clickhouse/grant/b661475a38a0.sql
clickhouse/revoke/eadbb647b5bf.sql
customer_bigquery/create_table/17ee48dfcc59.sql
customer_bigquery/create_table/1d10a4d4f841.sql
... and 89 more

➕ New Tests (31132)

unparsed_bigquery/create_function/faf83b10c56a.sql
unparsed_bigquery/create_table/010e554ae252.sql
unparsed_bigquery/create_table/01c1c5cb4e3f.sql
unparsed_bigquery/create_table/029c2e0b4746.sql
unparsed_bigquery/create_table/02cbd9e383e4.sql
unparsed_bigquery/create_table/052b975d2bf9.sql
unparsed_bigquery/create_table/06e71cfdb58f.sql
unparsed_bigquery/create_table/070ac347b8fa.sql
unparsed_bigquery/create_table/0ade2954efd1.sql
unparsed_bigquery/create_table/0c1cb362e57e.sql
... and 31122 more

… ON DUPLICATE KEY

Two fixes:

1. Tokenizer: Track prev_char to disambiguate bracket-delimited identifiers vs array subscripts. When '[' immediately follows a word character, ')' or ']' (no whitespace), treat as subscript not identifier. Previously is_proper_identifier_inside_quotes scanned past ']' and relied on accidentally finding '(' later in the file, a fragile heuristic that broke when we stopped scanning past ']'. Fixes: 1e7fddb56246, fef8d2b39914, d22ed4e35ac0
2. Parser: Deferred ON clause loop now checks for unconstrained joins before consuming ON keyword, preventing it from eating MySQL's ON DUPLICATE KEY UPDATE clause. Fixes: bc74e34ece55
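The prev_char rule from the tokenizer fix can be sketched as a small predicate. `bracket_opens_subscript` is a hypothetical name, and treating "word character" as alphanumeric or underscore is an assumption.

```rust
/// Decides whether a '[' should start an array subscript, based only on the
/// character immediately before it. A preceding word character, ')' or ']'
/// means subscript; whitespace, other punctuation, or start of input leaves
/// the bracket-identifier interpretation open.
fn bracket_opens_subscript(prev_char: Option<char>) -> bool {
    matches!(
        prev_char,
        Some(c) if c.is_alphanumeric() || c == '_' || c == ')' || c == ']'
    )
}
```

Under this rule `col[1]`, `f(x)[1]`, and `a[1][2]` are subscripts, while `SELECT [my col]` is still a candidate bracket identifier because the '[' follows a space.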

- Redshift: ARRAY[col] inside function calls must parse as array literal, not as bracket-quoted identifier; bracket identifiers with spaces still work
- MySQL: INSERT...SELECT...UNION SELECT...ON DUPLICATE KEY UPDATE must not have its ON consumed by the deferred join ON clause logic

lustefaniak added 22 commits March 20, 2026 02:23

fix-corpus-loop.sh

e878c88

fix(bigquery): support NOT NULL constraints on STRUCT field definitions

20dfea5

fix(snowflake): support SELECT * REPLACE (expr AS col) wildcard syntax

cbcf42c

fix(trino): support EXECUTE IMMEDIATE syntax with optional USING clause

272b04e

fix(snowflake): support subquery in PIVOT IN clause

11764cb

lustefaniak requested review from LiHRaM, grasskode and zdenal March 20, 2026 08:40

lustefaniak added 2 commits March 20, 2026 09:56

LiHRaM approved these changes Mar 20, 2026

lustefaniak merged commit d5137c6 into main Mar 20, 2026
3 checks passed

lustefaniak deleted the corpus-fixes branch March 20, 2026 09:21
