fix: multi-dialect SQL parsing improvements from corpus testing #55

Merged
lustefaniak merged 24 commits into main from corpus-fixes on Mar 20, 2026

@lustefaniak
Collaborator

Summary

  • 21 parsing fixes across BigQuery, Snowflake, Redshift, Databricks, ClickHouse, and generic SQL dialects
  • Driven by corpus testing against ~137K real-world SQL files (98.7% pass rate, up from ~96%)
  • Key improvements: CONNECT BY hierarchical queries, expr.* wildcard expansion, PIVOT subqueries, compound array subscripts, EXECUTE IMMEDIATE, deferred JOIN ON clauses, and many more dialect-specific syntax fixes

Changes by dialect

  • Snowflake: CONNECT BY/PRIOR/CONNECT_BY_ROOT, SELECT * REPLACE, PIVOT IN subquery, function result subscript, COPY GRANTS COMMENT, SET TAG
  • BigQuery: expr.* struct wildcard, deferred ON in nested joins, NOT NULL on STRUCT fields, FULL UNION BY NAME, MERGE NOT MATCHED BY SOURCE/TARGET
  • Redshift: ARRAY(expr) as function call, compound subscript (col[tbl.idx]), table function column type defs, custom EXTRACT date parts + CAST field
  • Databricks: FILTER in expressions, ADD COLUMN AFTER
  • Cross-dialect: GROUPING SETS without inner parens, TABLESAMPLE PERCENT, GRANT/REVOKE wildcards, UNPIVOT column aliases, EXECUTE IMMEDIATE (Trino)
  • Tooling: corpus-runner log escape sequence handling

Fixes 7+ corpus test failures where BigQuery FULL UNION ALL BY NAME
and FULL OUTER UNION ALL BY NAME set operations were misidentified as
FULL OUTER JOIN attempts.

- Add FullByName and FullAllByName variants to SetQuantifier enum
- Update SetExpr::SetOperation Display to emit FULL prefix when needed
- In parse_query_body, detect FULL [OUTER] UNION via lookahead and
  consume the prefix before parsing as a normal UNION set operation
- In parse_table_and_joins, break out of join loop when FULL is followed
  by UNION (not JOIN) to avoid the "Expected JOIN, found: UNION" error
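The lookahead described above can be sketched roughly as follows. This is a standalone illustration over plain string tokens, not the parser's actual code; `full_union_prefix_len` is a hypothetical name.

```rust
/// Returns how many tokens form a `FULL [OUTER]` prefix when that prefix is
/// immediately followed by `UNION`. Returns `None` otherwise, for example
/// for `FULL OUTER JOIN`, so the caller can keep parsing a join.
/// Hypothetical helper: the real parser works on its own token type.
fn full_union_prefix_len(tokens: &[&str]) -> Option<usize> {
    // SQL keywords are case-insensitive.
    let is_kw = |i: usize, kw: &str| {
        tokens.get(i).map_or(false, |t| t.eq_ignore_ascii_case(kw))
    };
    if !is_kw(0, "FULL") {
        return None;
    }
    // OUTER is optional between FULL and UNION.
    let prefix = if is_kw(1, "OUTER") { 2 } else { 1 };
    if is_kw(prefix, "UNION") {
        Some(prefix)
    } else {
        None
    }
}
```

A caller would consume the returned number of tokens and then parse a normal UNION, while the join loop would break out whenever this returns `Some`.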
GROUPING SETS (a, b) without inner parens is valid SQL and is treated as two
single-column grouping sets. Uses parse_tuple(true, true) (lift_singleton),
the same call ROLLUP and CUBE already use. Fixes 994 corpus failures in
unparsed_snowflake/select.
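To illustrate the singleton lifting, here is a minimal sketch that works on the text between the outer parentheses of GROUPING SETS. It assumes plain column names only (no function calls or nested expressions), and both function names are hypothetical, not the parser's API.

```rust
/// One grouping set is a list of column names.
type GroupingSet = Vec<String>;

/// Splits the text inside `GROUPING SETS ( ... )` into grouping sets.
/// A bare item such as `a` is lifted into the one-column set `[a]`.
/// Assumption: items are plain column names or parenthesized column lists.
fn parse_grouping_sets(inner: &str) -> Vec<GroupingSet> {
    let mut sets = Vec::new();
    let mut depth = 0usize;
    let mut current = String::new();
    for ch in inner.chars() {
        match ch {
            '(' => {
                depth += 1;
                current.push(ch);
            }
            ')' => {
                depth = depth.saturating_sub(1);
                current.push(ch);
            }
            // Only a top-level comma separates grouping sets.
            ',' if depth == 0 => {
                sets.push(to_set(&current));
                current.clear();
            }
            _ => current.push(ch),
        }
    }
    if !current.trim().is_empty() {
        sets.push(to_set(&current));
    }
    sets
}

/// Turns `(a, b)` into `[a, b]` and a bare `a` into `[a]`.
fn to_set(item: &str) -> GroupingSet {
    let item = item.trim();
    let body = item
        .strip_prefix('(')
        .and_then(|s| s.strip_suffix(')'))
        .unwrap_or(item);
    body.split(',')
        .map(|c| c.trim().to_string())
        .filter(|c| !c.is_empty())
        .collect()
}
```

With this sketch, `a, b` and `(a), (b)` produce the same two single-column sets.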
Fixes 328 corpus test failures for BigQuery MERGE statements using
the BigQuery-specific WHEN NOT MATCHED BY SOURCE THEN UPDATE/DELETE
and WHEN NOT MATCHED BY TARGET THEN INSERT syntax.

Adds two new MergeClause variants:
- NotMatchedBySourceUpdate { predicate, assignments }
- NotMatchedBySourceDelete(Option<Expr>)

Snowflake allows COPY GRANTS to precede COMMENT = '...' in CREATE VIEW.
Previously the parser only accepted COMMENT before COPY GRANTS.

Also updates Display to emit COPY GRANTS before COMMENT for consistency.

Fixes 193 corpus test failures in unparsed_snowflake/create_view.

…AFTER, and SET TAG

- Handle FILTER (WHERE ...) as an infix operator in parse_infix so aggregate
  expressions with FILTER can participate in arithmetic (e.g., SUM(x) FILTER(WHERE cond) / COUNT(*))
- Support ADD COLUMN ... AFTER col_name in Databricks (and MySQL/Generic) column definitions
- Support ALTER TABLE ... SET TAG qualified.tag.name = 'value' in Snowflake by
  skipping the optional TAG keyword before parsing SET options

Fixes 13 customer corpus failures (8 databricks, 5 snowflake).

Adds parsing of NOT NULL constraint inside STRUCT<> type definitions,
e.g. STRUCT<a INT64 NOT NULL, b BOOL NOT NULL>. Fixes 6 corpus test
failures in customer_bigquery and bigquery dialects.

Adds SnowflakeDialect to the dialect check for wildcard REPLACE options,
enabling SELECT * REPLACE (expr AS col_name, ...) and
SELECT t.* REPLACE (...) syntax. Fixes 126 Snowflake corpus failures.

Extends parse_column_identifier_with_optional_comment to optionally skip
a data type after a column identifier. This enables parsing of PostgreSQL/
Redshift table function syntax where column aliases include type definitions:
  FROM func() alias(col_name type, ...)
e.g. FROM PG_GET_LATE_BINDING_VIEW_COLS() cols(view_schema name, col_num int)

Fixes 702 corpus test failures in unparsed_redshift/declare/.

Allow UNPIVOT IN entries to specify an alias for single-column entries,
e.g. `IN (jan AS january, feb AS february)`. Previously only multi-column
tuple entries supported aliases (e.g. `(Q1, Q2) AS 'semester_1'`).

The alias type was changed from `Option<Value>` to `Option<Ident>` so
that identifiers, quoted strings, and numeric literals are all accepted
uniformly. The Display was updated to omit parentheses for single-column
entries regardless of alias presence.

Fixes 4 corpus test failures across snowflake, redshift, and databricks.

…BY_ROOT and PRIOR operators

Fixes 35 corpus test failures for Oracle/Snowflake hierarchical queries.

- Add CONNECT_BY_ROOT keyword and PRIOR/ConnectByRoot unary operators
- Add NOCYCLE keyword for CONNECT BY NOCYCLE clause
- Add start_with and connect_by fields to Select struct
- Parse START WITH and CONNECT BY clauses in parse_select
- Handle PRIOR and CONNECT_BY_ROOT as prefix operators in parse_prefix
- Reserve CONNECT and START as column alias keywords to prevent implicit alias capture
- Add test_snowflake_connect_by test covering all hierarchical query patterns

Adds support for `EXECUTE IMMEDIATE 'sql' [USING p1, p2]` syntax used by
Trino, Oracle, DB2, and other databases for dynamic SQL execution.

Fixes 126 corpus test failures for unparsed_trino/execute queries.

Fixes 133 corpus test failures for SQL with pattern:
FROM A INNER JOIN B LEFT JOIN C ON c_cond ON a_b_cond

The last ON applies to the most recent join with JoinConstraint::None.
This is valid BigQuery (and Standard SQL) nested join syntax where the
ON clause for an outer join comes after the inner join's full specification.
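The attachment rule for a deferred ON can be sketched as below. `Constraint`, `Join` and `attach_deferred_on` are simplified stand-ins invented for this illustration; the parser's own join types carry more information.

```rust
/// Simplified stand-in for the parser's join constraint (not the real type).
#[derive(Debug, Clone, PartialEq)]
enum Constraint {
    None,
    On(String),
}

/// Simplified join record: a table name plus its constraint.
#[derive(Debug, Clone, PartialEq)]
struct Join {
    table: String,
    constraint: Constraint,
}

/// Attaches a trailing `ON` condition to the most recent join that still has
/// no constraint. Returns `false` when every join is already constrained, in
/// which case the caller should leave the `ON` keyword unconsumed.
fn attach_deferred_on(joins: &mut [Join], condition: &str) -> bool {
    // Search from the last join backwards for an unconstrained one.
    match joins
        .iter_mut()
        .rev()
        .find(|j| j.constraint == Constraint::None)
    {
        Some(join) => {
            join.constraint = Constraint::On(condition.to_string());
            true
        }
        None => false,
    }
}
```

For `FROM A INNER JOIN B LEFT JOIN C ON c_cond ON a_b_cond`, the join to C already holds `c_cond` when the second ON is reached, so `a_b_cond` lands on the join to B.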
Only parse ARRAY(...) as an array subquery when the content starts with
a query keyword (SELECT, VALUES, WITH, TABLE). Otherwise, fall through
to regular function call parsing.

This fixes Redshift's ARRAY(JSON_PARSE(...)) and ARRAY(expr, ...) patterns
where ARRAY is used as a function to construct arrays from expressions.

Fixes 127 corpus test failures (119 redshift, 7 ansi, 1 hive).
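A rough sketch of the keyword check follows. It classifies the raw text between the parentheses, whereas the parser peeks at tokens; `ArrayForm` and `classify_array_content` are hypothetical names.

```rust
/// How `ARRAY( ... )` should be parsed in this sketch.
#[derive(Debug, PartialEq)]
enum ArrayForm {
    Subquery,
    FunctionCall,
}

/// Classifies the content between the parentheses of `ARRAY( ... )` by its
/// first word: a query keyword means an array subquery, anything else is
/// treated as ordinary function-call arguments.
fn classify_array_content(content: &str) -> ArrayForm {
    // Take the leading identifier-like word, if any.
    let first_word: String = content
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect();
    match first_word.to_ascii_uppercase().as_str() {
        "SELECT" | "VALUES" | "WITH" | "TABLE" => ArrayForm::Subquery,
        _ => ArrayForm::FunctionCall,
    }
}
```

Because the whole first word is compared, a column named `selected_col` is not mistaken for the SELECT keyword.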
…arsing

Some corpus files (notably unparsed_redshift) come from query logs where
actual newlines were escaped as literal \n / \r sequences. Unescape them
before parsing so the parser sees proper whitespace tokens.

Fixes 563 corpus test failures (365 + 170 + 28 in unparsed_redshift).
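The unescaping step might look like the sketch below. The function name is hypothetical, and the sketch assumes that only `\n` and `\r` were escaped by the logging layer; it does not try to interpret escaped backslashes or any other sequence.

```rust
/// Replaces the two-character sequences `\n` and `\r` with a real newline
/// and carriage return. Every other character, including a backslash that
/// is followed by something else, is copied through unchanged.
fn unescape_log_newlines(sql: &str) -> String {
    let mut out = String::with_capacity(sql.len());
    let mut chars = sql.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.peek().copied() {
                Some('n') => {
                    chars.next(); // consume the 'n'
                    out.push('\n');
                    continue;
                }
                Some('r') => {
                    chars.next(); // consume the 'r'
                    out.push('\r');
                    continue;
                }
                _ => {} // not an escape this sketch handles
            }
        }
        out.push(c);
    }
    out
}
```

Running this before tokenizing means a logged `SELECT 1\nFROM t` reaches the parser with real whitespace between `1` and `FROM`.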
Parse each part of a GRANT/REVOKE object name allowing `*` as a wildcard
in any position: `*.*`, `*`, `db.*`, `db.schema.*`.

This is used in ClickHouse (GRANT SELECT ON *.* TO user) and other dialects
that support database/table wildcard grants.

Fixes 4 clickhouse corpus test failures.
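Per-part wildcard handling could be sketched like this. `NamePart` and `parse_grant_object_name` are illustrative names, and the sketch accepts only unquoted ASCII identifiers, which is narrower than what the parser handles.

```rust
/// One dot-separated part of a GRANT/REVOKE object name.
#[derive(Debug, PartialEq)]
enum NamePart {
    Wildcard,
    Ident(String),
}

/// Splits a dotted object name and accepts `*` in any position.
/// Returns `None` for empty parts or parts that are not simple identifiers.
/// Assumption: quoted identifiers are out of scope for this sketch.
fn parse_grant_object_name(name: &str) -> Option<Vec<NamePart>> {
    name.split('.')
        .map(|part| match part {
            "*" => Some(NamePart::Wildcard),
            p if !p.is_empty()
                && p.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') =>
            {
                Some(NamePart::Ident(p.to_string()))
            }
            _ => None,
        })
        .collect() // any `None` part makes the whole result `None`
}
```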
TABLESAMPLE SYSTEM (10 PERCENT) is valid syntax in BigQuery and other
dialects. The PERCENT keyword after the fraction value was causing a
parse error ("Expected ), found: PERCENT").

Fixes 24 corpus test failures (23 bigquery, 1 ansi).

Adds SelectItem::ExprWildcard to handle `CAST(...).*` syntax used in
BigQuery to expand all fields of a struct expression as separate columns.

The parse_cast_expr field-access loop was updated to stop before `.*`
so parse_select_item can detect and handle the wildcard expansion.

Fixes 25 corpus test failures (24 tier-1 customer/unparsed BigQuery).

…dx])

In Redshift, bracket-delimited identifiers like [col_name] were being
mistakenly tokenized from expressions like col[tbl.idx] — treating
[tbl.idx] as a bracket identifier rather than an array subscript.

Two fixes:
1. redshift.rs: is_proper_identifier_inside_quotes now scans only up
   to the closing ] and rejects content containing dots, so [tbl.col]
   is treated as an array subscript rather than a bracket identifier.
2. parse_map_key: if a non-keyword word is followed by '.', fall back
   to parse_expr() to handle compound identifiers as subscript indices.

Fixes 13 corpus test failures (all tier-1 unparsed_redshift).
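The first fix can be illustrated with a small standalone check. It models only the two rules named above (stop at the closing `]`, reject dots); any other checks the real is_proper_identifier_inside_quotes performs are omitted, and the function name here is hypothetical.

```rust
/// `rest` is the input immediately after an opening `[`. Returns true when
/// the text up to the first `]` could be a bracket-delimited identifier.
fn looks_like_bracket_identifier(rest: &str) -> bool {
    match rest.find(']') {
        Some(end) => {
            // Scan only up to the closing bracket, never past it.
            let inner = &rest[..end];
            // A dot signals a compound reference such as tbl.idx, which is
            // a subscript expression. Rejecting empty content is an
            // assumption of this sketch, not something the PR states.
            !inner.is_empty() && !inner.contains('.')
        }
        // No closing bracket at all: not an identifier.
        None => false,
    }
}
```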
…l results

Fixes parse_map_key to use parse_expr() for function-call keys, enabling
expressions like ARRAY_SIZE(SPLIT(col, '/')) - 1 as subscript indices.

Fixes 21 corpus test failures for "Expected ), found: ]" error pattern.

Adds PivotValueSource::Subquery variant to allow subqueries inside PIVOT's
IN clause (e.g. PIVOT(SUM(x) FOR col IN (SELECT DISTINCT col FROM t))).
Fixes 12 corpus test failures in customer_snowflake and unparsed_snowflake.

Fixes 14 corpus test failures for Redshift EXTRACT expressions:
- Allow abbreviated date/time parts (d, h, m, qtr, etc.) as custom
  identifiers in EXTRACT for RedshiftSqlDialect (previously only
  SnowflakeDialect and GenericDialect were supported)
- Allow CAST('field' AS type) as the date/time field in EXTRACT for
  Redshift/Generic dialects; the inner string literal is mapped to the
  canonical DateTimeField variant (e.g. CAST('epoch' AS VARCHAR) → EPOCH)
- Extract date_time_field_from_str() helper to share string-to-field
  mapping between the SingleQuotedString arm and the new CAST arm
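A shared helper of this shape might look roughly like the sketch below. The enum is a reduced stand-in for the parser's DateTimeField, `FieldSyntax` and `resolve_field` are hypothetical, and the exact set of recognized strings is an assumption made for illustration.

```rust
/// Reduced stand-in for the parser's date/time field enum.
#[derive(Debug, PartialEq)]
enum DateTimeField {
    Year,
    Month,
    Day,
    Hour,
    Minute,
    Second,
    Epoch,
    /// Anything not recognized is kept as a custom field.
    Custom(String),
}

/// Maps the text of a string literal to a field, case-insensitively.
fn date_time_field_from_str(s: &str) -> DateTimeField {
    match s.trim().to_ascii_lowercase().as_str() {
        "year" => DateTimeField::Year,
        "month" => DateTimeField::Month,
        "day" => DateTimeField::Day,
        "hour" => DateTimeField::Hour,
        "minute" => DateTimeField::Minute,
        "second" => DateTimeField::Second,
        "epoch" => DateTimeField::Epoch,
        _ => DateTimeField::Custom(s.trim().to_string()),
    }
}

/// The two syntaxes that share the helper in this sketch.
enum FieldSyntax<'a> {
    /// EXTRACT('epoch' FROM ts)
    Quoted(&'a str),
    /// EXTRACT(CAST('epoch' AS VARCHAR) FROM ts)
    CastOfQuoted(&'a str),
}

/// Both arms delegate to the same string-to-field mapping.
fn resolve_field(syntax: FieldSyntax<'_>) -> DateTimeField {
    match syntax {
        FieldSyntax::Quoted(s) | FieldSyntax::CastOfQuoted(s) => date_time_field_from_str(s),
    }
}
```

Keeping the mapping in one function means the quoted-string arm and the CAST arm cannot drift apart.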
@github-actions

github-actions bot commented Mar 20, 2026

Corpus Parsing Report

Total: 135445 passed, 773 failed (99.4% pass rate)

Changes

99 improvement(s) (now passing)
31132 new test(s)

By Dialect

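| Dialect | Passed | Failed | Total | Pass Rate | Delta |
|---|---|---|---|---|---|
| ansi | 598 | 114 | 712 | 84.0% | +15 |
| bigquery | 42917 | 79 | 42996 | 99.8% | +7724 |
| clickhouse | 3912 | 89 | 4001 | 97.8% | +4 |
| databricks | 1371 | 21 | 1392 | 98.5% | +10 |
| duckdb | 713 | 62 | 775 | 92.0% | +1 |
| hive | 37 | 24 | 61 | 60.7% | +5 |
| mysql | 151 | 68 | 219 | 68.9% | +1 |
| postgres | 1503 | 65 | 1568 | 95.9% | +1 |
| redshift | 14904 | 48 | 14952 | 99.7% | +1455 |
| snowflake | 69089 | 182 | 69271 | 99.7% | +21857 |
| sqlite | 44 | 16 | 60 | 73.3% | - |
| trino | 206 | 5 | 211 | 97.6% | +126 |
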
✅ Improvements (99)
bigquery/alter/150bfff3a15e.sql
bigquery/create_table/b8802f365ec2.sql
bigquery/merge/323521517d8d.sql
bigquery/select/dadc9ca11b95.sql
clickhouse/_blocks/dba1317d8a02.sql
clickhouse/dcl/1ccdb967d5f9.sql
clickhouse/grant/b661475a38a0.sql
clickhouse/revoke/eadbb647b5bf.sql
customer_bigquery/create_table/17ee48dfcc59.sql
customer_bigquery/create_table/1d10a4d4f841.sql
... and 89 more
➕ New Tests (31132)
unparsed_bigquery/create_function/faf83b10c56a.sql
unparsed_bigquery/create_table/010e554ae252.sql
unparsed_bigquery/create_table/01c1c5cb4e3f.sql
unparsed_bigquery/create_table/029c2e0b4746.sql
unparsed_bigquery/create_table/02cbd9e383e4.sql
unparsed_bigquery/create_table/052b975d2bf9.sql
unparsed_bigquery/create_table/06e71cfdb58f.sql
unparsed_bigquery/create_table/070ac347b8fa.sql
unparsed_bigquery/create_table/0ade2954efd1.sql
unparsed_bigquery/create_table/0c1cb362e57e.sql
... and 31122 more

… ON DUPLICATE KEY

Two fixes:

1. Tokenizer: Track prev_char to disambiguate bracket-delimited identifiers
   vs array subscripts. When '[' immediately follows a word character, ')'
   or ']' (no whitespace), treat as subscript not identifier. Previously
   is_proper_identifier_inside_quotes scanned past ']' and relied on
   accidentally finding '(' later in the file — a fragile heuristic that
   broke when we stopped scanning past ']'.
   Fixes: 1e7fddb56246, fef8d2b39914, d22ed4e35ac0

2. Parser: Deferred ON clause loop now checks for unconstrained joins
   before consuming ON keyword, preventing it from eating MySQL's
   ON DUPLICATE KEY UPDATE clause.
   Fixes: bc74e34ece55
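The prev_char rule from the first fix can be sketched as follows. The function names are hypothetical, and treating "word character" as alphanumeric or underscore is an assumption of this sketch.

```rust
/// Decides how to treat a `[` given the character immediately before it.
/// A preceding word character, `)` or `]` (with no whitespace in between)
/// means the bracket opens a subscript rather than a quoted identifier.
fn bracket_starts_subscript(prev_char: Option<char>) -> bool {
    match prev_char {
        Some(c) => c.is_alphanumeric() || c == '_' || c == ')' || c == ']',
        // `[` at the very start of input cannot be a subscript.
        None => false,
    }
}

/// Returns the byte offsets of every `[` that would be read as a subscript.
fn subscript_bracket_offsets(sql: &str) -> Vec<usize> {
    let mut prev = None;
    let mut hits = Vec::new();
    for (i, c) in sql.char_indices() {
        if c == '[' && bracket_starts_subscript(prev) {
            hits.push(i);
        }
        prev = Some(c);
    }
    hits
}
```

Because the decision depends only on the single preceding character, it does not need to scan past the closing `]` at all.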
- Redshift: ARRAY[col] inside function calls must parse as array literal,
  not as bracket-quoted identifier; bracket identifiers with spaces still work
- MySQL: INSERT...SELECT...UNION SELECT...ON DUPLICATE KEY UPDATE must not
  have its ON consumed by the deferred join ON clause logic

@lustefaniak merged commit d5137c6 into main on Mar 20, 2026
3 checks passed
@lustefaniak deleted the corpus-fixes branch on March 20, 2026 09:21