fix: multi-dialect SQL parsing improvements from corpus testing #55
Merged
lustefaniak merged 24 commits into main on Mar 20, 2026
Fixes 7+ corpus test failures where BigQuery FULL UNION ALL BY NAME and FULL OUTER UNION ALL BY NAME set operations were misidentified as FULL OUTER JOIN attempts.
- Add FullByName and FullAllByName variants to SetQuantifier enum
- Update SetExpr::SetOperation Display to emit FULL prefix when needed
- In parse_query_body, detect FULL [OUTER] UNION via lookahead and consume the prefix before parsing as a normal UNION set operation
- In parse_table_and_joins, break out of join loop when FULL is followed by UNION (not JOIN) to avoid the "Expected JOIN, found: UNION" error
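The FULL [OUTER] UNION lookahead could look roughly like the sketch below. It works on a plain slice of keyword strings rather than the parser's real token stream, and `full_union_prefix_len` is a hypothetical name, not a function from this PR.

```rust
/// Hypothetical helper: if the tokens start with `FULL UNION` or
/// `FULL OUTER UNION`, return how many prefix tokens (1 or 2) should be
/// consumed before parsing a normal UNION. Returns `None` otherwise, e.g.
/// for `FULL OUTER JOIN`, which must stay on the join path.
fn full_union_prefix_len(tokens: &[&str]) -> Option<usize> {
    // Case-insensitive keyword check at position `i`, false when out of range.
    let is = |i: usize, kw: &str| tokens.get(i).map_or(false, |t| t.eq_ignore_ascii_case(kw));

    if !is(0, "FULL") {
        return None;
    }
    if is(1, "UNION") {
        return Some(1); // FULL UNION ...
    }
    if is(1, "OUTER") && is(2, "UNION") {
        return Some(2); // FULL OUTER UNION ...
    }
    None
}
```

The same check can serve both call sites: the set-operation path consumes the prefix when it returns `Some(n)`, and the join loop breaks out instead of expecting JOIN.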
GROUPING SETS (a, b) without inner parens is valid SQL, treated as two single-column grouping sets. Uses parse_tuple(true, true) (lift_singleton) same as ROLLUP and CUBE already do. Fixes 994 corpus failures in unparsed_snowflake/select.
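A minimal sketch of the singleton-lifting behaviour, assuming a simplified text-level splitter instead of the real `parse_tuple` call. The function names are hypothetical and nested parentheses inside an entry are not handled.

```rust
/// Hypothetical: turn one entry such as `a` or `(a, b)` into a grouping set.
/// A bare column is lifted into a single-column set.
fn to_grouping_set(entry: &str) -> Vec<String> {
    let entry = entry.trim();
    // Strip one pair of surrounding parentheses if present.
    let inner = entry
        .strip_prefix('(')
        .and_then(|s| s.strip_suffix(')'))
        .unwrap_or(entry);
    inner
        .split(',')
        .map(|col| col.trim().to_string())
        .filter(|col| !col.is_empty())
        .collect()
}

/// Hypothetical: split the body of `GROUPING SETS (...)` on top-level commas.
fn grouping_sets(body: &str) -> Vec<Vec<String>> {
    let mut sets = Vec::new();
    let mut depth = 0usize;
    let mut current = String::new();
    for c in body.chars() {
        match c {
            '(' => {
                depth += 1;
                current.push(c);
            }
            ')' => {
                depth = depth.saturating_sub(1);
                current.push(c);
            }
            // A comma outside parentheses ends the current entry.
            ',' if depth == 0 => {
                sets.push(to_grouping_set(&current));
                current.clear();
            }
            _ => current.push(c),
        }
    }
    if !current.trim().is_empty() {
        sets.push(to_grouping_set(&current));
    }
    sets
}
```

With this sketch, the body `a, b` yields two single-column sets, while `(a, b), c` yields one two-column set followed by one single-column set.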
Fixes 328 corpus test failures for BigQuery MERGE statements using
the BigQuery-specific WHEN NOT MATCHED BY SOURCE THEN UPDATE/DELETE
and WHEN NOT MATCHED BY TARGET THEN INSERT syntax.
Adds two new MergeClause variants:
- NotMatchedBySourceUpdate { predicate, assignments }
- NotMatchedBySourceDelete(Option<Expr>)
Snowflake allows COPY GRANTS COMMENT = '...' order in CREATE VIEW. Previously the parser only accepted COMMENT before COPY GRANTS. Also updates Display to emit COPY GRANTS before COMMENT for consistency. Fixes 193 corpus test failures in unparsed_snowflake/create_view.
…AFTER, and SET TAG
- Handle FILTER (WHERE ...) as an infix operator in parse_infix so aggregate expressions with FILTER can participate in arithmetic (e.g., SUM(x) FILTER(WHERE cond) / COUNT(*))
- Support ADD COLUMN ... AFTER col_name in Databricks (and MySQL/Generic) column definitions
- Support ALTER TABLE ... SET TAG qualified.tag.name = 'value' in Snowflake by skipping the optional TAG keyword before parsing SET options
Fixes 13 customer corpus failures (8 databricks, 5 snowflake).
Adds parsing of NOT NULL constraint inside STRUCT<> type definitions, e.g. STRUCT<a INT64 NOT NULL, b BOOL NOT NULL>. Fixes 6 corpus test failures in customer_bigquery and bigquery dialects.
Adds SnowflakeDialect to the dialect check for wildcard REPLACE options, enabling SELECT * REPLACE (expr AS col_name, ...) and SELECT t.* REPLACE (...) syntax. Fixes 126 Snowflake corpus failures.
Extends parse_column_identifier_with_optional_comment to optionally skip a data type after a column identifier. This enables parsing of PostgreSQL/Redshift table function syntax where column aliases include type definitions:
FROM func() alias(col_name type, ...)
e.g. FROM PG_GET_LATE_BINDING_VIEW_COLS() cols(view_schema name, col_num int)
Fixes 702 corpus test failures in unparsed_redshift/declare/.
Allow UNPIVOT IN entries to specify an alias for single-column entries, e.g. `IN (jan AS january, feb AS february)`. Previously only multi-column tuple entries supported aliases (e.g. `(Q1, Q2) AS 'semester_1'`). The alias type was changed from `Option<Value>` to `Option<Ident>` so that identifiers, quoted strings, and numeric literals are all accepted uniformly. The Display was updated to omit parentheses for single-column entries regardless of alias presence. Fixes 4 corpus test failures across snowflake, redshift, and databricks.
…BY_ROOT and PRIOR operators
Fixes 35 corpus test failures for Oracle/Snowflake hierarchical queries.
- Add CONNECT_BY_ROOT keyword and PRIOR/ConnectByRoot unary operators
- Add NOCYCLE keyword for CONNECT BY NOCYCLE clause
- Add start_with and connect_by fields to Select struct
- Parse START WITH and CONNECT BY clauses in parse_select
- Handle PRIOR and CONNECT_BY_ROOT as prefix operators in parse_prefix
- Reserve CONNECT and START as column alias keywords to prevent implicit alias capture
- Add test_snowflake_connect_by test covering all hierarchical query patterns
Adds support for `EXECUTE IMMEDIATE 'sql' [USING p1, p2]` syntax used by Trino, Oracle, DB2, and other databases for dynamic SQL execution. Fixes 126 corpus test failures for unparsed_trino/execute queries.
Fixes 133 corpus test failures for SQL with pattern:
FROM A INNER JOIN B LEFT JOIN C ON c_cond ON a_b_cond
The last ON applies to the most recent join with JoinConstraint::None. This is valid BigQuery (and Standard SQL) nested join syntax where the ON clause for an outer join comes after the inner join's full specification.
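The "most recent join with JoinConstraint::None" rule can be sketched as below. The `Join` struct and `attach_deferred_on` function are simplified stand-ins invented for illustration, not the crate's real AST types or parser code.

```rust
/// Simplified stand-in for the parser's join constraint.
#[derive(Debug, PartialEq)]
enum JoinConstraint {
    None,
    On(String),
}

/// Simplified stand-in for a parsed join.
#[derive(Debug, PartialEq)]
struct Join {
    table: String,
    constraint: JoinConstraint,
}

/// Hypothetical: attach a trailing ON condition to the most recently parsed
/// join that still has no constraint. Returns false when every join is
/// already constrained, in which case the ON keyword should not be consumed.
fn attach_deferred_on(joins: &mut [Join], condition: &str) -> bool {
    match joins
        .iter_mut()
        .rev() // walk from the most recent join backwards
        .find(|j| j.constraint == JoinConstraint::None)
    {
        Some(join) => {
            join.constraint = JoinConstraint::On(condition.to_string());
            true
        }
        None => false,
    }
}
```

For the pattern above, B starts unconstrained and C carries `c_cond`, so the trailing `a_b_cond` lands on B. Checking for an unconstrained join before consuming ON also leaves an unrelated ON keyword for later clauses.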
Only parse ARRAY(...) as an array subquery when the content starts with a query keyword (SELECT, VALUES, WITH, TABLE). Otherwise, fall through to regular function call parsing. This fixes Redshift's ARRAY(JSON_PARSE(...)) and ARRAY(expr, ...) patterns where ARRAY is used as a function to construct arrays from expressions. Fixes 127 corpus test failures (119 redshift, 7 ansi, 1 hive).
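A rough sketch of the ARRAY(...) lookahead, assuming the check is done on the raw text inside the parentheses rather than on tokens. `array_content_is_subquery` is a hypothetical name.

```rust
/// Keywords that, per the commit description, mark ARRAY(...) content as a query.
const QUERY_KEYWORDS: [&str; 4] = ["SELECT", "VALUES", "WITH", "TABLE"];

/// Hypothetical: decide whether the text inside `ARRAY( ... )` should be
/// parsed as a subquery. Anything else falls through to function-call parsing.
fn array_content_is_subquery(inner: &str) -> bool {
    // Take the first identifier-like word so that `with_col` is not
    // mistaken for the keyword WITH.
    let first_word: String = inner
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect();
    QUERY_KEYWORDS
        .iter()
        .any(|kw| first_word.eq_ignore_ascii_case(kw))
}
```

Under this sketch `ARRAY(SELECT x FROM t)` takes the subquery path, while `ARRAY(JSON_PARSE(col))` is treated as a regular function call.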
…arsing Some corpus files (notably unparsed_redshift) come from query logs where actual newlines were escaped as literal \n / \r sequences. Unescape them before parsing so the parser sees proper whitespace tokens. Fixes 563 corpus test failures (365 + 170 + 28 in unparsed_redshift).
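The unescape step might look like the sketch below. `unescape_logged_newlines` is a hypothetical name, and this version is deliberately naive: it also rewrites `\n` sequences that appear inside string literals, which may or may not match what the corpus loader does.

```rust
/// Hypothetical pre-processing step: replace the two-character sequences
/// backslash-n and backslash-r with real newline and carriage-return
/// characters. Other backslash sequences are left untouched.
fn unescape_logged_newlines(sql: &str) -> String {
    let mut out = String::with_capacity(sql.len());
    let mut chars = sql.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.peek().copied() {
                Some('n') => {
                    chars.next(); // consume the 'n'
                    out.push('\n');
                }
                Some('r') => {
                    chars.next(); // consume the 'r'
                    out.push('\r');
                }
                // Not an escaped newline: keep the backslash as-is.
                _ => out.push(c),
            }
        } else {
            out.push(c);
        }
    }
    out
}
```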
Parse each part of a GRANT/REVOKE object name allowing `*` as a wildcard in any position: `*.*`, `*`, `db.*`, `db.schema.*`. This is used in ClickHouse (GRANT SELECT ON *.* TO user) and other dialects that support database/table wildcard grants. Fixes 4 clickhouse corpus test failures.
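As an illustration of per-part wildcard handling, here is a small sketch that splits a grant object name on dots. `GrantNamePart` and `parse_grant_object_name` are hypothetical names, and quoted identifiers containing dots are not handled.

```rust
/// Hypothetical representation of one part of a GRANT/REVOKE object name.
#[derive(Debug, PartialEq)]
enum GrantNamePart {
    Wildcard,
    Ident(String),
}

/// Hypothetical: parse names such as `*.*`, `*`, `db.*`, or `db.schema.*`,
/// accepting `*` in any position.
fn parse_grant_object_name(input: &str) -> Result<Vec<GrantNamePart>, String> {
    input
        .split('.')
        .map(|part| match part.trim() {
            "" => Err(format!("empty name part in {:?}", input)),
            "*" => Ok(GrantNamePart::Wildcard),
            ident => Ok(GrantNamePart::Ident(ident.to_string())),
        })
        .collect() // stops at the first error
}
```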
TABLESAMPLE SYSTEM (10 PERCENT) is valid syntax in BigQuery and other
dialects. The PERCENT keyword after the fraction value was causing a
parse error ("Expected ), found: PERCENT").
Fixes 24 corpus test failures (23 bigquery, 1 ansi).
Adds SelectItem::ExprWildcard to handle `CAST(...).*` syntax used in BigQuery to expand all fields of a struct expression as separate columns. The parse_cast_expr field-access loop was updated to stop before `.*` so parse_select_item can detect and handle the wildcard expansion. Fixes 25 corpus test failures (24 tier-1 customer/unparsed BigQuery).
…dx])
In Redshift, bracket-delimited identifiers like [col_name] were being mistakenly tokenized from expressions like col[tbl.idx], treating [tbl.idx] as a bracket identifier rather than an array subscript.
Two fixes:
1. redshift.rs: is_proper_identifier_inside_quotes now scans only up to the closing ] and rejects content containing dots, so [tbl.col] is treated as an array subscript rather than a bracket identifier.
2. parse_map_key: if a no-keyword word is followed by '.', fall back to parse_expr() to handle compound identifiers as subscript indices.
Fixes 13 corpus test failures (all tier-1 unparsed_redshift).
…l results Fixes parse_map_key to use parse_expr() for function-call keys, enabling expressions like ARRAY_SIZE(SPLIT(col, '/')) - 1 as subscript indices. Fixes 21 corpus test failures for "Expected ), found: ]" error pattern.
Adds PivotValueSource::Subquery variant to allow subqueries inside PIVOT's IN clause (e.g. PIVOT(SUM(x) FOR col IN (SELECT DISTINCT col FROM t))). Fixes 12 corpus test failures in customer_snowflake and unparsed_snowflake.
Fixes 14 corpus test failures for Redshift EXTRACT expressions:
- Allow abbreviated date/time parts (d, h, m, qtr, etc.) as custom
identifiers in EXTRACT for RedshiftSqlDialect (previously only
SnowflakeDialect and GenericDialect were supported)
- Allow CAST('field' AS type) as the date/time field in EXTRACT for
Redshift/Generic dialects; the inner string literal is mapped to the
canonical DateTimeField variant (e.g. CAST('epoch' AS VARCHAR) → EPOCH)
- Extract date_time_field_from_str() helper to share string-to-field
mapping between the SingleQuotedString arm and the new CAST arm
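A shared string-to-field helper in the spirit of date_time_field_from_str() might look like the sketch below. The signature and the reduced `DateTimeField` enum are assumptions made for illustration; the real helper and enum in the crate may differ, and abbreviated parts are not mapped here because the commit treats them as custom identifiers.

```rust
/// Reduced stand-in for the crate's DateTimeField enum (assumed variants).
#[derive(Debug, PartialEq, Clone, Copy)]
enum DateTimeField {
    Year,
    Month,
    Day,
    Hour,
    Minute,
    Second,
    Epoch,
}

/// Assumed signature: map a string literal such as 'epoch' to its canonical
/// field, case-insensitively. Returns None for anything unrecognised.
fn date_time_field_from_str(s: &str) -> Option<DateTimeField> {
    match s.trim().to_ascii_uppercase().as_str() {
        "YEAR" => Some(DateTimeField::Year),
        "MONTH" => Some(DateTimeField::Month),
        "DAY" => Some(DateTimeField::Day),
        "HOUR" => Some(DateTimeField::Hour),
        "MINUTE" => Some(DateTimeField::Minute),
        "SECOND" => Some(DateTimeField::Second),
        "EPOCH" => Some(DateTimeField::Epoch),
        _ => None,
    }
}
```

Both the plain `EXTRACT('epoch' FROM ts)` arm and the `EXTRACT(CAST('epoch' AS VARCHAR) FROM ts)` arm could then call the same helper with the inner string literal.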
Corpus Parsing Report
Total: 135445 passed, 773 failed (99.4% pass rate)
Changes: ✅ 99 improvement(s) (now passing)
✅ Improvements (99)
➕ New Tests (31132)
… ON DUPLICATE KEY
Two fixes:
1. Tokenizer: Track prev_char to disambiguate bracket-delimited identifiers
vs array subscripts. When '[' immediately follows a word character, ')'
or ']' (no whitespace), treat as subscript not identifier. Previously
is_proper_identifier_inside_quotes scanned past ']' and relied on
accidentally finding '(' later in the file — a fragile heuristic that
broke when we stopped scanning past ']'.
Fixes: 1e7fddb56246, fef8d2b39914, d22ed4e35ac0
2. Parser: Deferred ON clause loop now checks for unconstrained joins
before consuming ON keyword, preventing it from eating MySQL's
ON DUPLICATE KEY UPDATE clause.
Fixes: bc74e34ece55
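The prev_char rule from the tokenizer fix can be sketched as follows. This is an assumption-laden illustration: "word character" is taken to mean alphanumeric or underscore, string literals and comments are ignored, and the names are hypothetical.

```rust
/// How an opening '[' should be interpreted.
#[derive(Debug, PartialEq, Clone, Copy)]
enum BracketKind {
    Subscript,
    QuotedIdentifier,
}

/// Hypothetical: '[' is a subscript when it immediately follows a word
/// character, ')' or ']' with no whitespace in between.
fn follows_subscriptable(prev_char: Option<char>) -> bool {
    match prev_char {
        Some(c) => c.is_alphanumeric() || c == '_' || c == ')' || c == ']',
        None => false, // '[' at the very start cannot be a subscript
    }
}

/// Hypothetical: classify every '[' in the input by tracking the previous char.
fn classify_open_brackets(sql: &str) -> Vec<BracketKind> {
    let mut kinds = Vec::new();
    let mut prev_char: Option<char> = None;
    for c in sql.chars() {
        if c == '[' {
            kinds.push(if follows_subscriptable(prev_char) {
                BracketKind::Subscript
            } else {
                BracketKind::QuotedIdentifier
            });
        }
        prev_char = Some(c);
    }
    kinds
}
```

Under these assumptions `col[1]`, `f(x)[0]` and `a[1][2]` are all subscripts, while the bracket in `SELECT [my col] FROM t` follows a space and stays a bracket-delimited identifier.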
- Redshift: ARRAY[col] inside function calls must parse as array literal, not as bracket-quoted identifier; bracket identifiers with spaces still work
- MySQL: INSERT...SELECT...UNION SELECT...ON DUPLICATE KEY UPDATE must not have its ON consumed by the deferred join ON clause logic
LiHRaM approved these changes on Mar 20, 2026