From 9d6c189858ba209a01e2b4a65e8a1701e011d3c8 Mon Sep 17 00:00:00 2001 From: Andy Grove Date: Thu, 30 Apr 2026 12:17:54 -0600 Subject: [PATCH] docs: explain Java vs Rust regexp engine differences in compatibility guide --- .../user-guide/latest/compatibility/regex.md | 77 ++++++++++++++++++- 1 file changed, 74 insertions(+), 3 deletions(-) diff --git a/docs/source/user-guide/latest/compatibility/regex.md b/docs/source/user-guide/latest/compatibility/regex.md index 4d9d5b650c..c93b68a29c 100644 --- a/docs/source/user-guide/latest/compatibility/regex.md +++ b/docs/source/user-guide/latest/compatibility/regex.md @@ -19,6 +19,77 @@ under the License. # Regular Expressions -Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's -regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but -this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`. +Spark evaluates regular expressions with Java's [`java.util.regex`] engine. Comet evaluates them with the Rust +[`regex`] crate. The two engines have different feature sets and different default behavior, so the same pattern +may produce different results, fail to compile, or match different input on each side. + +Because the differences are pattern-dependent and not always obvious, Comet falls back to Spark for any +expression that uses a regular expression by default. To opt in to native evaluation, set +`spark.comet.expression.regexp.allowIncompatible=true`. This is appropriate when you have verified that your +patterns are simple enough that the two engines agree, or when you are willing to accept the difference. + +The expressions affected by this fallback include `RLike`, `RegExpReplace`, `StringSplit`, and `Like` (which +Comet translates to a regular expression internally). + +## Why the engines differ + +Java's `java.util.regex` is a backtracking engine in the Perl/PCRE family. It supports the full range of +features that style of engine provides, including some whose worst-case running time grows exponentially with +the input. + +Rust's [`regex`] crate is a finite-automaton engine in the [RE2] family. It deliberately omits features that +cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs +in time linear in the size of the input. This is the same trade-off RE2, Go's `regexp`, and several other +engines make. + +The practical consequence is that Java accepts a strictly larger set of patterns than the Rust engine, and +several constructs that look the same in source have different semantics on the two sides. + +## Features supported by Java but not by the Rust engine + +Patterns that use any of the following will not compile in Comet's native engine and must run on Spark: + +- **Backreferences** such as `\1`, `\2`, or `\k`. The Rust engine has no backtracking and cannot match + a previously captured group. +- **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?...)`). +- **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not + possessive. +- **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. Rust accepts none of + these. + +## Features that exist on both sides but behave differently + +Even where both engines accept a construct, the matching behavior is not always the same. + +- **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, `\s`, and `.` are Unicode-aware by + default, so `\d` matches every digit codepoint defined by Unicode rather than only `0`–`9`. Java's defaults + match ASCII only and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to Unicode + semantics. The same pattern can therefore match a different set of characters on each side. +- **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line + separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF + mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition. +- **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding while + the Rust engine uses full Unicode simple case folding when Unicode mode is on. Patterns that match characters + outside ASCII can produce different results. +- **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket + expressions but not Java's `\p{Alpha}` shorthand. Java accepts both. Unicode property escapes (`\p{L}`, + `\p{Greek}`, etc.) are supported by both engines but cover slightly different sets of properties. +- **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. Rust uses + `\x{...}` for arbitrary codepoints and does not accept Java's bare `\uXXXX` form. +- **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty + strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different + rules, so split results can differ in edge cases involving empty matches even when the pattern itself is + identical on both sides. + +## When opting in is safe + +For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and +ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this +shape, enabling `spark.comet.expression.regexp.allowIncompatible=true` is generally safe. For anything that +uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling defaults, leave the +fallback in place. + +[`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html +[`regex`]: https://docs.rs/regex/latest/regex/ +[RE2]: https://github.com/google/re2/wiki/Syntax