Replace regex-based LIKE matching with a linear-time direct matcher by arnaud-lacurie · Pull Request #4235 · FoundationDB/fdb-record-layer

arnaud-lacurie · 2026-05-27T23:52:46Z

Summary

PatternForLikeValue was translating SQL LIKE patterns into Java regex strings (% → .*, _ → .), and LikeOperatorValue was compiling them via Pattern.compile() on every evaluated row. Java's NFA engine backtracks super-polynomially on patterns like %a%a%a%, which is a CPU-exhaustion risk for any caller that can submit a LIKE query.
Replace the entire regex pipeline with a two-pointer iterative LIKE matcher that runs in O(n·m) with no backtracking. PatternForLikeValue.eval() now emits a normalized LIKE pattern (%/_ as wildcards, \ as internal escape) rather than a regex string. Pattern.compile is removed entirely.
All 75 existing LikeOperatorValueTest cases pass unchanged.

PatternForLikeValue previously translated SQL LIKE patterns into Java regex strings (% → .*, _ → .) and LikeOperatorValue compiled them via Pattern.compile() per evaluated row. Java's NFA engine backtracks super-polynomially on patterns like %a%a%a%, making this a CPU-exhaustion vector for any caller that can submit a LIKE query. Replace the regex pipeline with a two-pointer iterative LIKE matcher that runs in O(n·m) time with no backtracking. PatternForLikeValue now emits a normalized LIKE pattern (% and _ as wildcards, \ as escape) instead of a regex string. Pattern.compile is gone entirely.

When text contained '%' at a position where the pattern also had '%', the literal equality branch fired before the wildcard branch, causing '%' in the pattern to be consumed as a literal match. Move the '%' wildcard check before the literal equality check and add regression test cases.

RobertBrunel · 2026-06-17T15:00:56Z

+    }
+
+    /**
+     * Linear-time SQL LIKE matcher. The {@code pattern} is a normalized LIKE pattern produced by


“Linear-time” is imprecise, as there are two inputs. Rather, it’s polynomial-time, O(n × m), where n is the length of text and m is the length of pattern.
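One way to see the O(n × m) bound, as a rough sketch rather than a proof. It assumes the matcher only ever restarts from the most recent % and moves that restart point forward by one text character per restart. The names r and S are introduced here for illustration and do not appear in the PR.

```latex
% n = |text|, m = |pattern|
% r = number of restarts after a mismatch, S = total loop iterations
\begin{align*}
r &\le n && \text{(each restart advances the restart point in the text by one)} \\
S &\le (r + 1)(m + 1) && \text{(between restarts, every iteration advances in the pattern)} \\
  &\le (n + 1)(m + 1) = O(n \cdot m)
\end{align*}
```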

RobertBrunel · 2026-06-17T15:04:15Z

+                final char pc = pattern.charAt(p);
+                if (pc == '\\') {
+                    final char literal = (p + 1 < pLen) ? pattern.charAt(p + 1) : 0;
+                    if (literal != 0 && literal == text.charAt(t)) {


Suggested change

-                    if (literal != 0 && literal == text.charAt(t)) {
+                    if (p + 1 < pLen && pattern.charAt(p + 1) == text.charAt(t)) {

This is more precise and more closely mirrors the while condition below. It also avoids potential issues when the 0 character validly appears in pattern (although I suppose that can’t practically happen in SQL patterns).
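To make the difference concrete, here is a minimal sketch that isolates the two checks. The class and method names are hypothetical and the surrounding matcher loop is omitted. Only the two conditions come from the diff and the suggestion.

```java
final class EscapeCheckSketch {
    // Sentinel variant from the diff: char 0 doubles as "nothing follows the escape".
    static boolean escapedMatchesSentinel(final String pattern, final int p, final char textChar) {
        final char literal = (p + 1 < pattern.length()) ? pattern.charAt(p + 1) : 0;
        return literal != 0 && literal == textChar;
    }

    // Bounds-check variant from the suggestion: no character value is reserved.
    static boolean escapedMatchesBounds(final String pattern, final int p, final char textChar) {
        return p + 1 < pattern.length() && pattern.charAt(p + 1) == textChar;
    }
}
```

The two variants agree on a dangling escape and on ordinary escaped characters. They differ only when the escaped character is char 0, which the sentinel variant can never match.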

RobertBrunel · 2026-06-17T15:09:33Z

+     * exactly one character, and {@code \} escapes the next character as a literal (so {@code \%},
+     * {@code \_}, and {@code \\} are literals). All other characters match themselves.
+     */
+    private static boolean matchLike(final String text, final String pattern) {


Add @Nonnull.

Maybe validate that there’s no trailing single \, which would be an invalid escape sequence. Also add a unit test verifying that this case is handled gracefully. (Or arguably, removing a trailing \ belongs in the pattern normalization code.)
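A possible shape for that validation, as a sketch only. The helper name hasDanglingEscape is hypothetical, and it assumes the normalized form described in the Javadoc, where \ always escapes the next character. A plain endsWith("\\") check would not be enough, because a pattern ending in \\ (an escaped backslash) is valid.

```java
final class DanglingEscapeSketch {
    // Returns true if the normalized pattern ends with a '\' that has nothing to escape.
    static boolean hasDanglingEscape(final String pattern) {
        int p = 0;
        while (p < pattern.length()) {
            if (pattern.charAt(p) == '\\') {
                if (p + 1 == pattern.length()) {
                    return true; // escape character with no following character
                }
                p += 2; // skip the escape and the character it protects
            } else {
                p++;
            }
        }
        return false;
    }
}
```

Whether a dangling escape should be rejected, stripped during normalization, or simply never match is a design decision this sketch does not settle.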

RobertBrunel · 2026-06-17T15:18:44Z

-                    .put(escapeChar + "%", "%")
+                    .put(escapeChar + "_", "\\_")
+                    .put(escapeChar + "%", "\\%")
                    .putAll(REPLACE_MAP)


Also add put(escapeChar + escapeChar, "\\\\").
(And add a unit test for this case if there isn’t one.)

Note that the order in which these replacements get applied also matters, e.g. when you have a pattern like \\% that can be interpreted in two ways. There could be (pre-existing) bugs lurking here!

It would also be nice if this whole find-and-replace thing could be removed. If this instead returned a record with both the pattern and the escape character, then we could pass that to the match function directly rather than needing to rewrite it here.
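A single left-to-right scan that takes the escape character explicitly would avoid the replacement-order ambiguity. The sketch below is hypothetical: the class and method names are made up, it targets the normalized form where \ is the internal escape, and it makes two assumptions the thread does not settle. An escape followed by any character makes that character a literal, and a dangling escape is rejected with an exception.

```java
final class LikeNormalizeSketch {
    // Converts a user LIKE pattern plus optional escape character into the internal form
    // ('%' and '_' are wildcards, '\' escapes the next character).
    static String normalize(final String sqlPattern, final Character escapeChar) {
        final StringBuilder out = new StringBuilder(sqlPattern.length() + 4);
        int i = 0;
        while (i < sqlPattern.length()) {
            final char c = sqlPattern.charAt(i);
            if (escapeChar != null && c == escapeChar) {
                if (i + 1 == sqlPattern.length()) {
                    // Assumption: reject rather than silently drop a dangling escape.
                    throw new IllegalArgumentException("dangling escape character");
                }
                appendLiteral(out, sqlPattern.charAt(i + 1));
                i += 2;
            } else if (c == '%' || c == '_') {
                out.append(c); // unescaped wildcard passes through
                i++;
            } else {
                appendLiteral(out, c);
                i++;
            }
        }
        return out.toString();
    }

    // Emits c so that the matcher treats it as a literal.
    private static void appendLiteral(final StringBuilder out, final char c) {
        if (c == '\\' || c == '%' || c == '_') {
            out.append('\\');
        }
        out.append(c);
    }
}
```

Because each input character is consumed exactly once, a pattern such as \\% with ESCAPE '\' has only one reading here: an escaped backslash followed by a wildcard.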

RobertBrunel · 2026-06-17T15:22:45Z

            return null;
        }
        Map<String, String> replaceMap;
        if (escapeChar == null) {


This whole block that builds the replaceMap seems worthwhile to extract into a static method, while we’re already touching it. Then it could be independently understood and unit-tested (though we don’t necessarily have to add such unit tests in this PR).

RobertBrunel · 2026-06-17T15:31:31Z

+     * exactly one character, and {@code \} escapes the next character as a literal (so {@code \%},
+     * {@code \_}, and {@code \\} are literals). All other characters match themselves.
+     */
+    private static boolean matchLike(final String text, final String pattern) {


Consider changing from String to char[] arrays (passing text.toCharArray() at the call site). That is more efficient since we can then access characters directly instead of String.charAt(). It’s a micro-optimization though.

RobertBrunel · 2026-06-17T15:33:25Z

+                        matched = true;
+                    }
+                } else if (pc == '%') {
+                    starP = p++;


Maybe skip consecutive % here:

while (p + 1 < pLen && pattern.charAt(p + 1) == '%') p++;

Though, rather, this should be simplified during pattern normalization.
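If collapsing were done during normalization, it could look roughly like the sketch below. The helper name collapsePercents is hypothetical, and the input is assumed to already be in the normalized form where \ escapes the next character, so an escaped \% must not be merged with a neighbouring wildcard.

```java
final class CollapseWildcardsSketch {
    // Collapses each run of unescaped '%' wildcards into a single '%'.
    static String collapsePercents(final String pattern) {
        final StringBuilder out = new StringBuilder(pattern.length());
        boolean prevWasWildcardPercent = false;
        int i = 0;
        while (i < pattern.length()) {
            final char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                // Copy the escape pair verbatim; an escaped '%' is a literal, not a wildcard.
                out.append(c).append(pattern.charAt(i + 1));
                prevWasWildcardPercent = false;
                i += 2;
                continue;
            }
            if (c == '%') {
                if (!prevWasWildcardPercent) {
                    out.append(c);
                }
                prevWasWildcardPercent = true;
            } else {
                out.append(c);
                prevWasWildcardPercent = false;
            }
            i++;
        }
        return out.toString();
    }
}
```

This only shortens the pattern. It does not change which strings match, so it is an optimization rather than a correctness fix.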


RobertBrunel · 2026-06-17T15:34:34Z

+                    matched = true;
+                }
+            }
+            if (!matched) {


It looks like the matched boolean could be avoided with the help of continue. That’s simpler (less state), though not sure if it’ll be more readable.

RobertBrunel · 2026-06-17T15:36:07Z

    }

    @Nullable
    public static Boolean likeOperation(final String lhs, final String rhs) {


While here, mark the arguments as @Nullable.

RobertBrunel · 2026-06-17T15:37:48Z

+
+    /**
+     * Linear-time SQL LIKE matcher. The {@code pattern} is a normalized LIKE pattern produced by
+     * {@link PatternForLikeValue}: {@code %} matches any sequence of characters, {@code _} matches


It’s not so nice that we have this “upwards” reference and logical dependency to PatternForLikeValue here. Can we move the normalization code into this file to make it self-contained? Not necessarily in scope of this PR though.

alecgrieser · 2026-06-17T15:24:48Z

+     * exactly one character, and {@code \} escapes the next character as a literal (so {@code \%},
+     * {@code \_}, and {@code \\} are literals). All other characters match themselves.
+     */
+    private static boolean matchLike(final String text, final String pattern) {


minor, but I think we need to mark these as @Nonnull.

alecgrieser · 2026-06-17T15:28:48Z

+                final char pc = pattern.charAt(p);
+                if (pc == '\\') {
+                    final char literal = (p + 1 < pLen) ? pattern.charAt(p + 1) : 0;
+                    if (literal != 0 && literal == text.charAt(t)) {


I believe this means that \<c> matches character <c>. That's true for \%, \\, and \_, but it would also mean that the pattern \n would match n rather than, say, the newline. If we care about that

alecgrieser · 2026-06-17T18:32:04Z

+            if (!matched) {
+                if (starP >= 0) {
+                    p = starP + 1;
+                    t = ++starT;


Maybe this could use a little more documentation. I believe the idea is that:

1. We match from left to right until we hit our first %.
2. After that, we keep consuming character by character. We greedily try to match each character to new characters in the pattern, but if we can't, then we revert back to our last % (effectively, matching that character with the wildcard).
3. This keeps going until we consume the entire text.
4. If we've consumed the entire pattern at the end, then we're done (with one extra check for trailing %s).

I believe this is equivalent to breaking up your pattern into alternating chunks of non-wildcard patterns and wildcards. From left to right, you try to match every wildcard to a minimum number of characters that still lets you match the next non-wildcard pattern.

I was trying to make sure there wasn't a weird edge case where backtracking beyond the last % actually is necessary. But I think it's fine. In particular, the key insight is that it only needs to find one possible match because you can always defer consuming such a match to the next pattern. It's a bit hand-wavy, but I think the insight is something like:

Suppose we have a pattern string p = p₁ + '%' + p₂ + '%' + p₃ where each pᵢ has no wildcards. Suppose there's a string t that matches the pattern p. That means that t = t₁ + y_a + t₂ + y_b + t₃ where each tᵢ matches pattern pᵢ. For the algorithm to hold, we should be able to take the minimal-length y_a such that we can construct a t₂ that matches p₂. Suppose we can't do that and still match the whole pattern, and we instead need to actually consume two such pattern-matching strings t_{2a} and t_{2b}. So, t = t₁ + y_a + (t_{2a} + y_c + t_{2b}) + y_b + t₃. But you can rearrange that as t = t₁ + y_a + t_{2a} + (y_c + t_{2b} + y_b) + t₃. And in this case, y_c + t_{2b} + y_b will be a minimal-length string that still allows us to match t₃ to p₃.

Then "just" use the principle of induction to extrapolate from a two-wildcard pattern to an n-wildcard pattern.
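The description above can be sketched in code as follows. This is an illustrative reconstruction, not the code under review. The names starP, starT and pLen mirror the fragments visible in the diff, while the class name and the overall loop structure are assumptions. It uses continue instead of a matched flag, checks % before literal equality, and uses the bounds-check form of the escape test.

```java
final class LikeMatcherSketch {
    // Matches text against a normalized LIKE pattern:
    // '%' = any sequence, '_' = any single character, '\' escapes the next character.
    static boolean matchLike(final String text, final String pattern) {
        final int tLen = text.length();
        final int pLen = pattern.length();
        int t = 0;       // next text position to consume
        int p = 0;       // next pattern position to match
        int starP = -1;  // position of the most recent '%' in the pattern, or -1 if none yet
        int starT = 0;   // text position up to which that '%' has absorbed characters

        while (t < tLen) {
            if (p < pLen) {
                final char pc = pattern.charAt(p);
                if (pc == '%') {
                    // Checked before literal equality, so a '%' in the text cannot
                    // consume the wildcard as if it were a literal.
                    starP = p++;
                    starT = t;
                    continue;
                }
                if (pc == '\\') {
                    // Escaped literal: '\' followed by any character matches that character.
                    if (p + 1 < pLen && pattern.charAt(p + 1) == text.charAt(t)) {
                        p += 2;
                        t++;
                        continue;
                    }
                } else if (pc == '_' || pc == text.charAt(t)) {
                    p++;
                    t++;
                    continue;
                }
            }
            // Mismatch, or pattern exhausted with text remaining.
            if (starP < 0) {
                return false; // no wildcard to fall back on
            }
            // Let the most recent '%' absorb one more character and retry after it.
            p = starP + 1;
            t = ++starT;
        }
        // Text consumed: only trailing '%' wildcards may remain in the pattern.
        while (p < pLen && pattern.charAt(p) == '%') {
            p++;
        }
        return p == pLen;
    }
}
```

Under this sketch, the escaped pair \n matches the letter n rather than a newline, and a dangling trailing \ never matches anything.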

alecgrieser · 2026-06-18T10:43:24Z

+                        matched = true;
+                    }
+                } else if (pc == '%') {
+                    starP = p++;


I'm not sure it's worth it to add a special case for that, as we'll just come around the next time round the loop and do the same thing. I'm also not sure it's worth doing more pattern normalization just for that, though I agree that in principle, consecutive runs of %s could be collapsed

alecgrieser · 2026-06-18T12:06:08Z

-                    .put(escapeChar + "%", "%")
+                    .put(escapeChar + "_", "\\_")
+                    .put(escapeChar + "%", "\\%")
                    .putAll(REPLACE_MAP)


I think it might be a little more complicated than that. I was trying to figure this out, but then I gave up, but it looks like when escapeChild is null, then it doesn't have any escape character. So that would imply that:

SELECT * FROM T WHERE T.str LIKE '\%'

Should match all strings prefixed by \ (unless the plan generator is inserting a default escape child value with \ somewhere). And the REPLACE_MAP in this case will turn the pattern into \\%, so that's fine. Whereas:

SELECT * FROM T WHERE T.str LIKE '\%' ESCAPE '\'

Should match just the percent sign. I think this is the behavior followed by Oracle (https://docs.oracle.com/cd/B13789_01/server.101/b10759/conditions016.htm) but not Postgres (https://www.postgresql.org/docs/7.3/functions-matching.html -- default is \ but the empty string removes escaping).

It does seem like double escape needs to match single escape. But that would mean adding put(escapeChar + escapeChar, escapeChar) unless escapeChar is \. But if the escape char is backslash, then we need to not include the default backslash to double backslash escape in REPLACE_MAP. I think?

arnaud-lacurie added the bug fix Change that fixes a bug label May 27, 2026

arnaud-lacurie marked this pull request as draft May 27, 2026 23:54

arnaud-lacurie added 3 commits June 1, 2026 14:32

Fix checkstyle: split multiple variable declarations onto separate lines

46f66fd

Simplify

5bf3a54

arnaud-lacurie marked this pull request as ready for review June 6, 2026 22:45

arnaud-lacurie requested review from RobertBrunel and alecgrieser June 6, 2026 22:45

RobertBrunel requested changes Jun 17, 2026

alecgrieser requested changes Jun 18, 2026
