Skip to content

Replace regex-based LIKE matching with a linear-time direct matcher#4235

Open
arnaud-lacurie wants to merge 4 commits into
FoundationDB:mainfrom
arnaud-lacurie:fix-like-redos
Open

Replace regex-based LIKE matching with a linear-time direct matcher#4235
arnaud-lacurie wants to merge 4 commits into
FoundationDB:mainfrom
arnaud-lacurie:fix-like-redos

Conversation

@arnaud-lacurie

Copy link
Copy Markdown
Collaborator

Summary

  • PatternForLikeValue was translating SQL LIKE patterns into Java regex strings (%.*, _.), and LikeOperatorValue was compiling them via Pattern.compile() on every evaluated row. Java's NFA engine backtracks super-polynomially on patterns like %a%a%a%, which is a CPU-exhaustion risk for any caller that can submit a LIKE query.
  • Replace the entire regex pipeline with a two-pointer iterative LIKE matcher that runs in O(n·m) with no backtracking. PatternForLikeValue.eval() now emits a normalized LIKE pattern (%/_ as wildcards, \ as internal escape) rather than a regex string. Pattern.compile is removed entirely.
  • All 75 existing LikeOperatorValueTest cases pass unchanged.

PatternForLikeValue previously translated SQL LIKE patterns into Java
regex strings (% → .*, _ → .) and LikeOperatorValue compiled them via
Pattern.compile() per evaluated row. Java's NFA engine backtracks
super-polynomially on patterns like %a%a%a%, making this a CPU-exhaustion
vector for any caller that can submit a LIKE query.

Replace the regex pipeline with a two-pointer iterative LIKE matcher that
runs in O(n·m) time with no backtracking. PatternForLikeValue now emits a
normalized LIKE pattern (% and _ as wildcards, \ as escape) instead of a
regex string. Pattern.compile is gone entirely.
@arnaud-lacurie arnaud-lacurie added the bug fix Change that fixes a bug label May 27, 2026
@arnaud-lacurie arnaud-lacurie marked this pull request as draft May 27, 2026 23:54
When text contained '%' at a position where the pattern also had '%', the
literal equality branch fired before the wildcard branch, causing '%' in
the pattern to be consumed as a literal match. Move the '%' wildcard check
before the literal equality check and add regression test cases.
@arnaud-lacurie arnaud-lacurie marked this pull request as ready for review June 6, 2026 22:45
}

/**
* Linear-time SQL LIKE matcher. The {@code pattern} is a normalized LIKE pattern produced by

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Linear-time” is imprecise, as there are two inputs. Rather, it’s polynomial-time, O(n × m), where n is the length of text and m is the length of pattern.

final char pc = pattern.charAt(p);
if (pc == '\\') {
final char literal = (p + 1 < pLen) ? pattern.charAt(p + 1) : 0;
if (literal != 0 && literal == text.charAt(t)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (literal != 0 && literal == text.charAt(t)) {
if (p + 1 < pLen && pattern.charAt(p + 1) == text.charAt(t)) {

This is more precise and more closely mirrors the while condition below. It also avoids potential issues when the 0 character validly appears in pattern (although I suppose that can’t practically happen in SQL patterns).

* exactly one character, and {@code \} escapes the next character as a literal (so {@code \%},
* {@code \_}, and {@code \\} are literals). All other characters match themselves.
*/
private static boolean matchLike(final String text, final String pattern) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Add @Nonnull.
  • Maybe validate that there’s no trailing single \, which would be an invalid escape sequence. Plus also unit-test that this case is handled gracefully. (Or arguably, removing trailing \ belongs into the pattern normalization code.)

.put(escapeChar + "%", "%")
.put(escapeChar + "_", "\\_")
.put(escapeChar + "%", "\\%")
.putAll(REPLACE_MAP)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also put(escapeChar + escapeChar, "\\\\").
(Plus add a unit case for this case if there isn’t.)

Note that the order in which these replacements get applied also matters, e.g. when you have a pattern like \\% that can be interpreted in two ways. There could be (pre-existing) bugs lurking here!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a little more complicated than that. I was trying to figure this out, but then I gave up, but it looks like when escapeChild is null, then it doesn't have any escape character. So that would imply that:

SELECT * FROM T WHERE T.str LIKE '\%'

Should match all strings prefixed by \ (unless the plan generator is inserting a default escape child value with \ somewhere). And the REPLACE_MAP in this case will turn the pattern into \\%, so that's fine. Whereas:

SELECT * FROM T WHERE T.str LIKE '\%' ESCAPE '\'

Should match just the percent sign. I think this is the behavior followed by Oracle (https://docs.oracle.com/cd/B13789_01/server.101/b10759/conditions016.htm) but not Postgres (https://www.postgresql.org/docs/7.3/functions-matching.html -- default is \ but the empty string removes escaping).

It does seem like double escape needs to match single escape. But that would mean adding put(escapeChar + escapeChar, escapeChar) unless escapeChar is \. But if the escape char is backslash, then we need to not include the default backslash to double backslash escape in REPLACE_MAP. I think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be nice if this whole find-and-replace thing could be removed. If this instead returned a record with both the pattern and the escape character, then we could pass that to the match function directly rather than needing to rewrite it here

return null;
}
Map<String, String> replaceMap;
if (escapeChar == null) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block that builds the replaceMap seems worthwhile to extract into a static method, while we’re already touching it. Then it could be independently understood and unit-tested (though we don’t necessarily have to add such unit tests in this PR).

* exactly one character, and {@code \} escapes the next character as a literal (so {@code \%},
* {@code \_}, and {@code \\} are literals). All other characters match themselves.
*/
private static boolean matchLike(final String text, final String pattern) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider changing from String to char[] arrays (passing text.toCharArray() at the call site). That is more efficient since we can then access characters directly instead of String.charAt(). It’s a micro-optimization though.

matched = true;
}
} else if (pc == '%') {
starP = p++;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe skip consecutive % here:

while (p + 1 < pLen && pattern.charAt(p + 1) == '%') p++;

Though, rather, this should be simplified during pattern normalization.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth it to add a special case for that, as we'll just come around the next time round the loop and do the same thing. I'm also not sure it's worth doing more pattern normalization just for that, though I agree that in principle, consecutive runs of %s could be collapsed

matched = true;
}
}
if (!matched) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the matched boolean could be avoided with the help of continue. That’s simpler (less state), though not sure if it’ll be more readable.

}

@Nullable
public static Boolean likeOperation(final String lhs, final String rhs) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While here, mark the arguments as @Nullable.


/**
* Linear-time SQL LIKE matcher. The {@code pattern} is a normalized LIKE pattern produced by
* {@link PatternForLikeValue}: {@code %} matches any sequence of characters, {@code _} matches

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not so nice that we have this “upwards” reference and logical dependency to PatternForLikeValue here. Can we move the normalization code into this file to make it self-contained? Not necessarily in scope of this PR though.

* exactly one character, and {@code \} escapes the next character as a literal (so {@code \%},
* {@code \_}, and {@code \\} are literals). All other characters match themselves.
*/
private static boolean matchLike(final String text, final String pattern) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but I think we need to mark these as @Nonnull.

final char pc = pattern.charAt(p);
if (pc == '\\') {
final char literal = (p + 1 < pLen) ? pattern.charAt(p + 1) : 0;
if (literal != 0 && literal == text.charAt(t)) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this means that \<c> matches character <c>. That's true for \%, \\, and \_, but it would also mean that the pattern \n would match n rather than, say, the newline. If we care about that

if (!matched) {
if (starP >= 0) {
p = starP + 1;
t = ++starT;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could use a little more documentation. I believe the idea is that:

  1. We match from left to right until we hit our first %
  2. After that, we keep consuming character by character. We greedily try to match each character to new characters in the pattern, but if we can't, then we revert back to our last % (effectively, matching that character with the wildcard)
  3. This keeps going until we consume the entire text
  4. If we've consumed the entire pattern at the end, then we're done (with one extra check for trailing %s)

I believe this is equivalent to breaking up your pattern into alternating chunks of non-wildcard patterns and wildcards. From left to right, you try to match every wildcard to a minimum number of characters that still lets you match the next non-wildcard pattern.

I was trying to make sure there wasn't a weird edge case where backtracking beyond the last % actually is necessary. But I think it's fine. In particular, the key insight is that it only needs to find one possible match because you can always defer consuming such a match to the next pattern. It's a bit hand-wavy, but I think the insight is something like:

Suppose we have a pattern string p = p1 + '%' + p2 + '%' + p3 where each of pi has no wild cards. Suppose there's a string t that matches the pattern p. That means that t = t1 + ya + t2 + yb + t3 where ti matches pattern pi. For the algorithm to hold, then we should be able to take the minimal length ya such that we can construct a t2 that matches p2. Suppose we can't do that and still match the whole pattern, and we instead need to actually consume two such pattern-matching strings t2 a and t2 b. So, t = t1 + ya + (t2 a + yc + t2 b) + yb + t3. But you can re-arrange that as t = t1 + ya + t2 a + (yc + t2 b + yb) + t3. And in this case, yc + t2 b + yb will be a minimal length string that still allows us to match t3 to p3

Then "just" use the principle of induction to extrapolate from a 2 wild card pattern to an n wildcard pattern.

matched = true;
}
} else if (pc == '%') {
starP = p++;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth it to add a special case for that, as we'll just come around the next time round the loop and do the same thing. I'm also not sure it's worth doing more pattern normalization just for that, though I agree that in principle, consecutive runs of %s could be collapsed

.put(escapeChar + "%", "%")
.put(escapeChar + "_", "\\_")
.put(escapeChar + "%", "\\%")
.putAll(REPLACE_MAP)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a little more complicated than that. I was trying to figure this out, but then I gave up, but it looks like when escapeChild is null, then it doesn't have any escape character. So that would imply that:

SELECT * FROM T WHERE T.str LIKE '\%'

Should match all strings prefixed by \ (unless the plan generator is inserting a default escape child value with \ somewhere). And the REPLACE_MAP in this case will turn the pattern into \\%, so that's fine. Whereas:

SELECT * FROM T WHERE T.str LIKE '\%' ESCAPE '\'

Should match just the percent sign. I think this is the behavior followed by Oracle (https://docs.oracle.com/cd/B13789_01/server.101/b10759/conditions016.htm) but not Postgres (https://www.postgresql.org/docs/7.3/functions-matching.html -- default is \ but the empty string removes escaping).

It does seem like double escape needs to match single escape. But that would mean adding put(escapeChar + escapeChar, escapeChar) unless escapeChar is \. But if the escape char is backslash, then we need to not include the default backslash to double backslash escape in REPLACE_MAP. I think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug fix Change that fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants