Replace regex-based LIKE matching with a linear-time direct matcher#4235
Replace regex-based LIKE matching with a linear-time direct matcher#4235arnaud-lacurie wants to merge 4 commits into
Conversation
PatternForLikeValue previously translated SQL LIKE patterns into Java regex strings (% → .*, _ → .) and LikeOperatorValue compiled them via Pattern.compile() per evaluated row. Java's NFA engine backtracks super-polynomially on patterns like %a%a%a%, making this a CPU-exhaustion vector for any caller that can submit a LIKE query. Replace the regex pipeline with a two-pointer iterative LIKE matcher that runs in O(n·m) time with no backtracking. PatternForLikeValue now emits a normalized LIKE pattern (% and _ as wildcards, \ as escape) instead of a regex string. Pattern.compile is gone entirely.
When text contained '%' at a position where the pattern also had '%', the literal equality branch fired before the wildcard branch, causing '%' in the pattern to be consumed as a literal match. Move the '%' wildcard check before the literal equality check and add regression test cases.
| } | ||
|
|
||
| /** | ||
| * Linear-time SQL LIKE matcher. The {@code pattern} is a normalized LIKE pattern produced by |
There was a problem hiding this comment.
“Linear-time” is imprecise, as there are two inputs. Rather, it’s polynomial-time, O(n × m), where n is the length of text and m is the length of pattern.
| final char pc = pattern.charAt(p); | ||
| if (pc == '\\') { | ||
| final char literal = (p + 1 < pLen) ? pattern.charAt(p + 1) : 0; | ||
| if (literal != 0 && literal == text.charAt(t)) { |
There was a problem hiding this comment.
| if (literal != 0 && literal == text.charAt(t)) { | |
| if (p + 1 < pLen && pattern.charAt(p + 1) == text.charAt(t)) { |
This is more precise and more closely mirrors the while condition below. It also avoids potential issues when the 0 character validly appears in pattern (although I suppose that can’t practically happen in SQL patterns).
| * exactly one character, and {@code \} escapes the next character as a literal (so {@code \%}, | ||
| * {@code \_}, and {@code \\} are literals). All other characters match themselves. | ||
| */ | ||
| private static boolean matchLike(final String text, final String pattern) { |
There was a problem hiding this comment.
- Add
@Nonnull. - Maybe validate that there’s no trailing single
\, which would be an invalid escape sequence. Plus also unit-test that this case is handled gracefully. (Or arguably, removing trailing\belongs into the pattern normalization code.)
| .put(escapeChar + "%", "%") | ||
| .put(escapeChar + "_", "\\_") | ||
| .put(escapeChar + "%", "\\%") | ||
| .putAll(REPLACE_MAP) |
There was a problem hiding this comment.
Also put(escapeChar + escapeChar, "\\\\").
(Plus add a unit case for this case if there isn’t.)
Note that the order in which these replacements get applied also matters, e.g. when you have a pattern like \\% that can be interpreted in two ways. There could be (pre-existing) bugs lurking here!
There was a problem hiding this comment.
I think it might be a little more complicated than that. I was trying to figure this out, but then I gave up, but it looks like when escapeChild is null, then it doesn't have any escape character. So that would imply that:
SELECT * FROM T WHERE T.str LIKE '\%'Should match all strings prefixed by \ (unless the plan generator is inserting a default escape child value with \ somewhere). And the REPLACE_MAP in this case will turn the pattern into \\%, so that's fine. Whereas:
SELECT * FROM T WHERE T.str LIKE '\%' ESCAPE '\'Should match just the percent sign. I think this is the behavior followed by Oracle (https://docs.oracle.com/cd/B13789_01/server.101/b10759/conditions016.htm) but not Postgres (https://www.postgresql.org/docs/7.3/functions-matching.html -- default is \ but the empty string removes escaping).
It does seem like double escape needs to match single escape. But that would mean adding put(escapeChar + escapeChar, escapeChar) unless escapeChar is \. But if the escape char is backslash, then we need to not include the default backslash to double backslash escape in REPLACE_MAP. I think?
There was a problem hiding this comment.
It would also be nice if this whole find-and-replace thing could be removed. If this instead returned a record with both the pattern and the escape character, then we could pass that to the match function directly rather than needing to rewrite it here
| return null; | ||
| } | ||
| Map<String, String> replaceMap; | ||
| if (escapeChar == null) { |
There was a problem hiding this comment.
This whole block that builds the replaceMap seems worthwhile to extract into a static method, while we’re already touching it. Then it could be independently understood and unit-tested (though we don’t necessarily have to add such unit tests in this PR).
| * exactly one character, and {@code \} escapes the next character as a literal (so {@code \%}, | ||
| * {@code \_}, and {@code \\} are literals). All other characters match themselves. | ||
| */ | ||
| private static boolean matchLike(final String text, final String pattern) { |
There was a problem hiding this comment.
Consider changing from String to char[] arrays (passing text.toCharArray() at the call site). That is more efficient since we can then access characters directly instead of String.charAt(). It’s a micro-optimization though.
| matched = true; | ||
| } | ||
| } else if (pc == '%') { | ||
| starP = p++; |
There was a problem hiding this comment.
Maybe skip consecutive % here:
while (p + 1 < pLen && pattern.charAt(p + 1) == '%') p++;
Though, rather, this should be simplified during pattern normalization.
There was a problem hiding this comment.
I'm not sure it's worth it to add a special case for that, as we'll just come around the next time round the loop and do the same thing. I'm also not sure it's worth doing more pattern normalization just for that, though I agree that in principle, consecutive runs of %s could be collapsed
| matched = true; | ||
| } | ||
| } | ||
| if (!matched) { |
There was a problem hiding this comment.
It looks like the matched boolean could be avoided with the help of continue. That’s simpler (less state), though not sure if it’ll be more readable.
| } | ||
|
|
||
| @Nullable | ||
| public static Boolean likeOperation(final String lhs, final String rhs) { |
There was a problem hiding this comment.
While here, mark the arguments as @Nullable.
|
|
||
| /** | ||
| * Linear-time SQL LIKE matcher. The {@code pattern} is a normalized LIKE pattern produced by | ||
| * {@link PatternForLikeValue}: {@code %} matches any sequence of characters, {@code _} matches |
There was a problem hiding this comment.
It’s not so nice that we have this “upwards” reference and logical dependency to PatternForLikeValue here. Can we move the normalization code into this file to make it self-contained? Not necessarily in scope of this PR though.
| * exactly one character, and {@code \} escapes the next character as a literal (so {@code \%}, | ||
| * {@code \_}, and {@code \\} are literals). All other characters match themselves. | ||
| */ | ||
| private static boolean matchLike(final String text, final String pattern) { |
There was a problem hiding this comment.
minor, but I think we need to mark these as @Nonnull.
| final char pc = pattern.charAt(p); | ||
| if (pc == '\\') { | ||
| final char literal = (p + 1 < pLen) ? pattern.charAt(p + 1) : 0; | ||
| if (literal != 0 && literal == text.charAt(t)) { |
There was a problem hiding this comment.
I believe this means that \<c> matches character <c>. That's true for \%, \\, and \_, but it would also mean that the pattern \n would match n rather than, say, the newline. If we care about that
| if (!matched) { | ||
| if (starP >= 0) { | ||
| p = starP + 1; | ||
| t = ++starT; |
There was a problem hiding this comment.
Maybe this could use a little more documentation. I believe the idea is that:
- We match from left to right until we hit our first
% - After that, we keep consuming character by character. We greedily try to match each character to new characters in the pattern, but if we can't, then we revert back to our last
%(effectively, matching that character with the wildcard) - This keeps going until we consume the entire text
- If we've consumed the entire pattern at the end, then we're done (with one extra check for trailing
%s)
I believe this is equivalent to breaking up your pattern into alternating chunks of non-wildcard patterns and wildcards. From left to right, you try to match every wildcard to a minimum number of characters that still lets you match the next non-wildcard pattern.
I was trying to make sure there wasn't a weird edge case where backtracking beyond the last % actually is necessary. But I think it's fine. In particular, the key insight is that it only needs to find one possible match because you can always defer consuming such a match to the next pattern. It's a bit hand-wavy, but I think the insight is something like:
Suppose we have a pattern string p = p1 + '%' + p2 + '%' + p3 where each of pi has no wild cards. Suppose there's a string t that matches the pattern p. That means that t = t1 + ya + t2 + yb + t3 where ti matches pattern pi. For the algorithm to hold, then we should be able to take the minimal length ya such that we can construct a t2 that matches p2. Suppose we can't do that and still match the whole pattern, and we instead need to actually consume two such pattern-matching strings t2 a and t2 b. So, t = t1 + ya + (t2 a + yc + t2 b) + yb + t3. But you can re-arrange that as t = t1 + ya + t2 a + (yc + t2 b + yb) + t3. And in this case, yc + t2 b + yb will be a minimal length string that still allows us to match t3 to p3
Then "just" use the principle of induction to extrapolate from a 2 wild card pattern to an n wildcard pattern.
| matched = true; | ||
| } | ||
| } else if (pc == '%') { | ||
| starP = p++; |
There was a problem hiding this comment.
I'm not sure it's worth it to add a special case for that, as we'll just come around the next time round the loop and do the same thing. I'm also not sure it's worth doing more pattern normalization just for that, though I agree that in principle, consecutive runs of %s could be collapsed
| .put(escapeChar + "%", "%") | ||
| .put(escapeChar + "_", "\\_") | ||
| .put(escapeChar + "%", "\\%") | ||
| .putAll(REPLACE_MAP) |
There was a problem hiding this comment.
I think it might be a little more complicated than that. I was trying to figure this out, but then I gave up, but it looks like when escapeChild is null, then it doesn't have any escape character. So that would imply that:
SELECT * FROM T WHERE T.str LIKE '\%'Should match all strings prefixed by \ (unless the plan generator is inserting a default escape child value with \ somewhere). And the REPLACE_MAP in this case will turn the pattern into \\%, so that's fine. Whereas:
SELECT * FROM T WHERE T.str LIKE '\%' ESCAPE '\'Should match just the percent sign. I think this is the behavior followed by Oracle (https://docs.oracle.com/cd/B13789_01/server.101/b10759/conditions016.htm) but not Postgres (https://www.postgresql.org/docs/7.3/functions-matching.html -- default is \ but the empty string removes escaping).
It does seem like double escape needs to match single escape. But that would mean adding put(escapeChar + escapeChar, escapeChar) unless escapeChar is \. But if the escape char is backslash, then we need to not include the default backslash to double backslash escape in REPLACE_MAP. I think?
Summary
PatternForLikeValuewas translating SQL LIKE patterns into Java regex strings (%→.*,_→.), andLikeOperatorValuewas compiling them viaPattern.compile()on every evaluated row. Java's NFA engine backtracks super-polynomially on patterns like%a%a%a%, which is a CPU-exhaustion risk for any caller that can submit a LIKE query.PatternForLikeValue.eval()now emits a normalized LIKE pattern (%/_as wildcards,\as internal escape) rather than a regex string.Pattern.compileis removed entirely.LikeOperatorValueTestcases pass unchanged.