Skip to content

eaftan/safere

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

386 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SafeRE

A linear-time regular expression matching library for Java.

SafeRE is a port of RE2 to Java, with significant performance optimization work to adapt to the JVM's performance characteristics and approach parity with java.util.regex. Unlike RE2, which uses POSIX leftmost-longest semantics, SafeRE matches java.util.regex semantics and provides a drop-in replacement for java.util.regex.Pattern and java.util.regex.Matcher.

An earlier port of RE2 to Java exists as RE2/J. RE2/J is valuable work, but it is substantially slower than java.util.regex on common workloads and does not provide a drop-in replacement API. SafeRE addresses both of these gaps.

SafeRE guarantees linear-time matching regardless of the pattern or input. It achieves this by using finite automata (DFA/NFA) instead of backtracking. Patterns that require exponential time in java.util.regex — such as a?{25}a{25} matched against a repeated 25 times — complete in microseconds with SafeRE.

Installation

SafeRE is available on Maven Central.

Maven:

<dependency>
  <groupId>org.safere</groupId>
  <artifactId>safere</artifactId>
  <version>0.1.0</version>
</dependency>

Gradle (Kotlin DSL):

implementation("org.safere:safere:0.1.0")

Gradle (Groovy DSL):

implementation 'org.safere:safere:0.1.0'

Quick Start

import org.safere.Pattern;
import org.safere.Matcher;

// Compile a pattern (thread-safe, reusable)
Pattern p = Pattern.compile("(\\w+)@(\\w+\\.\\w+)");

// Match against input
Matcher m = p.matcher("contact user@example.com for info");
if (m.find()) {
    System.out.println(m.group());   // "user@example.com"
    System.out.println(m.group(1));  // "user"
    System.out.println(m.group(2));  // "example.com"
}

SafeRE is a drop-in replacement for java.util.regex.Pattern and java.util.regex.Matcher. Just change your imports.

Development

SafeRE uses google-java-format through Spotless. To format Java sources, run:

mvn spotless:apply

CI checks formatting with:

mvn spotless:check

To have commits format Java sources automatically, enable the repo hooks once:

git config core.hooksPath .githooks

Why SafeRE?

java.util.regex uses a backtracking NFA that can exhibit exponential time complexity on certain patterns. This is a well-known class of ReDoS vulnerabilities. SafeRE eliminates this risk entirely.

Pattern SafeRE RE2/J RE2-FFM JDK SafeRE vs JDK
a?{10}a{10} vs aaaaaaaaaa 0.042 µs 1.72 µs 0.068 µs 9.5 µs 226×
a?{15}a{15} vs aaa... (15) 0.055 µs 3.73 µs 0.082 µs 388 µs 6,690×
a?{20}a{20} vs aaa... (20) 0.072 µs 6.64 µs 0.092 µs 15,389 µs 210,808×
a?{25}a{25} vs aaa... (25) 0.090 µs 10.15 µs 0.099 µs (hangs)

SafeRE grows linearly and is 41–113× faster than RE2/J. The JDK grows exponentially and hangs at n=25.

Features

  • Drop-in APIPattern and Matcher are drop-in replacements for java.util.regex
  • Linear-time guarantee — No input can cause catastrophic backtracking
  • Full Unicode — Operates on Unicode code points, supports \p{...} properties, Unicode-aware case folding
  • Named captures — Java-compatible (?<name>...) syntax
  • Multi-pattern matchingPatternSet matches multiple patterns simultaneously in a single pass
  • Five execution engines — OnePass, DFA, BitState, NFA, and reverse DFA, automatically selected per query

Comparison with RE2 Family

SafeRE is part of a family of linear-time regex libraries that share RE2's core algorithms. Here is how they compare:

Feature RE2 (C++) Go regexp RE2/J SafeRE
Language C++ Go Java Java
Linear-time guarantee
Full Unicode support
Submatch extraction
Named captures
DFA engine
NFA (Pike VM) engine
OnePass engine
BitState engine
Reverse DFA
Literal optimization
Multi-pattern matching ✅ (RE2::Set) ✅ (PatternSet)
Drop-in java.util.regex API
Java version 8+ 21+

Supported Syntax

SafeRE supports most of the syntax from java.util.regex:

Category Syntax
Literals a, \n, \t, \x{1F600}, \Q...\E
Character classes [abc], [a-z], [^0-9], .
Perl classes \d, \D, \s, \S, \w, \W
Unicode properties \p{L}, \p{IsHan}, \P{Digit}, \p{Lower}
Quantifiers *, +, ?, {n}, {n,}, {n,m}
Non-greedy *?, +?, ??, {n,m}?
Alternation a|b
Grouping (...), (?:...)
Named captures (?<name>...)
Anchors ^, $, \A, \z, \b, \B
Flags (?i), (?m), (?s), (?U)

Not Supported

These features violate the linear-time guarantee and are rejected at compile time with a clear error:

  • Backreferences (\1, \2, ...) — require exponential time
  • Lookahead / Lookbehind ((?=...), (?<=...), (?!...), (?<!...))
  • Possessive quantifiers (*+, ++, ?+)
  • Atomic groups ((?>...))

Additionally, the CANON_EQ flag is not supported. This flag enables matching based on Unicode canonical equivalence (e.g., treating a precomposed character the same as its decomposed form). It is rarely used and adds significant implementation complexity.

The Matcher.hitEnd() and Matcher.requireEnd() APIs are not supported. These methods expose details of the JDK backtracking engine's search order, including which alternatives and quantified paths the engine tried before stopping. SafeRE's linear-time engines explore possible states in lockstep instead, and exactly reproducing the JDK's observer state would require simulating backtracking-style path priority in cases that are incompatible with SafeRE's performance model. Direct use of these methods appears rare; they were primarily introduced for streaming-tokenizer use cases such as java.util.Scanner.

Semantic Compatibility with java.util.regex

SafeRE aims to match java.util.regex behavior exactly, except where doing so would conflict with the linear-time guarantee or where JDK behavior is suspected to be a bug:

  1. Matcher.hitEnd() and Matcher.requireEnd() are unsupported — These APIs report backtracking-engine end-state behavior rather than ordinary regex match semantics. Supporting them exactly would require preserving JDK search-order details that SafeRE deliberately avoids in order to guarantee linear-time matching.
  2. Nested repetition with captures — In patterns like (a)*$, JDK's backtracking engine leaks captures from failed starting positions. JDK is itself internally inconsistent here ((a)*$ vs (?:(a))*$ give different group 1 results). SafeRE follows NFA-correct semantics. See issue #52 for details.

Both SafeRE and java.util.regex use leftmost-first alternation semantics (the first alternate that matches wins), which differs from POSIX leftmost-longest. This means SafeRE is a drop-in replacement for java.util.regex for alternation behavior.

Flags

SafeRE supports the same flag constants as java.util.regex.Pattern:

Flag Value Description
CASE_INSENSITIVE 2 Case-insensitive matching
MULTILINE 8 ^ and $ match at line boundaries
DOTALL 32 . matches line terminators
UNICODE_CASE 64 Unicode-aware case folding
UNICODE_CHARACTER_CLASS 256 Unicode-aware \w, \d, \s
COMMENTS 4 Permit whitespace and # comments
LITERAL 16 Treat pattern as a literal string
UNIX_LINES 1 Only \n is a line terminator
Pattern p = Pattern.compile("hello", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

PatternSet: Multi-Pattern Matching

SafeRE includes PatternSet, a SafeRE-only feature that matches multiple patterns simultaneously in a single pass (neither java.util.regex nor RE2/J offers this):

PatternSet.Builder builder = new PatternSet.Builder(PatternSet.Anchor.UNANCHORED);
int id0 = builder.add("error.*timeout");
int id1 = builder.add("warning.*disk");
int id2 = builder.add("info.*startup");
PatternSet set = builder.compile();

List<Integer> matches = set.match("error: connection timeout");
// matches contains id0

Migrating from java.util.regex

SafeRE is designed as a drop-in replacement. In most cases, you only need to change your imports:

// Before
import java.util.regex.Pattern;
import java.util.regex.Matcher;

// After
import org.safere.Pattern;
import org.safere.Matcher;

Validating with safere-crosscheck

To verify that SafeRE behaves identically to java.util.regex in your application, use the safere-crosscheck module. It provides Pattern and Matcher classes that run both engines on every operation and throw an exception if results diverge:

// Crosscheck mode — just change the import
import org.safere.crosscheck.Pattern;
import org.safere.crosscheck.Matcher;

Every call is recorded in a trace. If a divergence is found, the exception includes the full trace for easy bug reporting. See safere-crosscheck/README.md for details.

Fuzz Testing

SafeRE also has local Jazzer fuzz targets in safere-fuzz. They use safere-crosscheck as the oracle and can run either as regression tests over checked-in seeds or as coverage-guided fuzzers with JAZZER_FUZZ=1.

For parser dialect work, SafeRE also has an explicit long-running character-class syntax sweep. It lives outside the ordinary JUnit path because it enumerates a large generated matrix against java.util.regex; run it when working on character-class parsing:

./run-exhaustive-sweep.sh CharacterClassDivergenceSweep \
  --output-dir=target/exhaustive-reports/character-class-sweep-full

Use generated-case ranges when debugging a specific matrix region:

./run-exhaustive-sweep.sh CharacterClassDivergenceSweep --range=:1000000 \
  --output-dir=target/exhaustive-reports/character-class-sweep-smoke

See TESTING.md for the full testing workflow.

What works unchanged

  • Pattern.compile(), Pattern.matches(), Pattern.quote()
  • Matcher.matches(), lookingAt(), find(), group(), start(), end()
  • replaceFirst(), replaceAll(), appendReplacement(), appendTail()
  • split(), asPredicate(), asMatchPredicate()
  • All flags: CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE, etc.
  • Replacement strings: $1, ${name}, \\, \$

What to watch for

  1. Backreferences (\1, \2) — not supported; will throw PatternSyntaxException at compile time.
  2. Lookahead / lookbehind ((?=...), (?<=...)) — not supported.
  3. Possessive quantifiers (*+, ++) — not supported.
  4. Matcher.hitEnd() and Matcher.requireEnd() are not available.
  5. Named captures support Java-compatible (?<name>...) syntax and Python-style (?P<name>...) syntax.

See Semantic Compatibility for minor edge-case differences.

Architecture

The processing pipeline mirrors RE2:

Pattern string → Parse → Simplify → Compile → Execute
                  ↓         ↓          ↓          ↓
               Regexp     Regexp      Prog     Engine
               (AST)    (simpler)  (bytecode)  (match)

SafeRE automatically selects the fastest engine for each query:

Engine When Used Capabilities
Literal Pattern is a plain string String.indexOf() — fastest
OnePass Pattern is unambiguous Single-pass with captures
DFA General patterns Fast boolean match, no captures
Reverse DFA Multi-find on long text Bounds match range for NFA
BitState Small text × program Captures via backtracking with visited bitmap
NFA Fallback Full Pike VM, handles everything

For find() on long texts, SafeRE uses a three-DFA sandwich (like RE2):

  1. Forward DFA confirms a match exists and finds the earliest match end
  2. Reverse DFA scans backward to find the match start
  3. Anchored forward DFA finds the actual match end
  4. NFA extracts captures on just the bounded [start, end] range

For a detailed architecture walkthrough, see DESIGN.md.

Building

Requires Java 21+ and Maven.

# Build and install (library + benchmarks)
mvn install

# Run tests
mvn test -pl safere

# Generate Javadoc
mvn javadoc:javadoc -pl safere

Benchmarks

SafeRE includes a JMH benchmark suite in the safere-benchmarks module, comparing SafeRE against java.util.regex (JDK), RE2/J, RE2-FFM (C++ RE2 via Java FFM API), C++ RE2, and Go regexp. The suite includes focused microbenchmarks, data-driven application workloads, scaling/pathological cases, replacement, memory, and PatternSet benchmarks. Application workloads live in safere-benchmarks/benchmark-data.json, where each case defines its operation semantics and expected result for the Java, C++, and Go harnesses.

Publication-Quality Benchmark Collection

To collect a full set of benchmark data for updating BENCHMARKS.md, run the collection script from the repository root:

./collect-benchmark-results.sh

To verify the collection pipeline without doing a full run:

./collect-benchmark-results.sh --smoke

The script runs the Java, C++ RE2, and Go benchmark batches sequentially, captures raw output, extracts native JSON-lines results, and generates merged markdown tables.

By default, results are written to a timestamped directory under benchmark-results/, and benchmark-results/latest is updated to point to the newest run.

When the run finishes, hand off the result directory to the agent that will update BENCHMARKS.md:

benchmark-results/latest

The important files in that directory are:

jmh-output.txt
cpp-results.jsonl
go-results.jsonl
merged-tables.md
java-memory.txt
java-pattern-memory.txt

Targeted Benchmark Runs

Always use the wrapper scripts — they run mvn install first to ensure the benchmark module picks up the latest SafeRE code. These are useful for development iteration or focused investigation; use ./collect-benchmark-results.sh for full publication-quality collection.

# Java benchmarks (throughput)
./run-java-benchmarks.sh                        # standard benchmarks
./run-java-benchmarks.sh RegexBenchmark         # specific class
./run-java-benchmarks.sh ApplicationBenchmark   # application workloads

# Java memory profiling (allocation rates via JMH GC profiler)
./run-java-memory-benchmarks.sh                 # all benchmarks
./run-java-memory-benchmarks.sh RegexBenchmark  # specific class

# Fast development iteration only — NOT for BENCHMARKS.md
./run-java-benchmarks.sh --quick RegexBenchmark

CrosscheckOverheadBenchmark is excluded from the no-argument Java benchmark run. It measures overhead in the safere-crosscheck facade and should be run explicitly only when optimizing crosscheck:

./run-java-benchmarks.sh --quick CrosscheckOverheadBenchmark

C++ RE2 and Go Benchmarks

The benchmark suite includes C++ RE2 and Go regexp harnesses for cross-language comparison. Prerequisites: CMake ≥ 3.14 + C++17 compiler (for C++), Go ≥ 1.21 (for Go). Dependencies are fetched automatically.

# C++ RE2 benchmarks
./run-cpp-benchmarks.sh                    # all C++ benchmarks
./run-cpp-benchmarks.sh Regex Application  # specific benchmark groups

# Go regexp benchmarks
./run-go-benchmarks.sh                     # all Go benchmarks
./run-go-benchmarks.sh Regex Application   # specific benchmark groups

Comparing Results Manually

A comparison script merges JMH, C++, and Go results into side-by-side markdown:

python3 safere-benchmarks/scripts/compare-benchmarks.py \
  --jmh jmh-output.txt --json cpp-results.jsonl go-results.jsonl

To verify that all harnesses emitted the application benchmark names defined in benchmark-data.json, add:

python3 safere-benchmarks/scripts/compare-benchmarks.py \
  --jmh jmh-output.txt --json cpp-results.jsonl go-results.jsonl \
  --benchmark-data safere-benchmarks/benchmark-data.json \
  --check-application-names

Latest Results

See BENCHMARKS.md for full results. Highlights:

Benchmark SafeRE JDK RE2/J RE2-FFM C++ RE2 Go vs JDK
Literal match 9 ns 13 ns 126 ns 55 ns 40 ns 78 ns 1.4× faster
Capture groups (3) 94 ns 75 ns 958 ns 329 ns 84 ns 311 ns 1.3× slower
Capture groups (10) 200 ns 235 ns 1,398 ns 737 ns 367 ns 600 ns 1.2× faster
Hard pattern (1 MB) 0.019 µs 43,773 µs 37,857 µs 152 µs 0.048 µs 25,445 µs 2.3M× faster
Pathological (n=20) 0.072 µs 15,389 µs 6.64 µs 0.092 µs 0.071 µs 3.07 µs 210,808× faster
Literal replaceFirst 30 ns 40 ns 147 ns 215 ns 98 ns 605 ns 1.3× faster

Summary (geometric mean of speed ratios):

vs Core workloads Pathological/scaling
JDK 1.1× slower 13,500× faster
RE2/J 11.5× faster 2,930× faster
RE2-FFM 2.1× faster 17.3× faster

License

This project is a Java port of RE2. It also incorporates code from RE2/J, a Java port of Go's regexp package.

RE2 is Copyright (c) 2009 The RE2 Authors. All rights reserved.

RE2/J is Copyright (c) 2009 The Go Authors. All rights reserved.

This project contains code derived from both RE2 and RE2/J and is licensed under the BSD 3-Clause License, consistent with both original projects.

Modifications and Java port: Copyright (c) 2026 Eddie Aftandilian.

See LICENSE for details.

Acknowledgments

This work builds directly on the design and implementation of RE2 by the RE2 authors, and on RE2/J by the Go authors.

  • RE2 — the C++ library whose design and algorithms SafeRE is based on
  • Go regexp — the Go standard library implementation of RE2
  • RE2/J — an earlier port of RE2 to Java. SafeRE's parser, Java API layer, and portions of the test suite are derived from RE2/J (see TESTING.md)
  • Russ Cox's article series on regular expression matching — explains the theory behind RE2's approach

About

A safe linear-time regular expression matching library for Java, based on RE2

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors