
DavisPL/many-regex


Many Regex logo

Can some linear-time regex engines be considered harmful? A runtime analysis of linear-time regex engines in the context of production software systems.

Quick Related Links

Introduction

Linear-time Regex engines are considered the gold standard for reducing the risk of Regular Expression Denial of Service (ReDoS) attacks. However, an engine that operates in linear time can, in theory, still harm a software system if the coefficient of its linear runtime is large enough. We investigate whether any linear-time Regex engine found in the literature or in libraries can be considered harmful in the context of production software systems, meaning that it causes a large enough stall at runtime.

ReDoS Found

Important

ReDoS Vulnerability Found! Use the following one-liner to run it if you have uv installed.

This code should time out, as matching appears to require exponentially many steps for this input.

uv run --with pyre2==0.3.10 python -c "import re2 as pyre2; pyre2.match('^(?=(a+)+b)\\w+$', 'a' * 50)"

You can also run run_pyre2_timeout_simple.py to see a proof of concept.
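Because a vulnerable match may never return, it is safer to run it in a child process that can be killed. The following is a minimal sketch of that idea and is not the repository's run_pyre2_timeout_simple.py. The helper name `match_with_timeout` and its `engine` parameter are hypothetical, and it defaults to the standard-library `re` so that it runs without pyre2 installed. Passing `engine="re2"` assumes pyre2 is installed and exposes `match(pattern, text)`, as the one-liner above does.

```python
import json
import subprocess
import sys

# Child program: import the requested engine module and run a single match.
# It prints "match" or "no match" so the parent can tell that it finished.
CHILD = """
import importlib, json, sys
engine, pattern, text = json.loads(sys.argv[1])
module = importlib.import_module(engine)
print("match" if module.match(pattern, text) else "no match")
"""


def match_with_timeout(pattern, text, engine="re", timeout=2.0):
    """Return "match", "no match", or "timeout" (hypothetical helper).

    The timeout covers the whole child process, including interpreter
    start-up, so it is a coarse wall-clock guard rather than a precise timer.
    """
    args = [sys.executable, "-c", CHILD, json.dumps([engine, pattern, text])]
    try:
        proc = subprocess.run(args, capture_output=True, text=True,
                              timeout=timeout, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "timeout"
    return proc.stdout.strip()
```

For example, `match_with_timeout(r"^(?=(a+)+b)\w+$", "a" * 50)` should report a timeout with the backtracking `re` engine, while the same call with `"a" * 5` finishes quickly.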

image

HATRA

I will be submitting to HATRA 2026.

Work to be done:

  1. Create a system to measure the following for a regex and input pair:
  • memory usage
  • regex match time
  • regex compile time (input independent)
  • AST depth (input independent)
  • number of loops (input independent)
  2. Determine the reality* and max* of the input + regex for the top 30 packages in two linear regex engines
  • reality is defined as the normal situation in which the regex is used
  • max is defined as the most extreme case of regex operation permitted by the surrounding code (e.g. input length truncation)
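The first work item could start from something like the sketch below. It is not the repository's measurement system. The function name `measure` is hypothetical, it covers only compile time, match time, and peak memory (not AST depth or number of loops), and it uses the standard-library `re` engine. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory used internally by a C or C++ engine may be undercounted.

```python
import re
import time
import tracemalloc


def measure(pattern, text):
    """Return compile time, match time, and peak traced memory for one pair."""
    re.purge()  # empty re's pattern cache so compilation is really timed
    start = time.perf_counter()
    compiled = re.compile(pattern)
    compile_seconds = time.perf_counter() - start

    # Time the match without tracing, since tracemalloc adds overhead.
    start = time.perf_counter()
    matched = compiled.match(text) is not None
    match_seconds = time.perf_counter() - start

    # Repeat the match under tracemalloc to record peak allocated bytes.
    tracemalloc.start()
    compiled.match(text)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "matched": matched,
        "compile_seconds": compile_seconds,  # input independent
        "match_seconds": match_seconds,
        "peak_bytes": peak_bytes,
    }
```

Calling `measure(r"^(a+)+$", "a" * 10)` returns a dictionary with the four fields above. Keep inputs small when using a backtracking engine, since this sketch has no timeout.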

Included

  1. Python code to run regex patterns against many different Python libraries (main.py, run_pyre2_timeout_simple.py, run_pyre2_timeout10_large.py, test_pyre2_on_36.py, etc.)

  2. C# code to test the default Dotnet Regex library and RE# (with full results)

  3. TypeScript code to test regex libraries under the Bun runtime

  4. Test cases JSON — the standardized ReDoS test cases used across Python, TypeScript, and C#

  5. Graphing tools to interpret and visualize the runtime output (graph.py, graph_scaling.py, graph_resh_results.py, results_table.py, etc.)

  6. JSON result data for each language and timeout setting (py_redos_test_results.json, ts_redos_test_results.json, csharp_redos_test_results.json, scaling tests, and timeout variants)

  7. Images — graphs, tables, and figures referenced throughout this README

    Preview of generated images

  8. A list of datasets for ReDoS

Roadmap

  • include Python libraries
  • include JavaScript / TypeScript libraries
  • include Go libraries
  • include Rust libraries
  • include Re# and Dotnet library
  • Vary input size and not just input pattern
  • Make table of Regex libraries
  • Collect more regex patterns from literature
  • Draft up poster for initial review
  • Make ReDoS test cases JSON
  • Run scaling on tests in ./python/run_pyre2_timeout10_large.py to check for exponential behavior

Harmfulness Scale

image

Libraries Tested

Name | Language | Claimed to be linear
Re | Python | No
Dotnet Regex | C# | No
Regex | Python | Reduces backtracking chance but no guarantee
Rure | Python | Yes "guarantees linear time"
Pyre2 | Python | Yes "guarantees linear-time behavior"
RE# | C# | Yes "the main matching algorithm has input-linear complexity both in theory as well as experimentally"
Regolith | JavaScript | Yes "guarantees linear time"
RegExp | Go | Yes "guaranteed to run in time linear"
Regex | Rust | Yes "worst time O(m*n)"

These libraries were picked after I searched for "linear time regex library python". Re2 was removed from the test because it could not be installed. Similarly, Regexy was archived and out of date, so it too was excluded.

I use Python's default "re" library as a control even though it does not claim to be linear time.

Experiments ToC

Tests 1 and 2 were run in Python only.

Test 1 -- Scaling Test

Methods

Each Regex pattern was run with input sizes from 0 to 30 on all 4 of the tested Regex libraries. In each graph, each line represents a different Regex library, and the y-axis shows time on a log scale with a hard timeout at 2 seconds. The regex patterns were created by asking Claude Sonnet 4.5 for regex patterns that may lead to catastrophic backtracking.
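A scaling run of this kind can be sketched as follows. This is an illustration under stated assumptions, not the code used for the graphs. The function name `scaling_times` is hypothetical, the input shape (n copies of "a" followed by a mismatching "!") is assumed, and the `engine` argument is assumed to mirror the `compile`/`match` API of the standard-library `re`. It also uses a soft time budget instead of the hard 2-second timeout described above: a slow match is not interrupted, the loop simply stops after it.

```python
import re
import time


def scaling_times(pattern, max_size=30, budget=2.0, engine=re):
    """Time one match per input size and return {size: seconds}.

    Stops early once a single match takes longer than `budget` seconds,
    because larger inputs would be expected to take at least as long.
    """
    compiled = engine.compile(pattern)  # compile once, outside the timed region
    times = {}
    for n in range(max_size + 1):
        text = "a" * n + "!"  # assumed input: n matching characters, then a mismatch
        start = time.perf_counter()
        compiled.match(text)
        times[n] = time.perf_counter() - start
        if times[n] > budget:
            break  # soft limit: the slow match above already ran to completion
    return times
```

With the standard-library `re` and a pattern such as `^(a+)+$`, the loop is likely to stop well before size 30, which is the behavior the log-scale graphs make visible.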

Here is an example of one of the tests where both Regex and Re can be considered harmful.

test_4_performance
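One way to read a curve like this numerically is to compare two straight-line fits: log(time) against n (exponential growth) and log(time) against log(n) (polynomial growth). The sketch below is only a heuristic and is not part of the repository. The names `classify_growth` and `_fit` are hypothetical, and real timings are noisy, especially for very small inputs where timer resolution dominates.

```python
import math


def _fit(xs, ys):
    """Least-squares line through (xs, ys); return (slope, R squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, r2


def classify_growth(sizes, seconds):
    """Guess whether timings grow exponentially or polynomially with size.

    Returns ("exponential", factor per extra character) or
    ("polynomial", estimated exponent).
    """
    points = [(n, t) for n, t in zip(sizes, seconds) if n > 0 and t > 0]
    if len(points) < 3:
        raise ValueError("need at least three positive measurements")
    ns = [n for n, _ in points]
    log_n = [math.log(n) for n in ns]
    log_t = [math.log(t) for _, t in points]

    exp_slope, exp_r2 = _fit(ns, log_t)       # model: log t = a + b * n
    poly_slope, poly_r2 = _fit(log_n, log_t)  # model: log t = a + k * log n
    if exp_r2 > poly_r2:
        return "exponential", math.exp(exp_slope)
    return "polynomial", poly_slope
```

Timings that double with every extra character would be classified as exponential with a factor near 2, while timings proportional to n would come out as polynomial with an exponent near 1.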

Here is a list of each test run that links to its corresponding graph.

  1. Nested quantifiers (^(a+)+$)
  2. Nested quantifiers with Kleene star (^(a*)*$)
  3. Nested quantifiers with mismatch (^(a+)+b$)
  4. Alternation with overlapping patterns (^(a|a)*$)
  5. Alternation with prefix overlap (^(a|ab)*$)
  6. Multiple alternations ((a|a|a|a|a|b)*c)
  7. Triple nested groups (^((a+)+)+$)
  8. Nested Kleene star with suffix (^(a*)*b$)
  9. Nested plus with suffix (^(a+)*b$)
  10. Email-like pattern (ReDoS)
  11. Overlapping character classes lowercase (^([a-z]+)+[A-Z]$)
  12. Overlapping character classes alphanumeric (^([0-9a-z]+)+[A-Z]$)
  13. Wildcard nested quantifiers (^(.*)*$)
  14. Wildcard plus nested (^(.+)+$)
  15. Wildcard with suffix (^(.*)+b$)
  16. Multiple overlapping quantifiers (^(a*)+b$)
  17. Optional nested quantifiers (^(a?)+b$)
  18. Non-greedy nested quantifiers (^(a*?)*b$)
  19. Word boundary catastrophic (^(\w+\s*)+$)
  20. Word with spaces pattern (^([\w]+[\s]*)*$)
  21. Digit nested plus (^(\d+)+$)
  22. Digit nested star (^([0-9]+)*$)
  23. Complex alternation plus (^(a+|a+)+$)
  24. Complex alternation star (^(a*|a*)*$)
  25. Alternation with length variation (^(aa+|a+)+$)
  26. URL pattern (simplified)
  27. Whitespace with letters (^(\s*a+\s*)+$)
  28. Whitespace alternation (^(\s+|a+)*b$)
  29. Optional group patterns (^(a+)?b?(a+)?$)
  30. Optional with nested groups (^(a+b?)+c$)
  31. Character class repetition (^([a-zA-Z]+)*$)
  32. Alphanumeric with symbol (^([a-z0-9]+)+[!]$)
  33. Nested alternation simple (^((a|b)+)+c$)
  34. Nested alternation overlap (^((a|ab)+)+c$)
  35. Long repeating with suffix (^(a+b)+c$)
  36. Repeating pattern variation (^(ab+)+c$)

Results

Name | Language | Claimed to be linear | Found to be harmful | Quantity of harmful results (out of 36)
Re | Python | No | Yes | 25
Rure | Python | Yes "guarantees linear time" | No | 0
Regex | Python | Reduces backtracking chance but no guarantee | Yes | 1
Pyre2 | Python | Yes "guarantees linear-time behavior" | No | 0

Test 2 -- Preliminary Results

This was the first test I ran, in which each pattern was run with a single input size. These results are preliminary and were meant to check whether my method for running regex patterns was reasonable.

regex_benchmark_comparison regex_benchmark_line_chart

Test 3 -- Dotnet & RE# Test

We run Program.cs with dotnet run. This test runs 113 test cases against both the RE# library and the default Dotnet Regex library. RE# has zero cases that can be considered harmful, while the Dotnet Regex library has 75. These results are expected, as the Dotnet Regex library does not claim to be linear-time and RE# does.

RE# Results

Included are the full results.

Test 4 -- Check Python, TypeScript (bun runtime), and C# (.NET)

I standardized the tests into a JSON file called test_cases.json and changed the Python, TS, and C# runners to read their test cases from this file. I ran each language on these test cases to produce py_redos_test_results.json, ts_redos_test_results.json, and csharp_redos_test_results.json. I then created results_table.py, which produces a few graphs and tables.

Results Table

A few takeaways:

  1. C# Regex is very vulnerable to ReDoS compared to the libraries tested in the other languages, failing 40 test cases in each of the 3 runs
  2. We did not find evidence that any library that claimed to be linear-time can be considered harmful

Notes

I had an issue installing https://pypi.org/project/re2.

I found a pull request from one of the authors of resharp that optimizes the dotnet regex library: dotnet/runtime#102655

The source code for resharp has been moved or removed https://github.com/ieviev/resharp

You can install it from the library website https://www.nuget.org/packages/Resharp

About

A Regex execution engine that tests a pattern with many different engines
