francose/bidi_poc

Encoding Attack Vectors — study + runnable PoCs

A reference repo on how character-encoding, Unicode, and string-truncation pitfalls turn into real CVEs. The material comes in three parts:

  1. A markdown writeup per attack vector in attack_vectors/: diagram, mechanism, mitigation.
  2. Self-contained runnable PoCs in pocs/. Each one builds a sample, runs it, and prints the divergence between "what a reviewer sees" and "what the runtime does".
  3. A defender script (pocs/99_detect_bidi.py) that scans source trees for bidi / zero-width / homograph characters.

Headline vector: Bidi Trojan Source (CVE-2021-42574). The repo name predates the rest of the content.
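
For readers who want the core idea before running anything, here is a minimal sketch of how bidi controls can be spotted in a line of source. It is not the code from pocs/99_detect_bidi.py; the function name `find_bidi_controls`, the character set, and the sample line are assumptions made for illustration, and the set covers only the embedding, override, and isolate controls.

```python
import unicodedata

# Embedding, override, and isolate controls (a partial, illustrative set).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(line: str) -> list[tuple[int, str]]:
    """Return (column, Unicode name) for each bidi control found in the line."""
    return [
        (col, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for col, ch in enumerate(line)
        if ch in BIDI_CONTROLS
    ]

# Hypothetical line: the string literal carries invisible direction controls,
# so an editor may render it differently from what the interpreter parses.
suspicious = 'role = "user\u202e \u2066# admin check\u2069 \u2066"'
clean = 'role = "user"'

for label, line in (("clean", clean), ("suspicious", suspicious)):
    print(label, find_bidi_controls(line))
```

The two strings may look similar once rendered, but only the clean one yields an empty list.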


Quick start

git clone https://github.com/francose/bidi_poc.git
cd bidi_poc

# Runs the marquee attack end-to-end and shows the runtime divergence
python pocs/01_bidi_trojan_source.py

# Scan a tree for invisible characters
python pocs/99_detect_bidi.py samples/

# Self-test the scanner
python pocs/99_detect_bidi.py --self-test

# Operator-side: spray encoded payload variants
python tools/encode_payload.py "<script>alert(1)</script>" --only url2,html-dec

# Generate phishing-relevant lookalikes for a domain
python tools/lookalike_domain.py paypal.com --only homograph,typo-omit

No external dependencies — Python 3.10+ stdlib only.


What's in the box

Study guide

  • encoding_attack_vectors_study.md — the top-level reference: 8 attack categories with diagrams, mitigations, and study questions.
  • attack_vectors/ — one short markdown per category, for quick lookup or linking from issues / PRs:
    • 01_buffer_overflow.md
    • 02_null_byte_injection.md
    • 03_homograph_attacks.md
    • 04_utf8_overlong_encoding.md
    • 05_bom_injection.md
    • 06_memory_disclosure.md
    • 07_packed_decimal_attacks.md
    • 08_double_encoding.md
    • 09_bidi_trojan_source.md

Operator tools (red-team / authorized testing)

| Script | Output |
| --- | --- |
| tools/encode_payload.py | Payload encoding spray: 13 variant families (url/url2/url3, html-dec, html-hex, unicode-esc, hex-esc, overlong UTF-8, null-suffix, bidi-wrap, space-tab, mixed-case, base64). One line per variant, pipe straight into Burp Intruder / ffuf -w - / wfuzz. |
| tools/lookalike_domain.py | Phishing domain generator: homograph, typo (omit/swap/double/replace), TLD swap, bitsquat, hyphenate. Emits punycode/IDNA form for each non-ASCII candidate. CSV mode for tracking. |
| tools/filename_bypass.py | File-upload allowlist bypass: null-byte (\x00, %00, %2500), multi-extension (Apache mod_mime), trailing dot/space (Windows), Unicode dot lookalikes, overlong UTF-8 dot, bidi-reversed extension. |
| tools/bidi_inject.py | Source-code planter: comment-veil mode hides executable code behind a fake C-style /* comment */; string-stretch mode hides extra content inside a string literal. For supply-chain / code-review evasion research. |
| tools/typosquat_package.py | Package-name typosquat generator (PyPI / npm / RubyGems / crates). Optional live registry probe (--check pypi) reports availability per candidate. |

All tools are stdlib-only, single-file, designed for piping. Use with authorization — same MIT/disclaimer terms as the rest of the repo.
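
To make the idea of an encoding spray concrete, the sketch below produces three variants of one payload. The family names url, url2, and html-dec come from the list above, but what each family emits here is an assumption; tools/encode_payload.py may encode differently, and the `variants` helper is hypothetical.

```python
from urllib.parse import quote

def variants(payload: str) -> dict[str, str]:
    """Build a few encoded forms of one payload (illustrative subset)."""
    url_once = quote(payload, safe="")           # percent-encode reserved chars
    return {
        "url": url_once,
        "url2": quote(url_once, safe=""),        # encode the '%' signs again
        "html-dec": "".join(f"&#{ord(c)};" for c in payload),  # decimal entities
    }

# One line per variant, so the output can be piped into a fuzzer wordlist.
for name, encoded in variants("<script>").items():
    print(f"{name}\t{encoded}")
```

Each variant decodes back to the same payload, which is why a filter that handles only one form can miss the others.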

Runnable PoCs

Each script prints a clearly labelled walkthrough — encoded bytes, filter view, runtime view, and the gap between them.

| Script | What it demonstrates |
| --- | --- |
| pocs/00_encoding_inspector.py | Side-by-side UTF-8 / UTF-16 / UTF-32 / Packed-Decimal byte layout for any input string. |
| pocs/01_bidi_trojan_source.py | Stretched-string Trojan Source attack: rendered source shows a benign user-role check, runtime grants admin. |
| pocs/02_homograph.py | Cyrillic / Greek lookalikes: Python identifier shadow + paypal.com vs paypаl.com. |
| pocs/03_overlong_utf8.py | IIS-style filter bypass: overlong-encoded '/' passes a byte-level filter, a permissive decoder normalises it back. |
| pocs/04_null_byte.py | Python-vs-C length-of-string asymmetry: allow-listed secret.txt opens /path/to/secret at the syscall layer. |
| pocs/05_double_encoding.py | Two-stage decode pipeline: %253Cscript%253E slips past a WAF that decodes once, lands at the origin as <script>. |
| pocs/99_detect_bidi.py | Defender tool: scans paths for bidi controls, zero-width characters, and ASCII-confusable Unicode letters. CI-friendly exit codes. |
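
As one example of the "filter view versus runtime view" gap, the two-stage decode pipeline can be sketched in a few lines. This is not the code from pocs/05_double_encoding.py; the blocklist, `waf_allows`, and `origin_view` are hypothetical stand-ins for a WAF that decodes once and an origin that decodes twice.

```python
from urllib.parse import unquote

BLOCKLIST = ("<script",)  # hypothetical WAF rule

def waf_allows(raw: str) -> bool:
    """Filter view: percent-decode once, then look for blocked tokens."""
    decoded_once = unquote(raw).lower()
    return not any(token in decoded_once for token in BLOCKLIST)

def origin_view(raw: str) -> str:
    """Runtime view: assume the framework decodes, then the app decodes again."""
    return unquote(unquote(raw))

payload = "%253Cscript%253E"
print("WAF allows :", waf_allows(payload))
print("origin sees:", origin_view(payload))
```

The single-encoded form `%3Cscript%3E` is blocked by the same filter, so the bypass depends entirely on the mismatch in decode depth.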

Samples

  • samples/trojan_sample.py — generated by PoC 01. Open it in any editor and compare the rendered text with the raw bytes to feel the attack.
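
If no hex viewer is at hand, one way to compare rendered text with what is actually stored is to escape everything that is not printable ASCII. The `reveal` helper below is a hypothetical sketch, not part of the repo.

```python
def reveal(text: str) -> str:
    """Escape every non-ASCII or non-printable character so it becomes visible."""
    out = []
    for ch in text:
        if ch.isascii() and (ch.isprintable() or ch in "\n\t"):
            out.append(ch)                       # plain ASCII stays as-is
        else:
            cp = ord(ch)
            out.append(f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}")
    return "".join(out)

# Illustrative input; a real run would pass the contents of the sample file.
print(reveal('role = "user\u202e"'))
```

To use it on the generated sample, read the file with `encoding="utf-8"` and pass the resulting string to `reveal`.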

Why this exists

Encoding bugs keep recurring across:

  • Source-code review (Trojan Source, homograph identifiers)
  • Input validation (overlong UTF-8, null-byte path bypass, double encoding)
  • Filesystem / shell / SQL boundaries (NUL truncation, normalisation gaps)
  • Phishing and supply-chain attacks (homograph domains, IDN spoofing)
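
For the homograph case in the list above, a rough mixed-script check is enough to show the mechanism. The sketch guesses a script from the first word of each letter's Unicode name, which is a crude heuristic and an assumption of this example; it is not how the repo's scanner or any browser decides what to flag.

```python
import unicodedata

def scripts_used(name: str) -> set[str]:
    """Crude script guess: first word of each letter's Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in name if ch.isalpha()}

genuine = "paypal.com"
spoofed = "payp\u0430l.com"  # U+0430 CYRILLIC SMALL LETTER A replaces Latin 'a'

for domain in (genuine, spoofed):
    # Python's built-in idna codec emits the ASCII (punycode) form.
    print(sorted(scripts_used(domain)), domain.encode("idna"))
```

The two domains can render identically, yet the spoofed one mixes scripts and encodes to an `xn--` label.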

The mitigations are well-known. The bugs keep shipping. This repo exists to make the mechanism concrete enough that you don't have to re-derive "why does 0xC0 0xAF matter" the next time you see it in a PR.
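
As a short illustration of why 0xC0 0xAF matters: those two bytes are an overlong encoding of '/' (U+002F). A strict UTF-8 decoder rejects them, while a decoder that applies the two-byte formula without a minimum-value check maps them back to a slash. The lax decoder below is a hypothetical stand-in written for this sketch, not code from the repo.

```python
OVERLONG_SLASH = b"\xc0\xaf"

def strict_decode_fails(data: bytes) -> bool:
    """Python's UTF-8 codec is strict and refuses overlong sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def lax_decode_2byte(data: bytes) -> str:
    """Hypothetical permissive decoder: 110xxxxx 10yyyyyy, no overlong check."""
    lead, cont = data
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 pattern")
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

print("byte filter sees '/':", b"/" in OVERLONG_SLASH)
print("strict decoder rejects:", strict_decode_fails(OVERLONG_SLASH))
print("lax decoder yields:", repr(lax_decode_2byte(OVERLONG_SLASH)))
```

A byte-level filter looking for 0x2F finds nothing, yet the lax decoder hands a '/' to whatever runs next.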


License

MIT. Educational / authorized-research use only — do not point the PoCs at systems you don't own.
