A reference repo on how character-encoding, Unicode, and string-truncation pitfalls turn into real CVEs. Every attack vector has:
- A markdown writeup in `attack_vectors/`: diagram, mechanism, mitigation.
- A self-contained runnable PoC in `pocs/` that builds a sample, runs it, and prints the divergence between "what a reviewer sees" and "what the runtime does".
- A defender script (`pocs/99_detect_bidi.py`) that scans source trees for bidi / zero-width / homograph characters.

Headline vector: Bidi Trojan Source (CVE-2021-42574). The repo name predates the rest of the content.
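
To make the scanning idea concrete, here is a minimal sketch of a detector for bidi controls and zero-width characters. It is not the code in `pocs/99_detect_bidi.py`; the names `BIDI_CONTROLS`, `ZERO_WIDTH`, and `find_suspicious` are hypothetical, and the zero-width list is an illustrative subset.

```python
import unicodedata

# Bidi embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069),
# the control characters abused in Trojan Source (CVE-2021-42574).
BIDI_CONTROLS = {chr(c) for c in range(0x202A, 0x202F)} | {
    chr(c) for c in range(0x2066, 0x206A)
}
# A few common zero-width characters; illustrative, not exhaustive.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def find_suspicious(text):
    """Return (line, column, codepoint, name) for each flagged character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS or ch in ZERO_WIDTH:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits


# Hypothetical input: a right-to-left override hidden inside a string literal.
sample = 'role = "user\u202e" # admin'
for hit in find_suspicious(sample):
    print(hit)
```

A real scanner would also need to walk a directory tree, handle undecodable files, and flag confusable letters, none of which this sketch attempts.
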

    git clone https://github.com/francose/bidi_poc.git
    cd bidi_poc

    # Runs the marquee attack end-to-end and shows the runtime divergence
    python pocs/01_bidi_trojan_source.py

    # Scan a tree for invisible characters
    python pocs/99_detect_bidi.py samples/

    # Self-test the scanner
    python pocs/99_detect_bidi.py --self-test

    # Operator-side: spray encoded payload variants
    python tools/encode_payload.py "<script>alert(1)</script>" --only url2,html-dec

    # Generate phishing-relevant lookalikes for a domain
    python tools/lookalike_domain.py paypal.com --only homograph,typo-omit

No external dependencies: Python 3.10+ stdlib only.

- `encoding_attack_vectors_study.md`: the top-level reference, covering 8 attack categories with diagrams, mitigations, and study questions.
- `attack_vectors/`: one short markdown file per category, for quick lookup or linking from issues / PRs:
  - `01_buffer_overflow.md`
  - `02_null_byte_injection.md`
  - `03_homograph_attacks.md`
  - `04_utf8_overlong_encoding.md`
  - `05_bom_injection.md`
  - `06_memory_disclosure.md`
  - `07_packed_decimal_attacks.md`
  - `08_double_encoding.md`
  - `09_bidi_trojan_source.md`

| Script | Output |
|---|---|
| `tools/encode_payload.py` | Payload encoding spray: 13 variant families (url/url2/url3, html-dec, html-hex, unicode-esc, hex-esc, overlong UTF-8, null-suffix, bidi-wrap, space-tab, mixed-case, base64). One line per variant, pipe straight into Burp Intruder / `ffuf -w -` / `wfuzz`. |
| `tools/lookalike_domain.py` | Phishing domain generator: homograph, typo (omit/swap/double/replace), TLD swap, bitsquat, hyphenate. Emits punycode/IDNA form for each non-ASCII candidate. CSV mode for tracking. |
| `tools/filename_bypass.py` | File-upload allowlist bypass: null-byte (`\x00`, `%00`, `%2500`), multi-extension (Apache mod_mime), trailing dot/space (Windows), Unicode dot lookalikes, overlong UTF-8 dot, bidi-reversed extension. |
| `tools/bidi_inject.py` | Source-code planter: comment-veil mode hides executable code behind a fake C-style `/* comment */`; string-stretch mode hides extra content inside a string literal. For supply-chain / code-review evasion research. |
| `tools/typosquat_package.py` | Package-name typosquat generator (PyPI / npm / RubyGems / crates). Optional live registry probe (`--check pypi`) reports availability per candidate. |

All tools are stdlib-only, single-file, designed for piping. Use with authorization — same MIT/disclaimer terms as the rest of the repo.
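
As a rough illustration of the homograph-plus-punycode idea from the table, the sketch below swaps one ASCII letter at a time for a Cyrillic lookalike and prints the IDNA form. It is not taken from `tools/lookalike_domain.py`; the `CONFUSABLES` table and `homograph_candidates` are hypothetical, and it assumes a dotted domain such as `paypal.com`.

```python
# Hypothetical confusables table: ASCII letter -> Cyrillic lookalike.
CONFUSABLES = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
}


def homograph_candidates(domain):
    """Yield (unicode_form, idna_form), swapping one ASCII letter at a time."""
    # Assumes the input contains at least one dot (name + TLD).
    label, _, tld = domain.rpartition(".")
    for i, ch in enumerate(label):
        if ch in CONFUSABLES:
            spoofed = label[:i] + CONFUSABLES[ch] + label[i + 1:]
            unicode_form = f"{spoofed}.{tld}"
            # The stdlib "idna" codec (IDNA 2003) emits xn-- labels.
            yield unicode_form, unicode_form.encode("idna").decode("ascii")


for shown, wire in homograph_candidates("paypal.com"):
    print(f"{wire}  (renders like {shown!r})")
```

The `xn--` form is what resolvers and certificates see, so comparing it with the rendered form is a quick way to spot a spoofed label.
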

Each script prints a clearly labelled walkthrough: encoded bytes, filter view, runtime view, and the gap between them.

| Script | What it demonstrates |
|---|---|
| `pocs/00_encoding_inspector.py` | Side-by-side UTF-8 / UTF-16 / UTF-32 / Packed-Decimal byte layout for any input string. |
| `pocs/01_bidi_trojan_source.py` | Stretched-string Trojan Source attack: rendered source shows a benign user-role check, runtime grants admin. |
| `pocs/02_homograph.py` | Cyrillic / Greek lookalikes: Python identifier shadow + `paypal.com` vs `paypаl.com`. |
| `pocs/03_overlong_utf8.py` | IIS-style filter bypass: overlong-encoded `'/'` passes a byte-level filter, a permissive decoder normalises it back. |
| `pocs/04_null_byte.py` | Python-vs-C length-of-string asymmetry: allow-listed `secret.txt` opens `/path/to/secret` at the syscall layer. |
| `pocs/05_double_encoding.py` | Two-stage decode pipeline: `%253Cscript%253E` slips past a WAF that decodes once, lands at the origin as `<script>`. |
| `pocs/99_detect_bidi.py` | Defender tool: scans paths for bidi controls, zero-width characters, and ASCII-confusable Unicode letters. CI-friendly exit codes. |

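
The double-encoding row can be sketched in a few lines. This is a toy model, not the code in `pocs/05_double_encoding.py`: `waf_allows` and `origin_sees` are hypothetical stand-ins for a filter that decodes once and an origin that decodes a second time.

```python
from urllib.parse import unquote


def waf_allows(raw):
    """Toy filter: percent-decode once, then block anything with '<script'."""
    return "<script" not in unquote(raw).lower()


def origin_sees(raw):
    """Toy origin that decodes again after an upstream layer already did."""
    return unquote(unquote(raw))


payload = "%253Cscript%253Ealert(1)%253C/script%253E"
print("WAF allows: ", waf_allows(payload))
print("WAF view:   ", unquote(payload))
print("Origin view:", origin_sees(payload))
```

After one decode the filter still sees `%3Cscript%3E`, which contains no literal `<`; the second decode at the origin is what produces the tag.
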
`samples/trojan_sample.py`: generated by PoC 01. Open it in any editor and compare the rendered text with the raw bytes to feel the attack.
Encoding bugs recur across:
- Source-code review (Trojan Source, homograph identifiers)
- Input validation (overlong UTF-8, null-byte path bypass, double encoding)
- Filesystem / shell / SQL boundaries (NUL truncation, normalisation gaps)
- Phishing and supply-chain attacks (homograph domains, IDN spoofing)
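
For the overlong UTF-8 case in the list above, a minimal sketch of the mechanism follows. The filter and the permissive decoder are hypothetical toys written for this illustration; only the strict behaviour of Python's built-in UTF-8 codec is real.

```python
# Two-byte overlong form of '/' (U+002F). Valid UTF-8 would be the single byte 0x2F.
OVERLONG_SLASH = b"\xc0\xaf"


def byte_filter_allows(raw):
    """Toy byte-level filter: reject input containing a literal '/' byte."""
    return b"/" not in raw


def permissive_decode_2byte(raw):
    """Decode a 2-byte sequence WITHOUT the overlong check (deliberately unsafe)."""
    b0, b1 = raw
    # 110xxxxx 10yyyyyy -> xxxxxyyyyyy, with no minimum-value validation.
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


print("filter allows:     ", byte_filter_allows(b".." + OVERLONG_SLASH + b".."))
print("permissive decoder:", repr(permissive_decode_2byte(OVERLONG_SLASH)))

# A strict decoder refuses the sequence outright.
try:
    OVERLONG_SLASH.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoder rejects:", exc.reason)
```

The gap is between the filter, which inspects raw bytes, and a decoder that maps the overlong bytes back to `/` after the check has already passed.
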
The mitigations are well-known. The bugs keep shipping. This repo exists
to make the mechanism concrete enough that you don't have to re-derive
"why does 0xC0 0xAF matter" the next time you see it in a PR.
MIT. Educational / authorized-research use only — do not point the PoCs at systems you don't own.