Encoding Attack Vectors — study + runnable PoCs

A reference repo on how character-encoding, Unicode, and string-truncation pitfalls turn into real CVEs. Each attack vector is covered by:

- A markdown writeup in attack_vectors/ with a diagram, the mechanism, and the mitigation.
- A self-contained runnable PoC in pocs/ (for the vectors listed under Runnable PoCs) that builds a sample, runs it, and prints the divergence between "what a reviewer sees" and "what the runtime does".

The repo also ships a defender script (pocs/99_detect_bidi.py) that scans source trees for bidi, zero-width, and homograph characters.

Headline vector: Bidi Trojan Source (CVE-2021-42574). The repo name predates the rest of the content.
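
To make the headline vector concrete, here is a minimal sketch of the detection idea behind a bidi scanner. It is not the code in pocs/99_detect_bidi.py; the function name `find_bidi_controls` and the sample line are assumptions made for illustration. The character set is the group of Unicode embedding, override, and isolate controls that CVE-2021-42574 relies on.

```python
import unicodedata

# Bidi control characters abused by Trojan Source (CVE-2021-42574):
# embeddings, overrides, isolates, and their terminators.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE RLE PDF LRO RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI RLI FSI PDI
}


def find_bidi_controls(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each bidi control in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# Hypothetical source line with a right-to-left override hidden in a string.
sample = 'role = "user\u202e"  # looks harmless when rendered\n'
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, col {col}: {name}")
```

A CI wrapper around a function like this would typically exit non-zero when the hit list is non-empty.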

Quick start

git clone https://github.com/francose/bidi_poc.git
cd bidi_poc

# Runs the marquee attack end-to-end and shows the runtime divergence
python pocs/01_bidi_trojan_source.py

# Scan a tree for invisible characters
python pocs/99_detect_bidi.py samples/

# Self-test the scanner
python pocs/99_detect_bidi.py --self-test

# Operator-side: spray encoded payload variants
python tools/encode_payload.py "<script>alert(1)</script>" --only url2,html-dec

# Generate phishing-relevant lookalikes for a domain
python tools/lookalike_domain.py paypal.com --only homograph,typo-omit

No external dependencies — Python 3.10+ stdlib only.

What's in the box

Study guide

encoding_attack_vectors_study.md — the top-level reference: 8 attack categories with diagrams, mitigations, and study questions.
attack_vectors/ — one short markdown per category, for quick lookup or linking from issues / PRs:
- 01_buffer_overflow.md
- 02_null_byte_injection.md
- 03_homograph_attacks.md
- 04_utf8_overlong_encoding.md
- 05_bom_injection.md
- 06_memory_disclosure.md
- 07_packed_decimal_attacks.md
- 08_double_encoding.md
- 09_bidi_trojan_source.md

Operator tools (red-team / authorized testing)

| Script | Output |
| --- | --- |
| `tools/encode_payload.py` | Payload encoding spray: 13 variant families (url/url2/url3, html-dec, html-hex, unicode-esc, hex-esc, overlong UTF-8, null-suffix, bidi-wrap, space-tab, mixed-case, base64). One line per variant, pipe straight into Burp Intruder / `ffuf -w -` / `wfuzz`. |
| `tools/lookalike_domain.py` | Phishing domain generator: homograph, typo (omit/swap/double/replace), TLD swap, bitsquat, hyphenate. Emits punycode/IDNA form for each non-ASCII candidate. CSV mode for tracking. |
| `tools/filename_bypass.py` | File-upload allowlist bypass: null-byte (`\x00`, `%00`, `%2500`), multi-extension (Apache mod_mime), trailing dot/space (Windows), Unicode dot lookalikes, overlong UTF-8 dot, bidi-reversed extension. |
| `tools/bidi_inject.py` | Source-code planter: `comment-veil` mode hides executable code behind a fake C-style `/* comment */`; `string-stretch` mode hides extra content inside a string literal. For supply-chain / code-review evasion research. |
| `tools/typosquat_package.py` | Package-name typosquat generator (PyPI / npm / RubyGems / crates). Optional live registry probe (`--check pypi`) reports availability per candidate. |

All tools are stdlib-only, single-file, designed for piping. Use with authorization — same MIT/disclaimer terms as the rest of the repo.
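
On the defensive side of the lookalike-domain idea, the sketch below shows why the punycode/IDNA form is worth looking at: a homograph that renders almost identically to the genuine name has a visibly different ASCII wire form. It is an illustration only, not code from tools/lookalike_domain.py. The helper names are hypothetical, the script check is a rough approximation based on Unicode character names, and Python's built-in `idna` codec implements the older IDNA 2003 rules.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Approximate the scripts in a label from the first word of each character name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER P" -> "LATIN", "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def describe_domain(domain: str) -> dict:
    """Report the IDNA (punycode) wire form and whether any label mixes scripts."""
    labels = domain.split(".")
    return {
        "ascii_form": domain.encode("idna").decode("ascii"),
        "mixed_script": any(len(scripts_in_label(lbl)) > 1 for lbl in labels),
    }


genuine = "paypal.com"
lookalike = "payp\u0430l.com"  # U+0430 CYRILLIC SMALL LETTER A replaces the Latin 'a'

print(describe_domain(genuine))
print(describe_domain(lookalike))
```

A mixed-script label is a useful warning sign but not proof of abuse; a lookalike written entirely in one non-Latin script would pass this particular check.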

Runnable PoCs

Each script prints a clearly labelled walkthrough — encoded bytes, filter view, runtime view, and the gap between them.

| Script | What it demonstrates |
| --- | --- |
| `pocs/00_encoding_inspector.py` | Side-by-side UTF-8 / UTF-16 / UTF-32 / Packed-Decimal byte layout for any input string. |
| `pocs/01_bidi_trojan_source.py` | Stretched-string Trojan Source attack: rendered source shows a benign user-role check, runtime grants admin. |
| `pocs/02_homograph.py` | Cyrillic / Greek lookalikes: Python identifier shadowing, plus paypal.com vs paypаl.com (Cyrillic а). |
| `pocs/03_overlong_utf8.py` | IIS-style filter bypass: an overlong-encoded '/' passes a byte-level filter, then a permissive decoder normalises it back. |
| `pocs/04_null_byte.py` | Python-vs-C string-length asymmetry: a filename that passes a `secret.txt`-style allow-list check opens `/path/to/secret` once it is truncated at the NUL at the syscall layer. |
| `pocs/05_double_encoding.py` | Two-stage decode pipeline: `%253Cscript%253E` slips past a WAF that decodes once and lands at the origin as `<script>`. |
| `pocs/99_detect_bidi.py` | Defender tool: scans paths for bidi controls, zero-width characters, and ASCII-confusable Unicode letters. CI-friendly exit codes. |
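
The two-stage decode gap described for the double-encoding PoC can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's PoC: `waf_allows`, `origin_view`, and `is_canonical` are hypothetical stand-ins for a filter that decodes once, a backend that ends up decoding twice, and one possible mitigation.

```python
from urllib.parse import unquote


def waf_allows(raw: str) -> bool:
    """Hypothetical filter: decode once, then reject anything containing '<'."""
    return "<" not in unquote(raw)


def origin_view(raw: str) -> str:
    """Hypothetical backend: the framework decodes once, then app code decodes again."""
    return unquote(unquote(raw))


def is_canonical(raw: str) -> bool:
    """Mitigation sketch: reject input that still changes after one decode pass."""
    once = unquote(raw)
    return unquote(once) == once


payload = "%253Cscript%253E"
print("filter sees :", unquote(payload))      # %3Cscript%3E
print("filter allow:", waf_allows(payload))   # True
print("origin sees :", origin_view(payload))  # <script>
print("canonical   :", is_canonical(payload)) # False
```

The point of `is_canonical` is that the filter and the origin must agree on how many times input is decoded; rejecting anything that is still encoded after one pass is one way to force that agreement.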

Samples

samples/trojan_sample.py — generated by PoC 01. Open it in any editor and compare the rendered text with the raw bytes to feel the attack.
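
If your editor hides the difference, a small helper can make the raw content visible. The sketch below rewrites a line so that every non-ASCII or non-printable character appears as an explicit code point. The function name `reveal` and the example line are assumptions; the real samples/trojan_sample.py may contain different characters.

```python
import unicodedata


def reveal(line: str) -> str:
    """Rewrite a line so every non-ASCII or non-printable character is visible."""
    out = []
    for ch in line:
        if ch.isascii() and ch.isprintable():
            out.append(ch)
        else:
            name = unicodedata.name(ch, "UNNAMED")
            out.append(f"<U+{ord(ch):04X} {name}>")
    return "".join(out)


# Hypothetical line; the generated sample file may differ.
line = 'access = "user\u202e\u2066"'
print(reveal(line))
```

Applied line by line to a file opened with `encoding="utf-8"`, this gives a reviewer-safe view that also exposes zero-width characters and homograph letters.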

Why this exists

Encoding bugs keep recurring across:

- Source-code review (Trojan Source, homograph identifiers)
- Input validation (overlong UTF-8, null-byte path bypass, double encoding)
- Filesystem / shell / SQL boundaries (NUL truncation, normalisation gaps)
- Phishing and supply-chain attacks (homograph domains, IDN spoofing)
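
As one concrete instance of the boundary problem, the sketch below shows NUL truncation: a check written against the full Python string and a C-style reader of the same bytes disagree about where the name ends. The function names are hypothetical, and `ctypes` is used only to mimic how a NUL-terminated C API would read the buffer.

```python
import ctypes


def passes_allowlist(filename: str) -> bool:
    """Hypothetical check that only inspects the extension of the Python string."""
    return filename.endswith(".txt")


def c_string_view(filename: str) -> bytes:
    """What a NUL-terminated C API would read from the same bytes."""
    return ctypes.create_string_buffer(filename.encode("utf-8")).value


name = "secret\x00.txt"
print(passes_allowlist(name))  # True: Python sees all 11 characters
print(c_string_view(name))     # b'secret': the C view stops at the NUL

# CPython's own open() refuses paths with an embedded NUL, which is the
# behaviour to copy when validating input before it crosses into C code.
try:
    open(name)
except ValueError as exc:
    print("open() rejected the path:", exc)
```
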

The mitigations are well-known. The bugs keep shipping. This repo exists to make the mechanism concrete enough that you don't have to re-derive "why does 0xC0 0xAF matter" the next time you see it in a PR.
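
For the 0xC0 0xAF case specifically, the arithmetic is short enough to show inline. The decoder below is deliberately unsafe and written only for illustration (the name `permissive_decode_2byte` is an assumption): it applies the 2-byte UTF-8 bit layout without checking that the result actually needs two bytes, which is the check a conforming decoder must make.

```python
def permissive_decode_2byte(data: bytes) -> str:
    """Deliberately unsafe 2-byte UTF-8 decoder that skips the overlong check."""
    lead, cont = data
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


overlong_slash = bytes([0xC0, 0xAF])

print(b"/" in overlong_slash)                   # False: a byte-level filter finds no 0x2F
print(permissive_decode_2byte(overlong_slash))  # /  (the lenient decoder recovers U+002F)

# A conforming decoder rejects the sequence outright.
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoder:", exc.reason)
```

The bypass only works when the filter inspects raw bytes and a later stage decodes leniently, so the fix is to decode strictly once and validate the decoded result.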

License

MIT. Educational / authorized-research use only — do not point the PoCs at systems you don't own.
