Strengthen deobfuscation: lift spent string-array decoders, infer names, sharpen metrics by vasie1337 · Pull Request #62 · devirt-dev/devirt-core

vasie1337 · 2026-06-08T18:28:30Z

Acts on the gaps the obfuscator.io metrics exposed once the synthetic corpus was counted (PR #61). Five parallel workstreams, integrated and verified together.

Results (obfuscator.io corpus, by profile)

profile	kept% before → after	hexrefs before → after
strarr_base64	68% → 28%	163 → 3
strarr_rc4	72% → 19%	211 → 3
numbers_keys	53% → 22%	231 → 71
strong	57% → 20%	385 → 132
all 140	56% → 26%	998 → 217

Corpus output shrank 328K → 154K bytes; the string-array profiles now deobfuscate to near the already-solved controlflow/deadcode/minimal baseline.

What changed

1. Decoder scaffolding cleanup (src/passes/dce.rs) — the high-leverage fix. The Boa sandbox was already lifting string-array decoders correctly, but javascript-obfuscator's accessor memoizes through its own name (if (f.flag===undefined){ f.cache={} } … f.cache[k] …). After inlining every call site, the only surviving references to f were reads of its own properties inside its own body, which pinned the spent decoder and its entire encoded array alive. fn_decl_is_dead now treats a function as dead when every resolved read of its symbol is lexically inside its own body (shadowing-safe), keeping the existing guard for still-called self-reassigning functions.

2. Name inference (src/passes/rename.rs) — RenameByRole now names array-iteration callback params (reduce→acc/value, map/filter/forEach→item/index, sort→left/right), C-style loop counters (index), and catch bindings (error) instead of generic varN. Names stay ≥3 chars so they remain idempotent under the opaque guard.

3. hexrefs metric (src/bin/report.rs, tests/golden.rs) — opaque% counts distinct tokens, so a single decoder referenced N times barely moved it. hexrefs (raw _0x… occurrence count) spikes when a decoder survives intact, so the board now flags the worst failures instead of greenlighting them.

4. Optional-chain member normalization (src/passes/member_normalize.rs) — a?.["foo"] parses as a ChainElement, not an Expression, so it was never normalized; added enter_chain_element to rewrite to a?.foo (identifier keys only).
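The liveness rule described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in src/passes/dce.rs: `Span`, `FnDecl`, `is_spent`, and the flat list of read offsets are hypothetical stand-ins for whatever the real pass obtains from reference resolution, and the guard for still-called self-reassigning functions is reduced to a single boolean.

```rust
/// Half-open byte range in the source text (hypothetical stand-in for an AST span).
#[derive(Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

impl Span {
    fn contains(&self, offset: usize) -> bool {
        self.start <= offset && offset < self.end
    }
}

/// Just enough about a function declaration to decide liveness.
struct FnDecl {
    /// Span of the function's own body.
    body: Span,
    /// Offsets of every read that resolves to this function's symbol.
    /// Shadowed uses of the same name resolve to another symbol, so they
    /// never appear in this list (assumption about the resolver).
    resolved_reads: Vec<usize>,
    /// Simplified guard: the function reassigns itself and is still called.
    still_called_self_reassigning: bool,
}

/// A declaration counts as spent when nothing outside its own body reads it.
fn is_spent(decl: &FnDecl) -> bool {
    if decl.still_called_self_reassigning {
        return false;
    }
    decl.resolved_reads
        .iter()
        .all(|&offset| decl.body.contains(offset))
}
```

Under this sketch, a decoder whose only remaining reads are `f.flag` and `f.cache` inside its own body is reported as spent, while a single surviving call site outside the body keeps it alive.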

Two investigations (no code change warranted)

recaptcha/hcaptcha 2× expansion is legitimate, not a bug: formatter whitespace on minified input plus the rename pass expanding 16K+ single-char identifiers to readable names (opaque% 36%→8% on sample_14). Optional future tweak: short-base v1…v999 naming to cut the size cost.
sample_9 is a custom bytecode VM — the namesake devirtualization target. It has no string-array/CFG-flattening/proxy constructs the current passes target, so it passes through unchanged. Recovering it needs a dedicated VM-devirtualization pass (large, separate effort).

Known remaining residue

strong/numbers_keys keep some _0x refs inside opaque-predicate dead branches wrapped in object-method proxies (obj.m(obj.x, obj.x) ≡ "a"==="a"); collapsing those needs proxy inlining + predicate folding, a separate layer. The brackets metric also overcounts — most residual [" on sample_7/10/3 are array/object literals and decoder-gated base64 keys, not convertible member access.

Verification

Full slow net green: golden snapshots (re-blessed), sample_equivalence (behavior preserved on every real sample), generated_corpus (deobfuscated form reproduces manifest output on all 140), determinism (5), equivalence (102), phase1 (17). cargo clippy clean.

https://claude.ai/code/session_01EjhNTCU89wa5zaeRHMnfEc

Generated by Claude Code

The synthetic javascript-obfuscator (obfuscator.io) fixtures in samples/generated/ gated correctness via manifest.json but were excluded from the readability metrics: both the live report binary and the committed SCOREBOARD.md read samples/ non-recursively. Add a per-profile rollup of the generated corpus (aggregated over all seeds, one row per obfuscation technique) to both surfaces, so the obfuscator.io samples count toward readability the same way they count toward correctness. kept% is byte-weighted, opaque% is the mean per-file ratio, rounds is the worst case, and converged flags any non-fixpoint.
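The four aggregation rules in that rollup can be illustrated with a small sketch. The struct and field names below are hypothetical, and two definitions are assumptions rather than facts from the commit: kept% is taken to be output bytes over input bytes, and each file's opaque ratio is taken to be a value between 0 and 1.

```rust
/// Per-file measurements (hypothetical field names).
struct FileStats {
    input_bytes: u64,
    output_bytes: u64,
    /// Assumed to be a ratio in 0.0..=1.0 for this file.
    opaque_ratio: f64,
    rounds: u32,
    converged: bool,
}

/// One scoreboard row for a whole profile.
struct ProfileRow {
    kept_pct: f64,
    opaque_pct: f64,
    rounds: u32,
    converged: bool,
}

/// Aggregates all seeds of one profile; returns None for an empty profile.
fn rollup(files: &[FileStats]) -> Option<ProfileRow> {
    if files.is_empty() {
        return None;
    }
    let input: u64 = files.iter().map(|f| f.input_bytes).sum();
    let output: u64 = files.iter().map(|f| f.output_bytes).sum();

    // Byte-weighted: total output over total input, so large files dominate.
    let kept_pct = if input == 0 {
        0.0
    } else {
        100.0 * output as f64 / input as f64
    };
    // Unweighted mean of the per-file ratios.
    let opaque_pct =
        100.0 * files.iter().map(|f| f.opaque_ratio).sum::<f64>() / files.len() as f64;
    // Worst case across seeds.
    let rounds = files.iter().map(|f| f.rounds).max().unwrap_or(0);
    // A single file that missed its fixpoint marks the whole row.
    let converged = files.iter().all(|f| f.converged);

    Some(ProfileRow { kept_pct, opaque_pct, rounds, converged })
}
```

The byte-weighted and mean columns can disagree: with one file going 100 to 50 bytes and another 300 to 30, the byte-weighted kept% is 20, whereas a plain mean of the per-file percentages would be 30.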

rename: RenameByRole now infers meaningful names for array-iteration callback params (reduce->acc/value, map/filter/forEach->item/index, sort->left/right), C-style loop counters (->index), and catch bindings (->error), instead of falling back to generic varN. Names stay >=3 chars so they remain idempotent under the opaque-name guard, and reuse the existing scope de-duplicator.

report/golden: add a hexrefs column (raw, non-distinct count of _0x... identifier occurrences) to the live dashboard and committed scoreboard. opaque% counts DISTINCT tokens, so a single surviving decoder referenced N times barely moved it; hexrefs spikes when a string-array decoder is left intact, so the board now flags the worst failures (strarr_base64 163, strarr_rc4 211, numbers_keys 231, strong 385) instead of greenlighting them. Snapshots/scoreboard re-blessed.
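The difference between a distinct count and a raw occurrence count is easy to see on a toy input. The sketch below is only an illustration of that idea: the function names are hypothetical, and the tokenizer is a rough ASCII splitter that does not skip strings or comments, unlike whatever src/bin/report.rs actually does.

```rust
use std::collections::HashSet;

/// Splits source text into identifier-like tokens (ASCII letters, digits, `_`, `$`).
fn identifiers(src: &str) -> Vec<&str> {
    src.split(|c: char| !(c.is_ascii_alphanumeric() || c == '_' || c == '$'))
        .filter(|token| !token.is_empty())
        .collect()
}

/// Raw, non-distinct count of `_0x...` identifier occurrences.
fn hexrefs(src: &str) -> usize {
    identifiers(src)
        .into_iter()
        .filter(|token| token.starts_with("_0x"))
        .count()
}

/// Number of distinct `_0x...` names, which is what a distinct-token metric sees.
fn distinct_hex_names(src: &str) -> usize {
    identifiers(src)
        .into_iter()
        .filter(|token| token.starts_with("_0x"))
        .collect::<HashSet<_>>()
        .len()
}
```

For `var _0xabc=_0xdec(1)+_0xdec(2)+_0xdec(3);` the distinct count is 2 while the raw count is 4. Adding more calls to `_0xdec` raises only the raw count, which is why a surviving decoder shows up in hexrefs but hardly at all in a distinct-token ratio.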

…ional-chain members

dce: a string-array decoder's accessor memoizes through its own name (if (f.flag===undefined){ f.cache={}; ... } ... f.cache[k] ...). After every call site is inlined by decoder-lift, the only surviving references to f are reads of its own properties inside its own body, which pinned the spent decoder and its entire encoded string array alive forever. fn_decl_is_dead now treats a function as dead when every resolved read of its symbol is lexically inside its own body (shadowing-safe via reference resolution), with the existing guard that a still-called self-reassigning function is kept. Collapses the obfuscator.io string-array profiles: strarr_rc4 kept 72%->19% (hexrefs 211->3), strarr_base64 68%->28% (163->3); corpus output 328K->154K bytes, hexrefs 998->217.

member-normalize: a?.["foo"] parses as a ChainElement, not an Expression, so optional-chained computed members were never normalized. Added enter_chain_element to rewrite them to a?.foo (identifier keys only, optional flag preserved). Covered by a new phase1 test.

Snapshots/scoreboard re-blessed; full equivalence/determinism/corpus net green.
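The "identifier keys only, optional flag preserved" rule from the member-normalize change can be sketched on plain strings. This is an assumption-laden illustration, not the pass itself: `is_identifier_name` and `normalize_member` are hypothetical helpers, the check is ASCII-only, and the real pass rewrites AST nodes in enter_chain_element rather than text.

```rust
/// ASCII-only approximation of a JavaScript identifier name.
/// Reserved words are accepted, since they are legal after a dot.
fn is_identifier_name(key: &str) -> bool {
    let mut chars = key.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == '$' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
}

/// Returns the dotted form for a computed string-key access, or None when
/// the key is not a plain identifier and the access should be left alone.
fn normalize_member(object: &str, key: &str, optional: bool) -> Option<String> {
    if !is_identifier_name(key) {
        return None;
    }
    // Keep the optional flag: `a?.["foo"]` becomes `a?.foo`, never `a.foo`.
    let dot = if optional { "?." } else { "." };
    Some(format!("{object}{dot}{key}"))
}
```

With these helpers, `normalize_member("a", "foo", true)` yields `a?.foo`, while a key such as `foo-bar` or `0abc` returns `None` and the bracket form stays as written.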

claude added 4 commits June 8, 2026 17:21

Merge remote-tracking branch 'origin/main' into claude/stoic-franklin…

bbd3477

…-zft90c # Conflicts: # src/bin/report.rs # tests/golden.rs # tests/snapshots/SCOREBOARD.md

vasie1337 merged commit 0a725ef into main Jun 8, 2026

vasie1337 mentioned this pull request Jun 8, 2026

Follow-ups: object-literal reconstruction pass + accurate brackets metric #63

vasie1337 merged 4 commits into main from claude/stoic-franklin-zft90c
