Strengthen deobfuscation: lift spent string-array decoders, infer names, sharpen metrics #62

Merged
vasie1337 merged 4 commits into main from claude/stoic-franklin-zft90c
Jun 8, 2026

@vasie1337

Acts on the gaps the obfuscator.io metrics exposed once the synthetic corpus was counted (PR #61). Five parallel workstreams, integrated and verified together.

Results (obfuscator.io corpus, by profile)

| profile | kept% (before → after) | hexrefs (before → after) |
|---|---|---|
| strarr_base64 | 68% → 28% | 163 → 3 |
| strarr_rc4 | 72% → 19% | 211 → 3 |
| numbers_keys | 53% → 22% | 231 → 71 |
| strong | 57% → 20% | 385 → 132 |
| all 140 | 56% → 26% | 998 → 217 |

Corpus output shrank 328K → 154K bytes; the string-array profiles now deobfuscate to near the already-solved controlflow/deadcode/minimal baseline.

What changed

1. Decoder scaffolding cleanup (src/passes/dce.rs) — the high-leverage fix. The Boa sandbox was already lifting string-array decoders correctly, but javascript-obfuscator's accessor memoizes through its own name (if (f.flag===undefined){ f.cache={} } … f.cache[k] …). After inlining every call site, the only surviving references to f were reads of its own properties inside its own body, which pinned the spent decoder and its entire encoded array alive. fn_decl_is_dead now treats a function as dead when every resolved read of its symbol is lexically inside its own body (shadowing-safe), keeping the existing guard for still-called self-reassigning functions.

2. Name inference (src/passes/rename.rs): RenameByRole now names array-iteration callback params (reduce → acc/value, map/filter/forEach → item/index, sort → left/right), C-style loop counters (index), and catch bindings (error) instead of the generic varN. Names stay ≥3 chars so they remain idempotent under the opaque-name guard.

3. hexrefs metric (src/bin/report.rs, tests/golden.rs): opaque% counts distinct tokens, so a single decoder referenced N times barely moved it. hexrefs (the raw count of _0x… occurrences) spikes when a decoder survives intact, so the board now flags the worst failures instead of greenlighting them.

4. Optional-chain member normalization (src/passes/member_normalize.rs): a?.["foo"] parses as a ChainElement, not an Expression, so it was never normalized. Added enter_chain_element to rewrite it to a?.foo (identifier keys only).
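
As an illustration of item 4, here is a minimal sketch of the rewrite rule on a toy member type. The `Member` enum, `is_identifier_key`, and `normalize` are hypothetical names for this example only; the real pass hooks `enter_chain_element` on the parser's own node types, and the ASCII-only identifier check below is a simplifying assumption.

```rust
/// Toy stand-in for a member access (hypothetical, not the parser's real AST).
#[derive(Debug, Clone, PartialEq)]
pub enum Member {
    /// `obj.name` or `obj?.name`
    Static { object: String, property: String, optional: bool },
    /// `obj["key"]` or `obj?.["key"]`
    Computed { object: String, key: String, optional: bool },
}

/// True when `key` can be written after a dot.
/// Assumption: ASCII identifiers only; real JavaScript also allows Unicode.
pub fn is_identifier_key(key: &str) -> bool {
    let mut chars = key.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == '$' => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
}

/// Rewrite a computed member with an identifier-like string key to the
/// static form, carrying the optional flag over unchanged.
pub fn normalize(member: Member) -> Member {
    match member {
        Member::Computed { object, key, optional } if is_identifier_key(&key) => {
            Member::Static { object, property: key, optional }
        }
        // Non-identifier keys and already-static members pass through.
        other => other,
    }
}
```

Under these assumptions, `a?.["foo"]` becomes `a?.foo` with `optional` still set, while a key such as `"foo-bar"` is left in bracket form.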

Two investigations (no code change warranted)

  • recaptcha/hcaptcha 2× expansion is legitimate, not a bug: formatter whitespace on minified input plus the rename pass expanding 16K+ single-char identifiers to readable names (opaque% 36%→8% on sample_14). Optional future tweak: short-base v1…v999 naming to cut the size cost.
  • sample_9 is a custom bytecode VM — the namesake devirtualization target. It has no string-array/CFG-flattening/proxy constructs the current passes target, so it passes through unchanged. Recovering it needs a dedicated VM-devirtualization pass (large, separate effort).

Known remaining residue

strong/numbers_keys keep some _0x refs inside opaque-predicate dead branches wrapped in object-method proxies (obj.m(obj.x, obj.x) → "a"==="a"); collapsing those needs proxy inlining + predicate folding, a separate layer. The brackets metric also overcounts: most residual [" on sample_7/10/3 are array/object literals and decoder-gated base64 keys, not convertible member access.

Verification

Full slow net green: golden snapshots (re-blessed), sample_equivalence (behavior preserved on every real sample), generated_corpus (deobfuscated form reproduces manifest output on all 140), determinism (5), equivalence (102), phase1 (17). cargo clippy clean.

https://claude.ai/code/session_01EjhNTCU89wa5zaeRHMnfEc


Generated by Claude Code

claude added 4 commits June 8, 2026 17:21
The synthetic javascript-obfuscator (obfuscator.io) fixtures in
samples/generated/ gated correctness via manifest.json but were excluded
from the readability metrics: both the live report binary and the
committed SCOREBOARD.md read samples/ non-recursively.

Add a per-profile rollup of the generated corpus (aggregated over all
seeds, one row per obfuscation technique) to both surfaces, so the
obfuscator.io samples count toward readability the same way they count
toward correctness. kept% is byte-weighted, opaque% is the mean per-file
ratio, rounds is the worst case, and converged flags any non-fixpoint.

rename: RenameByRole now infers meaningful names for array-iteration
callback params (reduce->acc/value, map/filter/forEach->item/index,
sort->left/right), C-style loop counters (->index), and catch bindings
(->error), instead of falling back to generic varN. Names stay >=3 chars
so they remain idempotent under the opaque-name guard, and reuse the
existing scope de-duplicator.
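
The role-to-name mapping described above can be sketched as a small lookup. The `Role` enum and `infer_name` are hypothetical names for illustration; the real RenameByRole works on resolved bindings and then runs candidates through the scope de-duplicator, which this sketch leaves out.

```rust
/// Syntactic role of a binding (hypothetical type for this sketch).
pub enum Role<'a> {
    /// Parameter at `position` of a callback passed to `method`.
    CallbackParam { method: &'a str, position: usize },
    /// Counter of a C-style `for` loop.
    LoopCounter,
    /// Binding of a `catch (e)` clause.
    CatchBinding,
}

/// Suggest a readable name for a role, or `None` to fall back to the
/// generic naming scheme.
pub fn infer_name(role: &Role) -> Option<&'static str> {
    match role {
        Role::CallbackParam { method, position } => {
            let names: [&'static str; 2] = match *method {
                "reduce" => ["acc", "value"],
                "map" | "filter" | "forEach" => ["item", "index"],
                "sort" => ["left", "right"],
                _ => return None,
            };
            // Only the first two callback parameters get a role name.
            names.get(*position).copied()
        }
        Role::LoopCounter => Some("index"),
        Role::CatchBinding => Some("error"),
    }
}
```

Every name in the table is at least three characters long, which matches the constraint stated in the commit message.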

report/golden: add a hexrefs column (raw, non-distinct count of _0x...
identifier occurrences) to the live dashboard and committed scoreboard.
opaque% counts DISTINCT tokens, so a single surviving decoder referenced
N times barely moved it; hexrefs spikes when a string-array decoder is
left intact, so the board now flags the worst failures (strarr_base64
163, strarr_rc4 211, numbers_keys 231, strong 385) instead of
greenlighting them. Snapshots/scoreboard re-blessed.
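
To show why a raw count reacts where a distinct count does not, here is a minimal sketch of both tallies. The function names and the byte-level scanner are assumptions for illustration (the real report may tokenize differently), and `distinct_hex` is only a stand-in for the numerator of opaque%, not the metric itself.

```rust
use std::collections::HashSet;

/// Bytes that may appear inside an ASCII JavaScript identifier.
fn is_ident_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_' || b == b'$'
}

/// Collect every `_0x...` identifier occurrence, in source order.
pub fn hex_tokens(src: &str) -> Vec<&str> {
    let bytes = src.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 3 <= bytes.len() {
        // Only match at the start of an identifier, not in the middle of one.
        let at_boundary = i == 0 || !is_ident_byte(bytes[i - 1]);
        if at_boundary && bytes[i..].starts_with(b"_0x") {
            let mut end = i + 3;
            while end < bytes.len() && is_ident_byte(bytes[end]) {
                end += 1;
            }
            out.push(&src[i..end]);
            i = end;
        } else {
            i += 1;
        }
    }
    out
}

/// Raw, non-distinct occurrence count (the hexrefs idea).
pub fn hexrefs(src: &str) -> usize {
    hex_tokens(src).len()
}

/// Number of distinct `_0x...` names.
pub fn distinct_hex(src: &str) -> usize {
    hex_tokens(src).into_iter().collect::<HashSet<_>>().len()
}
```

For a source where one surviving decoder `_0x2b` is called three times, the distinct count rises by one while the raw count rises by three, which is the gap the new column exposes.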
…ional-chain members

dce: a string-array decoder's accessor memoizes through its own name
(if (f.flag===undefined){ f.cache={}; ... } ... f.cache[k] ...). After
every call site is inlined by decoder-lift, the only surviving references
to f are reads of its own properties inside its own body, which pinned
the spent decoder and its entire encoded string array alive forever.
fn_decl_is_dead now treats a function as dead when every resolved read of
its symbol is lexically inside its own body (shadowing-safe via reference
resolution), with the existing guard that a still-called self-reassigning
function is kept. Collapses the obfuscator.io string-array profiles:
strarr_rc4 kept 72%->19% (hexrefs 211->3), strarr_base64 68%->28%
(163->3); corpus output 328K->154K bytes, hexrefs 998->217.
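
The deadness rule above can be sketched as a span-containment check. This is a simplified model under stated assumptions: `FnDecl`, `ReadRef`, and `fn_decl_is_dead_sketch` are hypothetical, reads are represented by byte offsets that have already been resolved to the function's symbol, and the self-reassignment guard is reduced to a precomputed flag.

```rust
use std::ops::Range;

/// One resolved read of the function's symbol (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct ReadRef {
    /// Byte offset of the read in the source.
    pub offset: usize,
}

/// Summary of a function declaration for the deadness check (hypothetical).
pub struct FnDecl {
    /// Byte span covered by the function's own body.
    pub body: Range<usize>,
    /// Every read that resolves to this function's symbol. Shadowed
    /// names with the same spelling are assumed to be excluded already.
    pub reads: Vec<ReadRef>,
    /// Stand-in for the existing guard: the function reassigns its own
    /// name and is still called.
    pub called_self_reassigning: bool,
}

/// A function is treated as dead when no read of its symbol escapes
/// its own body, unless the self-reassignment guard applies.
pub fn fn_decl_is_dead_sketch(decl: &FnDecl) -> bool {
    if decl.called_self_reassigning {
        return false;
    }
    decl.reads.iter().all(|r| decl.body.contains(&r.offset))
}
```

A spent decoder whose only remaining reads are the `f.flag` and `f.cache` accesses inside its own body passes this check; a single read outside the body keeps it alive.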

member-normalize: a?.["foo"] parses as a ChainElement, not an
Expression, so optional-chained computed members were never normalized.
Added enter_chain_element to rewrite them to a?.foo (identifier keys
only, optional flag preserved). Covered by a new phase1 test.

Snapshots/scoreboard re-blessed; full equivalence/determinism/corpus net green.
…-zft90c

# Conflicts:
#	src/bin/report.rs
#	tests/golden.rs
#	tests/snapshots/SCOREBOARD.md