
android: Fix GC hook installation on Android 16 (CMC + stripped libart) #394

Open
Kolektori wants to merge 1 commit into frida:main from Kolektori:android-16-libart-cmc-gc-hook

Conversation

@Kolektori

Fixes #387.

On Android 16 (e.g. build BP4A.251205.006, com.android.art@361302280) the process crashes with a NULL dereference in art::CodeInfo::DecodeGcMasksOnly during a GC stack walk whenever a replacement ArtMethod is on some thread's stack when the collector runs. The reporter's backtrace in #387 shows the fault driven by art::gc::collector::MarkCompact::RunPhases → VisitRoots → StackVisitor::WalkStack → DecodeGcMasksOnly.

Two A16 libart.so changes together break the GC synchronization machinery in ensureArtKnowsHowToHandleReplacementMethods and instrumentArtGarbageCollection:

  1. libart.so is now stripped — .symtab is gone but the library retains a .gnu_debugdata section (LZMA-compressed mini-debuginfo). Module.findSymbolByName reads .dynsym + .symtab and so returns null for Heap::CollectGarbageInternal and ConcurrentCopying::CopyingPhase on A16, silently skipping the hook install. Module.enumerateSymbols() does parse the mini-debuginfo, so the symbols are still reachable that way.

  2. A16 defaults to Concurrent Mark Compact (CMC) instead of Concurrent Copying. Even if we resolve ConcurrentCopying::CopyingPhase, it never fires under CMC, so replacement ArtMethods are never re-synchronized after compaction. MarkCompact::RunPhases is the CMC lifecycle-event equivalent.

Changes

  • Add a small resolveDebugdataSymbol(module, name) fallback that lazily caches Module.enumerateSymbols() per module, and plumb it into temporaryApi.find as a last resort after findExportByName / findSymbolByName. Same plumbing restores Heap::CollectGarbageInternal resolution transparently for all ART-symbol callers.
  • Reroute the two raw art.findSymbolByName(...) call sites in instrumentArtGarbageCollection and instrumentArtFixupStaticTrampolines through api.find so they pick up the mini-debuginfo fallback.
  • When CopyingPhase is inlined (also seen on A16+), fall back to ConcurrentCopying::RunPhases as the hook point — one level up in the same phase-driver function.
  • Additionally hook MarkCompact::RunPhases with the existing artController.hooks.Gc.copyingPhase callback. The callback is collector-agnostic — it just synchronizes entrypoints at a "world is consistent again" lifecycle point — so reusing it for CMC is correct. Both hooks can coexist; only the active collector dispatches its phase.
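The lookup order and hook-point selection described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: `findWithFallback`, `pickGcHookTargets` and `GC_HOOK_CANDIDATES` are hypothetical names, `module` is assumed to be a Frida `Module`-like object, and the mangled names are assumed from the zero-argument member functions named above.

```javascript
// Hypothetical helper: try exports, then the symbol table, then a
// last-resort resolver (e.g. one backed by the mini-debuginfo symbols).
function findWithFallback(module, name, resolveFallback) {
  return module.findExportByName(name)
      ?? module.findSymbolByName(name)
      ?? resolveFallback(module, name);
}

// Assumed mangled names. Each inner array lists alternatives in order of
// preference for one collector.
const GC_HOOK_CANDIDATES = [
  // Concurrent Copying: CopyingPhase, else RunPhases one level up.
  ['_ZN3art2gc9collector17ConcurrentCopying12CopyingPhaseEv',
   '_ZN3art2gc9collector17ConcurrentCopying9RunPhasesEv'],
  // Concurrent Mark Compact.
  ['_ZN3art2gc9collector11MarkCompact9RunPhasesEv'],
];

// `find` maps a symbol name to an address or null. Returns one hook target
// per collector for which any alternative resolved.
function pickGcHookTargets(find) {
  const targets = [];
  for (const alternatives of GC_HOOK_CANDIDATES) {
    for (const name of alternatives) {
      const address = find(name);
      if (address !== null) {
        targets.push({ name, address });
        break; // first alternative that resolves wins for this collector
      }
    }
  }
  return targets;
}
```

Every returned target would get the same onLeave callback, since only the active collector ever runs its phase function.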

Net diff: lib/android.js +38 / −4.

Testing

Reproduced #387 on a Cuttlefish x86_64 guest running aosp-android-latest-release (build 15150359, API 36, same libart BuildId class as the Pixel 7 reporter's build). Unpatched 7.0.13: HeapTaskDaemon SIGSEGV within seconds of attaching a .implementation hook to any hot constructor (e.g. java.net.URL.<init>(String) or java.io.File.<init>(String)).
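For reference, a trigger of the kind described above might look like the sketch below. It is an assumption about the shape of the repro, not the exact script used: `installConstructorHooks` and `onCall` are hypothetical names, and `Java` is the frida-java-bridge object, passed in as a parameter.

```javascript
// Sketch: replace the (String) constructors of two hot classes so that
// replacement ArtMethods are frequently on some thread's stack.
function installConstructorHooks(Java, onCall) {
  Java.perform(() => {
    for (const className of ['java.net.URL', 'java.io.File']) {
      const klass = Java.use(className);
      const ctor = klass.$init.overload('java.lang.String');
      ctor.implementation = function (arg) {
        onCall(className, arg);
        // Forward to the original constructor.
        return this.$init(arg);
      };
    }
  });
}
```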

With this patch applied, the following hook set runs to completion simultaneously on the same target for a full 90s analysis window:

  • .implementation swaps on java.net.URL.<init>(String) and java.io.File.<init>(String)
  • .implementation swaps on all 17 java.lang.StringFactory.newString* overloads
  • Interceptor.attach on art::mirror::String::AllocFromModifiedUtf8 (3 overloads) and AllocFromUtf16
  • Interceptor.attach on libc __system_property_get, __system_property_find, open, fopen*, freopen*, stat
  • Interceptor.attach on libdl dynamic-loader exports
  • Interceptor.attach on libart/libdexfile dex-retrieval paths

Observed: zero tombstones, no DecodeGcMasksOnly frames, full MITM / logcat / media / trace artifact set collected, hooks actively firing (600+ __system_property_get rewrites, 121 File.<init> callbacks, 60 MessageDigest callbacks in one run).

Notes

  • No behavior change on pre-A16 builds: the new MarkCompact lookup simply returns null there via api.find, and the mini-debuginfo fallback is a no-op when findSymbolByName already succeeds.
  • WeakMap-keyed symbol cache so entries die with the module.
  • Kept Thread::RunFlipFunction hook as-is — still exported, still correct on CC builds.
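A minimal sketch of the WeakMap-keyed cache mentioned in the notes, assuming `module` is a Frida `Module`-like object whose `enumerateSymbols()` yields `{ name, address }` entries. The helper in the actual diff may differ in detail; keeping the first address seen for a duplicate name is a choice made here for illustration.

```javascript
// Module object -> Map(symbol name -> address). Entries become collectable
// once the module object itself is no longer referenced.
const symbolCache = new WeakMap();

function resolveDebugdataSymbol(module, name) {
  let byName = symbolCache.get(module);
  if (byName === undefined) {
    byName = new Map();
    try {
      for (const sym of module.enumerateSymbols()) {
        // Keep the first address seen for each name.
        if (!byName.has(sym.name)) {
          byName.set(sym.name, sym.address);
        }
      }
    } catch (e) {
      // Enumeration failed: keep whatever was collected so far.
    }
    // Cache even on failure so enumeration is not retried on every lookup.
    symbolCache.set(module, byName);
  }
  return byName.get(name) ?? null;
}
```

The full enumeration happens once per module, on the first lookup that reaches this fallback.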

Android 16 ships libart.so without .symtab, so findExportByName and
findSymbolByName miss internal ART symbols. Parse .gnu_debugdata via
enumerateSymbols() and cache the result per module.

Also attach the GC synchronize-on-leave hook to MarkCompact::RunPhases
for Android 16's Concurrent Mark Compact collector, and fall back to
ConcurrentCopying::RunPhases when CopyingPhase is inlined.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kolektori Kolektori marked this pull request as ready for review April 23, 2026 08:55
Comment thread lib/android.js
if (byName === undefined) {
  byName = new Map();
  try {
    for (const sym of module.enumerateSymbols()) {
Member

@oleavr oleavr May 11, 2026


Thanks for the detailed write-up.

Before merging I want to understand the findSymbolByName vs enumerateSymbols split, because in Gum both go through the same gum_elf_module_enumerate_symbols, which already falls back to .gnu_debugdata when .symtab is missing. So in principle they shouldn't disagree.

A few questions:

  1. Which frida / frida-server version on the A16 device? The mini-debuginfo fallback landed in gum 8ed32c4d (Dec 2024), the dynsym fallback in 01eadbff (Mar 2026).
  2. Does findSymbolByName('libart.so', '_ZN3art2gc4Heap22CollectGarbageInternalENS0_9collector6GcTypeENS0_7GcCauseEbj') start working after an enumerateSymbols() pass on the same module? That would point at a state/ordering bug in Gum.
  3. readelf -S on the device's libart.so — which of .symtab / .dynsym / .gnu_debugdata are present?

If it's a Gum bug I'd rather fix it there.
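Question 2 could be checked with a probe along these lines. This is only a sketch: `probeSymbolLookup` is a hypothetical name, and `module` is assumed to be the `Module` object for libart.so in a Frida agent (for example from `Process.getModuleByName('libart.so')`).

```javascript
// Compare findSymbolByName() before and after a full enumerateSymbols()
// pass on the same module, and note whether the pass itself saw the symbol.
function probeSymbolLookup(module, name) {
  const before = module.findSymbolByName(name);

  let seenInEnumeration = false;
  for (const sym of module.enumerateSymbols()) {
    if (sym.name === name) {
      seenInEnumeration = true;
      break;
    }
  }

  const after = module.findSymbolByName(name);

  return {
    before: before !== null,
    seenInEnumeration,
    after: after !== null,
  };
}
```

A result of `{ before: false, seenInEnumeration: true, after: true }` would suggest a state or ordering problem, whereas `after: false` would mean the two lookup paths genuinely disagree.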

@radubogdan2k

Checked on Samsung S25 / Android 16 BP4A.251205.006 (Snapdragon 8 Elite, arm64), with this patch applied on top of frida-java-bridge@7.0.13.

Summary
The patch as it stands is necessary but not sufficient on this build:

  1. The two findSymbolByName → api.find swaps and the new .gnu_debugdata fallback are 100% required — without them none of the existing GC hooks (Heap::CollectGarbageInternal onLeave, Thread::RunFlipFunction onEnter, ConcurrentCopying::CopyingPhase onLeave) ever attach on this build. libart's .symtab is stripped; only .gnu_debugdata carries the internal symbols. Confirmed by nm /tmp/libart.dbg after objcopy --dump-section .gnu_debugdata=… + xz -d.
  2. The MarkCompact::RunPhases hook with onLeave only does not prevent the documented DecodeGcMasksOnly+0x?? → ReferenceMapVisitor::VisitFrame+0x?? → CheckpointMarkThreadRoots::Run crash. Crashes consistently within RunPhases's call to CheckpointMarkThreadRoots, before onLeave ever fires. Tried adding onEnter with on_leave_gc_concurrent_copying_copying_phase (the access-flags-only sync) and separately with the full synchronize_replacement_methods (access flags + quick_code rewire) — neither changed the symptom. The stack walker reads the OatQuickMethodHeader regardless of replacement-method access-flags state on BP4A.
  3. Root cause from disassembly (art::CodeInfo::DecodeGcMasksOnly, offset +48):
    ldr w8, [x0] ; first instruction that touches the OatQuickMethodHeader
  4. Crashes with fault addr 0x0 because on_art_method_get_oat_quick_method_header in the existing CModule returns NULL for replacement methods. Older ART checked for null; BP4A's DecodeGcMasksOnly does not — it unconditionally dereferences.
  5. Symptomatic workaround that does eliminate this crash: from the user agent, Interceptor.attach to DecodeGcMasksOnly's entry; in onEnter, if args[0].isNull(), swap in a zeroed page from Memory.alloc(4096). The function then reads zeros, computes x0 - *x0 = x0, and calls the CodeInfo decoder on a zeroed blob, which returns "no GC roots", so the caller is happy. The DecodeGcMasksOnly crash no longer occurs. (This obviously belongs inside the bridge, not the agent; flagging it in case it informs a better fix at the on_art_method_get_oat_quick_method_header level. The proper fix is presumably for on_art_method_get_oat_quick_method_header to return a pre-allocated safe-empty header instead of NULL.)
  6. A second, structurally similar crash is hiding behind the first one on this build:
    #00 art::JValue art::InvokeWithVarArgs<_jmethodID*>(...)+148/+156/+168
    #1 art::JNI::CallStaticObjectMethodV+120
    #2 <app's own JNI code calling CallStaticObjectMethod>
  7. Fault addresses are a mix of null+small offsets (e.g. 0x68) and high-bits-garbled pointers (e.g. 0x0042164070379c40, 0xb091bb17, 0x0000000df00640fa) — type-confusion pattern. Triggered on background worker threads (AppShared-Cpu-*, AppStartUp-1) by the app's own native code doing routine (*env)->CallStaticObjectMethod(...).
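The agent-side workaround from item 5 could look roughly like the sketch below. `makeNullHeaderGuard` is a hypothetical name, and how the address of DecodeGcMasksOnly is resolved is left out. It only masks the symptom and is not a proposed fix.

```javascript
// Builds Interceptor callbacks that replace a NULL first argument with a
// caller-supplied zero-filled page. The closure keeps `zeroPage` referenced,
// which matters because Frida frees Memory.alloc() memory once no JS handle
// to it remains.
function makeNullHeaderGuard(zeroPage) {
  return {
    onEnter(args) {
      if (args[0].isNull()) {
        args[0] = zeroPage;
      }
    },
  };
}

// In an agent (not executed here; decodeGcMasksOnlyAddress is assumed to be
// resolved elsewhere):
//   Interceptor.attach(decodeGcMasksOnlyAddress,
//       makeNullHeaderGuard(Memory.alloc(4096)));
```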


Development

Successfully merging this pull request may close these issues.

[Android 16] App crashes during Garbage Collection when Frida is attached (Build BP4A.251205.006)

3 participants