I like taking apart big, intimidating models until I understand exactly why they work, then putting the smallest runnable version back together so other people can too.
These days that means embodied AI: vision-language-action policies, diffusion / flow matching, and 3D vision. I'm a direct-PhD student at Peking University, but most of what I actually do lives in the repos below.
A recurring theme in my work: complex papers shouldn't stay locked inside their original codebases.
- 🤖 pi-zero-minimal: Physical Intelligence's π0, stripped down to a minimal runnable VLA. No engineering scaffolding, just the core idea you can read in one sitting.
- 🧩 CV_Milestones: clean-room re-implementations of landmark papers (DiT, 3DGS, …), the version I wish had existed when I was first reading them.
And when re-implementing isn't enough, I write things down:
- Flow-Diffusion: diffusion, score matching, and SDEs, all derived from one flow-matching lens. The unified picture I wanted but couldn't find.
- 3D_Vision: camera models → epipolar geometry → SfM, built from the ground up.
On the research side, I've spent time on:
- a diffusion foundation model unifying five image-restoration tasks, and rebuilding its sampler with flow matching for a ~30× speedup (co-first author, Nature Communications)
- using a VLM as a differentiable reward signal to make few-step text-to-image models actually follow instructions, trained end-to-end
- feed-forward 3D reconstruction with Transformers, self-supervised where ground truth doesn't exist
I'm always up for a good conversation about generative models, embodied agents, or why your sampler is slow.