Blurface is a cross-platform command-line tool — and a tiny Python library — that blurs every human face in an MP4 video with a fully GPU-accelerated PyTorch pipeline. The default detector is YOLOv8-face via ultralytics (a state-of-the-art single-stage detector, robust on moving and partially-occluded faces); a lighter facenet-pytorch MTCNN backend is available as a fallback. The pixel mosaic is computed on the GPU with torch.nn.functional.interpolate, and the original audio track is re-muxed back into the output via ffmpeg. A built-in evaluation module emits a CSV, a JSON metrics report, and six PNG plots so you can quantify every run.
- Pure PyTorch, end-to-end. No TensorFlow anywhere on the hot path. Detection and mosaic both live on the same `torch.device`.
- State-of-the-art detector for motion. Default backend is YOLOv8-face: single forward pass per frame, low jitter on moving faces, no `transformers` import noise.
- Cross-platform GPU acceleration. Auto-selects CUDA on Windows / Linux, MPS on Apple Silicon, CPU otherwise, with graceful fallback.
- Batched inference + FP16. Set `--batch-size` to whatever your GPU can hold; add `--half` for FP16 on CUDA.
- Rectangular or elliptical mosaic with a configurable block size.
- Audio passthrough via the `ffmpeg` CLI (preferred) or `ffmpeg-python` (fallback).
- Built-in evaluation. Per-frame metrics CSV + JSON summary + six PNG plots and an optional CPU-vs-GPU benchmark.
- Three console scripts. `blurface`, `blurface-eval`, and `blurface-install-gpu` are registered on install.
Blurface targets Python ≥ 3.9 and is verified on Windows, Linux, and macOS.
# Recommended: a clean conda env
conda create -n blurface python=3.11 -y
conda activate blurface
Installing the right PyTorch build is the single most common failure point. The default pip install torch on Windows installs the CPU build, which is why --device cuda would otherwise refuse to run.
NVIDIA GPU (recommended) — CUDA 12.1 wheels:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
Newer GPUs (e.g. RTX 50-series / Blackwell, `sm_120` compute): Standard CUDA 12.1/12.4 builds will lack your GPU's kernel architecture and crash with `CUDA error: no kernel image is available`. Install the PyTorch nightly bundled with CUDA 13.0 (or newer):
pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu130 --upgrade
If your NVIDIA driver is older, you may need cu118 instead. Check with nvidia-smi and the official PyTorch install matrix.
Apple Silicon (MPS):
pip install torch torchvision
CPU only:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
From PyPI:
pip install blurface
Or from a git clone (editable):
git clone https://github.com/Ezharjan/blurface.git
cd blurface
pip install -e .
This pulls ultralytics, opencv-python, ffmpeg-python, matplotlib, pandas, tqdm, Pillow, … and registers three console scripts: blurface, blurface-eval, blurface-install-gpu.
Optional MTCNN fallback backend:
pip install "blurface[mtcnn]"
The audio re-mux step needs the ffmpeg binary on PATH:
| Platform | Command |
|---|---|
| Windows | choco install ffmpeg (or download from https://ffmpeg.org/download.html and add ffmpeg.exe to PATH) |
| macOS | brew install ffmpeg |
| Linux | sudo apt install ffmpeg |
If ffmpeg isn't available and the source has an audio track, Blurface stops with an error and install instructions instead of writing a muted MP4.
After installation, run the diagnostic:
blurface-install-gpu
You should see something like:
========================================================================
PyTorch
========================================================================
torch : 2.4.1+cu121
CUDA build : 12.1
cuda avail. : True
device[0] : NVIDIA GeForce RTX 4090 (sm_89, 24.0 GB)
If cuda avail. is False but nvidia-smi works, you're on the CPU build of torch — repair it with:
blurface-install-gpu --fix --cuda 12.1
The same script also accepts --cpu (force CPU wheels) and --nightly (use the PyTorch nightly index for very new architectures).
blurface <input.mp4> [options]
The most common flags:
| Flag | Default | Description |
|---|---|---|
| `input` | - | Path to the input MP4 video (required). |
| `--output`, `-o` | `<stem><YYMMDDHHMM>.mp4` | Output file path. |
| `--mosaic-size`, `-m` | `10` | Mosaic block size in pixels; higher = coarser blur. |
| `--blur-shape`, `-s` | `ellipse` | `ellipse` or `rectangle`. |
| `--device`, `-d` | `auto` | `auto`, `cuda`, `mps`, or `cpu`. |
| `--backend` | `auto` | `auto` (→ `yolo`), `yolo`, or `mtcnn`. |
| `--batch-size`, `-b` | `8` | Frames per detection batch. |
| `--half` | off | FP16 inference on CUDA. |
| `--confidence`, `-c` | `0.5` | Minimum face confidence in [0, 1]. |
| `--imgsz` | `640` | YOLO inference image size. Raise for tiny faces, lower for speed. |
| `--min-face-size` | `20` | MTCNN minimum face edge in px. |
| `--model-path` | - | Local YOLO-face `.pt` file (skips the download). |
| `--model-url` | - | Custom URL for YOLO-face weights. |
| `--no-cpu-fallback` | off | Hard-fail when CUDA/MPS is requested but unavailable. |
| `--report` | - | Path for a JSON metrics report. |
| `--plots-dir` | - | If set, evaluation PNGs and CSV are written here. |
| `--quiet` / `--verbose` | off | Lower / raise the log level. |
| `--version` | - | Print the installed version and exit. |

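
The default output name in the table follows the pattern `<stem><YYMMDDHHMM>.mp4`. The sketch below shows one way such a name could be built; `default_output_name` is a hypothetical helper, and whether Blurface writes the file next to the input or into the working directory is not stated here, so only the file name is returned.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_output_name(input_path: str, now: Optional[datetime] = None) -> str:
    """Build '<stem><YYMMDDHHMM>.mp4' from an input path (illustrative helper)."""
    now = now or datetime.now()
    # Two-digit year, month, day, hour, minute, matching the YYMMDDHHMM pattern.
    stamp = now.strftime("%y%m%d%H%M")
    return f"{Path(input_path).stem}{stamp}.mp4"


print(default_output_name("clips/input.mp4", datetime(2024, 5, 17, 9, 30)))
# prints: input2405170930.mp4
```

Passing `-o` explicitly avoids depending on the timestamped name in scripts.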
Run blurface --help for the full reference and worked examples.
# 1. Defaults: ellipse mosaic, auto device, YOLOv8-face detector.
blurface input.mp4
# 2. Force CUDA, FP16, larger batch, custom output path.
blurface input.mp4 -d cuda -b 32 --half -o out/blurred.mp4
# 3. Coarser rectangular mosaic (block size 20).
blurface input.mp4 -m 20 -s rectangle
# 4. Use the MTCNN fallback backend (needs the [mtcnn] extra).
blurface input.mp4 --backend mtcnn
# 5. Emit a full JSON metrics report and a directory of PNG plots.
blurface input.mp4 --report out/report.json --plots-dir out/plots
# 6. Provide your own YOLO-face weights (skips the download).
blurface input.mp4 --model-path /path/to/yolov8n-face.pt
# 7. Raise the inference image size for lots of tiny faces.
blurface input.mp4 --imgsz 1280 --batch-size 4
# 8. Full evaluation: report + plots + CPU-vs-GPU benchmark
blurface-eval video.mp4 --output D:\blurface\out\blurred.mp4 --report-dir D:\blurface\out\report --device auto --batch-size 8 --benchmark --benchmark-frames 120
from blurface import FaceMosaicProcessor
from blurface.evaluate import render_plots
proc = FaceMosaicProcessor(
device="auto", # cuda > mps > cpu, with fallback
backend="yolo", # or "mtcnn", or "auto"
batch_size=16,
half=True, # FP16 on CUDA (no-op elsewhere)
imgsz=640,
confidence=0.5,
)
report = proc.process_video(
"input.mp4", "output.mp4",
report_path="out/report.json",
collect_metrics=True,
)
render_plots(report, "out/plots")
print(f"{report.realtime_fps:.1f} fps on {report.device} ({report.backend})")
Public objects re-exported from the top-level package:
- `FaceMosaicProcessor`: the pipeline.
- `RunReport`, `FrameMetric`: dataclasses returned by `process_video`.
- `select_device(preferred, allow_cpu_fallback)`: the device picker.
- `describe_device(device)`: human-readable device label.
- `build_detector(...)`, `YoloFaceDetector`, `MtcnnDetector`: detection backends.
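The device priority (`cuda > mps > cpu`, with an optional CPU fallback) can be sketched as below. This is not the source of `select_device`; `pick_device` and `probe_torch` are hypothetical names, and the availability flags are passed in explicitly so the logic can be read and tested without a GPU.

```python
def pick_device(preferred: str, cuda_ok: bool, mps_ok: bool,
                allow_cpu_fallback: bool = True) -> str:
    """Resolve 'auto' / 'cuda' / 'mps' / 'cpu' to a concrete device name."""
    available = {"cuda": cuda_ok, "mps": mps_ok, "cpu": True}
    if preferred == "auto":
        # Priority order described in this README: cuda > mps > cpu.
        return next(name for name in ("cuda", "mps", "cpu") if available[name])
    if preferred not in available:
        raise ValueError(f"unknown device: {preferred!r}")
    if available[preferred]:
        return preferred
    if allow_cpu_fallback:
        return "cpu"
    raise RuntimeError(f"{preferred} requested but no {preferred} device is available")


def probe_torch() -> tuple:
    """Return (cuda_ok, mps_ok); both False when torch is not installed."""
    try:
        import torch
    except ImportError:
        return False, False
    mps = getattr(torch.backends, "mps", None)
    return torch.cuda.is_available(), bool(mps and mps.is_available())


print(pick_device("auto", *probe_torch()))
```

With `allow_cpu_fallback=False` the sketch raises instead of degrading, which mirrors what `--no-cpu-fallback` is described as doing.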
The video is processed in five clearly separated stages. Detection and mosaic run on the same torch.device to avoid host round-trips:
- Decode (CPU). `cv2.VideoCapture` reads MP4 frames as BGR `uint8` numpy arrays. Frames are accumulated into a list of length `--batch-size`.
- Detect (device). The batch is converted to RGB and handed to the active detector backend. The detector returns, per frame, an `(N, 4)` array of `[x1, y1, x2, y2]` boxes in original pixel space and an `(N,)` array of confidences.
- Mosaic (device). Each frame is uploaded once to the device as a CHW float tensor (FP16 if `--half`). For every box:
  - the cropped face region is down-sampled to `mosaic_size × mosaic_size` with `F.interpolate(mode="bilinear", align_corners=False)`;
  - it is then up-sampled back to the box size with `F.interpolate(mode="nearest")`; that's the classic pixelation effect, computed in a single bilinear + nearest kernel pair;
  - for `blur_shape="ellipse"` an inscribed elliptical mask is built on-device (`(x − cx)² / rx² + (y − cy)² / ry² ≤ 1`) and the mosaic is alpha-blended over the original: only the elliptical region is replaced, and the corners of the bounding box are preserved.
- Encode (CPU). The blurred frame is clamped, cast back to `uint8`, transposed to HWC, copied to the CPU, and written to a temporary `mp4v`-encoded MP4 with `cv2.VideoWriter`.
- Mux (FFmpeg). Finally `ffmpeg` re-encodes the temporary video as H.264 (`libx264`, CRF 20, `medium` preset) and stream-copies the original audio track with `-c:a copy -map 0:v:0 -map 1:a:0?`. The audio is preserved bit-for-bit: no re-encoding, no quality loss, same codec / bitrate / sample rate as the source. If stream-copy is rejected (rare; happens when the source audio codec isn't allowed in the MP4 container, e.g. PCM), Blurface falls back to a 192 kbit/s AAC re-encode. `ffprobe` then verifies that the output actually contains audio when the source did; mismatches raise rather than silently producing a muted file. If `ffmpeg` is missing and the source has audio, Blurface fails loudly with install instructions instead of dropping the audio.
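The mosaic stage above can be illustrated on the CPU with NumPy. This is a stand-in, not the pipeline's code: it swaps the bilinear `F.interpolate` down-sample for nearest-neighbour sampling and the alpha blend for a hard boolean mask, and `pixelate`, `ellipse_mask`, `mosaic_box`, and the `grid` parameter are hypothetical names. The structure (down-sample, up-sample, inscribed-ellipse mask) is the same.

```python
import numpy as np


def pixelate(region: np.ndarray, grid: int) -> np.ndarray:
    """Down-sample an HxWxC crop to grid x grid, then up-sample with nearest neighbour."""
    h, w = region.shape[:2]
    # Sample the centre of each coarse cell (nearest-neighbour down-sample).
    ys = ((np.arange(grid) + 0.5) * h / grid).astype(int).clip(0, h - 1)
    xs = ((np.arange(grid) + 0.5) * w / grid).astype(int).clip(0, w - 1)
    small = region[ys][:, xs]
    # Map every output pixel back to its coarse cell (nearest-neighbour up-sample).
    yi = (np.arange(h) * grid // h).clip(0, grid - 1)
    xi = (np.arange(w) * grid // w).clip(0, grid - 1)
    return small[yi][:, xi]


def ellipse_mask(h: int, w: int) -> np.ndarray:
    """Boolean mask of the ellipse inscribed in an h x w box."""
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ry, rx = h / 2.0, w / 2.0
    y, x = np.ogrid[:h, :w]
    return ((x - cx) ** 2 / rx ** 2 + (y - cy) ** 2 / ry ** 2) <= 1.0


def mosaic_box(frame: np.ndarray, box, grid: int = 10, shape: str = "ellipse") -> np.ndarray:
    """Return a copy of frame with the [x1, y1, x2, y2] box pixelated."""
    x1, y1, x2, y2 = box
    crop = frame[y1:y2, x1:x2]
    blurred = pixelate(crop, grid)
    if shape == "ellipse":
        # Keep the original pixels outside the inscribed ellipse (box corners).
        mask = ellipse_mask(*crop.shape[:2])[..., None]
        blurred = np.where(mask, blurred, crop)
    out = frame.copy()
    out[y1:y2, x1:x2] = blurred
    return out


rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
out = mosaic_box(frame, (10, 10, 50, 50), grid=4)
print(out.shape)  # (64, 64, 3)
```

In this sketch a smaller `grid` gives fewer, larger blocks, since `grid` is the down-sample target rather than a block edge in pixels.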
Throughout the run, optional per-frame metrics (detect / mosaic latency, GPU memory, face counts, mean confidence) are collected into a RunReport, which render_plots turns into PNG charts and a CSV.
┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐ ┌──────────┐
│ decode │ → │ detect │ → │ mosaic │ → │ encode │ → │ mux │
│ (cv2) │ │ (YOLO/MTCNN) │ │ (torch.F) │ │ (cv2) │ │ (ffmpeg) │
│ CPU │ │ device │ │ device │ │ CPU │ │ CPU │
└──────────┘ └──────────────┘ └──────────────┘ └──────────┘ └──────────┘
│ │
▼ ▼
per-frame metrics ──→ RunReport ──→ CSV / JSON / PNG plots
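The mux stage's ffmpeg invocation can be assembled as an argument list. The sketch below uses only the options named in the pipeline description (`libx264`, CRF 20, `medium` preset, `-c:a copy`, the two `-map` selectors, and the 192 kbit/s AAC fallback); the `build_mux_cmd` helper, the `-y` overwrite flag, and the argument order are assumptions, not Blurface's actual command.

```python
from typing import List


def build_mux_cmd(temp_video: str, source: str, output: str,
                  copy_audio: bool = True) -> List[str]:
    """Assemble an ffmpeg command that re-encodes video and carries over source audio."""
    # Stream-copy keeps the audio untouched; AAC 192k is the fallback re-encode.
    audio = ["-c:a", "copy"] if copy_audio else ["-c:a", "aac", "-b:a", "192k"]
    return [
        "ffmpeg", "-y",            # -y (overwrite output) is an assumption
        "-i", temp_video,          # input 0: blurred, video-only file
        "-i", source,              # input 1: original file, used for its audio
        "-map", "0:v:0",           # video from the blurred file
        "-map", "1:a:0?",          # first source audio stream, optional ("?")
        "-c:v", "libx264", "-crf", "20", "-preset", "medium",
        *audio,
        output,
    ]


print(" ".join(build_mux_cmd("tmp.mp4", "input.mp4", "output.mp4")))
```

A caller could run the list with `subprocess.run(cmd, check=True)` and, on `CalledProcessError`, retry once with `copy_audio=False`.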
- `--batch-size` is the single biggest lever once CUDA is enabled. Raise it until you hit your GPU's memory limit.
- `--half` roughly halves the detector's memory footprint on CUDA and is faster on Ampere/Ada/Hopper. It has no effect on CPU or MPS.
- `--imgsz` trades detector accuracy for speed. Default 640 is a good compromise; 1280 helps on tiny faces in 4K footage; 480 is markedly faster on tight latency budgets.
- `--mosaic-size` is not a speed knob (the down-sample target is tiny either way), but it changes the visual effect. 4–8 = strongly recognisable as pixelation; 12–20 = blocky, friendlier on small faces; 30+ = single coloured patch.
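To make the role of `--batch-size` concrete, here is a minimal sketch of how decoded frames could be grouped before each detector call. `batched` is a hypothetical helper written for illustration; the last batch is simply shorter when the frame count is not a multiple of the batch size.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(frames: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[T] = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield batch


print([len(b) for b in batched(range(20), 8)])  # [8, 8, 4]
```

Because only one batch is held at a time, peak memory grows with `batch_size` rather than with the length of the video.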
Blurface ships two interchangeable backends with the same detect(frames_rgb) API.
YOLOv8-face (the default backend) is a single-stage anchor-free detector built on Ultralytics' YOLOv8 backbone, fine-tuned on a face-detection dataset. Why it is the default:
- Single forward pass per frame. Detection is a single conv-net evaluation, so latency stays flat as the number of faces grows. Cascade detectors (MTCNN, Haar, etc.) keep proposing and refining candidates, which inflates per-frame cost on busy scenes.
- Robust to motion blur, profile angles and partial occlusion. The anchor-free head and the deep backbone learn richer face priors than the small classification networks inside MTCNN's P/R/O stages.
- Lower jitter across frames. Because the model is deeper and operates at a single scale per call, box positions are noticeably more stable from frame to frame than MTCNN's, giving smoother mosaics in the output.
- GPU-friendly. Batched inference on CUDA is the design point; FP16 is a one-flag switch.
Weights (yolov8n-face.pt, ~6 MB) are downloaded once from the akanametov/yolo-face release into ~/.cache/blurface/ and reused on subsequent runs. Override with --model-path or --model-url.
MTCNN (the fallback backend) is a three-stage cascade detector (P-Net → R-Net → O-Net) from facenet-pytorch. Useful when:
- you cannot install `ultralytics` (e.g. very old Python, restricted environments),
- you want a second opinion on a hard clip,
- you specifically need MTCNN's facial landmark output (landmarks are computed internally but not exposed by Blurface today),
- you're CPU-only and prefer MTCNN's lighter memory footprint.
Trade-offs: MTCNN is slower per frame on GPU than YOLOv8-face, less robust on motion-blurred or sideways faces, and produces more frame-to-frame jitter. The --min-face-size flag is honoured only by this backend.
Install with pip install "blurface[mtcnn]".
`--backend auto` tries YOLOv8-face first; if its ultralytics import or weight download fails, it falls back to MTCNN. This is the default.
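The try-then-fall-back behaviour of the `auto` backend can be sketched generically. Nothing here is Blurface's `build_detector`; `first_working`, `make_yolo`, and `make_mtcnn` are hypothetical, and the factories are stubs that simulate a missing `ultralytics` install.

```python
from typing import Callable, Sequence, Tuple


def first_working(factories: Sequence[Tuple[str, Callable[[], object]]]):
    """Return (name, detector) from the first factory that does not raise."""
    errors = []
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:  # e.g. ImportError or a failed weight download
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no detection backend available: " + "; ".join(errors))


def make_yolo():
    # Simulate a missing dependency for the preferred backend.
    raise ImportError("ultralytics is not installed")


def make_mtcnn():
    # Stand-in object; a real factory would construct a detector here.
    return "mtcnn-detector"


print(first_working([("yolo", make_yolo), ("mtcnn", make_mtcnn)]))
# prints: ('mtcnn', 'mtcnn-detector')
```

Collecting the per-backend errors keeps the final message actionable when every backend fails.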
blurface-eval runs the full pipeline and writes a complete report directory:
blurface-eval input.mp4 \
--output out/blurred.mp4 \
--report-dir out/report \
--device cuda --half --batch-size 16 \
  --benchmark --benchmark-frames 240
It accepts the same backend / device / mosaic options as blurface, plus --benchmark and --benchmark-frames N, which produce a CPU-vs-GPU bar chart on a short subclip. Run blurface-eval --help for the full reference.
The output directory ends up looking like:
out/report/
├── report.json # full RunReport (incl. per-frame metrics)
├── summary.json # aggregate scorecard
├── per_frame_metrics.csv # one row per processed frame
├── summary.png # text scorecard, ready to share
├── faces_per_frame.png # detections across the timeline
├── latency_per_frame.png # detect vs mosaic vs total latency
├── fps_rolling.png # rolling throughput vs source FPS
├── gpu_memory.png # allocated GPU memory (CUDA only)
├── confidence_histogram.png # distribution of per-frame mean confidence
└── benchmark/ # only with --benchmark
├── cpu_vs_gpu.png
├── cpu_vs_gpu.json
├── benchmark_cpu.mp4
└── benchmark_cuda.mp4
Every run produces, conceptually, three artefacts:
- `report.json`: the full `RunReport` dataclass, covering device, backend, source resolution / FPS, frames processed, processing FPS, total wall time, detect / mosaic / mux time breakdowns, total faces detected, average faces per frame, frames with faces, peak GPU memory, batch size, FP16 flag, mosaic configuration, confidence threshold, and the full per-frame metrics list.
- `per_frame_metrics.csv`: one row per processed frame, with columns `frame_idx, num_faces, mean_confidence, detect_ms, mosaic_ms, total_ms, gpu_mem_mb`.
- PNG plots, each focused on a single question:
- faces_per_frame.png — how many faces were detected across the timeline.
- latency_per_frame.png — detect vs mosaic vs total latency per frame.
- fps_rolling.png — rolling throughput, overlaid with the source FPS line and the run's average processing FPS.
- gpu_memory.png — allocated GPU memory over time (CUDA only).
- confidence_histogram.png — distribution of per-frame mean detection confidences (on frames that had faces).
- summary.png — a monospaced text scorecard you can drop into a slide.
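The per-frame CSV is easy to post-process with the standard library. The sketch below uses the column names listed above; the three sample rows are synthetic values made up for illustration, and `summarise` is a hypothetical helper rather than part of `blurface.evaluate`.

```python
import csv
import io
from statistics import mean


def summarise(rows):
    """Aggregate per-frame rows (dicts keyed by the CSV header) into a few totals."""
    rows = list(rows)
    with_faces = [r for r in rows if int(r["num_faces"]) > 0]
    return {
        "frames": len(rows),
        "frames_with_faces": len(with_faces),
        "total_faces": sum(int(r["num_faces"]) for r in rows),
        "mean_detect_ms": mean(float(r["detect_ms"]) for r in rows),
        "mean_total_ms": mean(float(r["total_ms"]) for r in rows),
    }


# Synthetic sample with the documented header; not real measurements.
SAMPLE = """frame_idx,num_faces,mean_confidence,detect_ms,mosaic_ms,total_ms,gpu_mem_mb
0,2,0.91,4.0,1.0,5.0,512
1,0,0.0,3.0,0.0,3.0,512
2,1,0.84,5.0,1.0,6.0,520
"""

stats = summarise(csv.DictReader(io.StringIO(SAMPLE)))
print(stats["frames"], stats["frames_with_faces"], stats["total_faces"])  # 3 2 3
```

For a real run, open `out/report/per_frame_metrics.csv` with `newline=""` and pass `csv.DictReader(f)` to the same function.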
A standalone helper to inspect and repair your PyTorch install:
# 1. Diagnose only (the default)
blurface-install-gpu
# 2. Reinstall with the right wheels for your CUDA driver
blurface-install-gpu --fix --cuda 12.1
# 3. Very new architectures (RTX 50-series / Blackwell, sm_120)
blurface-install-gpu --fix --nightly --cuda 13.0
# 4. Force the CPU build
blurface-install-gpu --fix --cpu
It reports Python, conda env, platform, PyTorch version + CUDA build, every visible CUDA device (with its compute capability and memory), MPS availability on Apple Silicon, the NVIDIA driver via nvidia-smi, and whether ffmpeg is on PATH. With --fix, it pip uninstalls torch + torchvision and reinstalls them from the appropriate wheel index.
Run as a module too: python -m blurface.install_gpu.
A minimal pytest suite ships with the repo. It builds a tiny synthetic clip and runs the pipeline end-to-end on CPU — no GPU or face dataset required.
pip install pytest
pytest -q
Tests live in tests/test_pipeline.py.
RuntimeError: CUDA requested but no CUDA device is available.
Your installed torch is the CPU build. Repair with the bundled diagnostic:
blurface-install-gpu --fix --cuda 12.1
…or manually:
pip uninstall -y torch torchvision
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
CUDA error: no kernel image is available for execution on the device
Your GPU's compute capability is newer than the CUDA version your PyTorch was built against (typical on RTX 50-series / Blackwell). Use the nightly + CUDA 13 wheels:
blurface-install-gpu --fix --nightly --cuda 13.0
Disabling PyTorch because PyTorch >= 2.4 is required but found 2.2.2
That's a warning emitted by the transformers library when something else in your environment imports it. Blurface's default --backend yolo does not pull transformers in, so the warning is harmless. If you need --backend mtcnn with an old torch, upgrade torch (see above) or pin pip install "transformers<4.40".
ImportError: ultralytics is required for the YOLO backend.
pip install ultralytics — or simply pip install blurface, which already depends on it.
CUDA out of memory. Lower --batch-size, enable --half, or lower --imgsz.
No audio in the output. This should never happen silently in v0.2.0 — if the source has audio and ffmpeg can't preserve it, Blurface raises with install instructions. If you do see a muted output, first check: did the source have an audio track? (Run ffprobe -i your_input.mp4 and look for a Stream #0:1: Audio: line.) If the source genuinely has no audio, the muted output is correct. If the source does have audio and you got a muted output anyway, please file a bug at https://github.com/Ezharjan/blurface/issues.
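The "did the source have an audio track?" check can also be scripted. The sketch below asks `ffprobe` for audio streams as JSON and counts them; the helper names are hypothetical and this is not necessarily how Blurface probes its input. The parsing step is separated out so it can be exercised without `ffprobe` installed.

```python
import json
import subprocess


def ffprobe_audio_cmd(path: str) -> list:
    """ffprobe arguments that list only audio streams, as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "a",
            "-show_entries", "stream=codec_name", "-of", "json", path]


def has_audio(ffprobe_json: str) -> bool:
    """True when the ffprobe JSON output lists at least one audio stream."""
    return len(json.loads(ffprobe_json).get("streams", [])) > 0


def source_has_audio(path: str) -> bool:
    """Run ffprobe on path (requires ffprobe on PATH) and check for audio."""
    result = subprocess.run(ffprobe_audio_cmd(path),
                            capture_output=True, text=True, check=True)
    return has_audio(result.stdout)


print(has_audio('{"streams": [{"codec_name": "aac"}]}'))  # True
print(has_audio('{"streams": []}'))                       # False
```

Running `source_has_audio` on both the input and the output gives a quick before/after comparison when filing a bug.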
macOS MPS warnings about unimplemented ops. Harmless — those ops automatically fall back to CPU.
The downloaded YOLO weights file is corrupted / partial. Delete ~/.cache/blurface/yolov8n-face.pt and let the next run re-download, or pass --model-path to use a known-good copy.
- Audio preservation (bug fix). Previously, three silent-failure paths in the mux step could quietly produce a muted output: the outer wrapper caught any ffmpeg error and copied the audio-less temp file, the ffmpeg-python fallback re-encoded video alone on failure, and even on the happy path the audio was re-encoded to AAC 192k (a quality loss). The mux now:
  - Stream-copies the original audio (`-c:a copy`), preserved bit-for-bit, same codec / bitrate / sample rate as the source. No re-encoding.
  - Probes the source with `ffprobe` to decide whether to expect audio at all.
  - Falls back to AAC 192k only if stream-copy is rejected by the MP4 container.
  - Verifies the output actually contains audio when the source did; raises if not.
  - Raises a clear, actionable error (with install instructions) when ffmpeg is missing and the source has audio, instead of silently dropping the track.
- Packaging: `blurface-install-gpu` now ships inside the installed package, so the console script works after `pip install` (it was broken before). PyPI metadata (`project_urls`, `keywords`, full `classifiers`, MANIFEST, `pyproject.toml`) brought up to standard.
- Pipeline: fixed an aggregation bug where `RunReport.total_faces_detected`, `frames_with_faces`, `detect_time_s`, and `mosaic_time_s` were `0` when `process_video(..., collect_metrics=False)`. They are now tracked independently of the per-frame list.
- Report: new `frames_processed` and `total_faces_detected` fields on `RunReport`; `summary.json` and the PNG scorecard updated to match.
- CLI: richer `--help` output (epilog with worked examples), new `--verbose` flag, more actionable error messages, validated `--confidence` range, cleaner exit codes (`0`/`1`/`2`/`130`).
- `blurface-install-gpu`: lists every visible CUDA device (with compute capability + memory), reports `ffmpeg` presence, gains `--nightly` for new architectures, gains a module form (`python -m blurface.install_gpu`).
- `blurface-eval`: aligned defaults with `blurface` (confidence 0.5, benchmark-frames 240), exposes `--backend`, `--imgsz`, `--half`, `--quiet`.
- Public API: top-level package re-exports `select_device`, `describe_device`, `build_detector`, `YoloFaceDetector`, `MtcnnDetector` alongside the existing `FaceMosaicProcessor`, `RunReport`, `FrameMetric`.
- Docs: README rewritten with explicit pipeline-internals and detection-methods sections.
- Initial public release: GPU PyTorch pipeline, YOLOv8-face + MTCNN backends, FFmpeg audio re-mux, evaluation plots, `blurface` and `blurface-eval` CLIs.
MIT — see LICENSE.
Issues and PRs welcome at https://github.com/Ezharjan/blurface.