Add NV12 output: GPU shader-based RGBA→NV12 conversion (2.8x speedup on Apple Silicon)#25

Closed
djj0s3 wants to merge 89 commits into projectM-visualizer:master from djj0s3:feat/glmemory-output

djj0s3 commented Apr 24, 2026

Summary

When the negotiated downstream format is NV12, projectm now does the RGBA→NV12 conversion on the GPU via two GLSL passes against its existing RGBA FBO, then reads each plane back with `glReadPixels` straight into the GstVideoFrame.

This eliminates the downstream `videoconvert ABGR→NV12` CPU step that dominated render time on Apple Silicon. End-to-end render measured against the same audio (6 min track, 1080p30, vtenc_h264):

```
Before: 363s audio → 842s render (2.3x realtime)
After: 363s audio → 300s render (0.8x realtime)
```

2.8x speedup (842s / 300s). The "x realtime" figures are render time divided by audio duration, so the render now completes in less time than the audio plays.

How it works

Two shader passes after `projectm_opengl_render_frame_fbo`:

  1. Y pass — full-resolution into an R8 texture, BT.601 luma coefficients
  2. UV pass — half-resolution into an RG8 texture; linear-filtered sampling of the source RGBA texture automatically averages 2x2 blocks for 4:2:0 chroma subsampling
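
As a rough CPU reference for what the two passes compute, here is a minimal sketch in C. The luma coefficients are the BT.601 limited-range values quoted in the commit message for this change; the chroma coefficients are the standard BT.601 limited-range values and are an assumption here, as are the names `to_u8` and `rgba_to_nv12_ref` and the tightly packed buffers. It is not the plugin's shader code.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a normalized [0,1] value and round it to an 8-bit sample. */
static uint8_t to_u8(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Hypothetical CPU reference: tightly packed RGBA in, NV12 planes out.
 * y_plane holds width*height bytes, uv_plane holds width*height/2 bytes. */
static void rgba_to_nv12_ref(const uint8_t *rgba, int width, int height,
                             uint8_t *y_plane, uint8_t *uv_plane) {
    assert(width % 2 == 0 && height % 2 == 0);

    /* Y pass: one limited-range luma sample per source pixel. */
    for (int i = 0; i < width * height; i++) {
        float r = rgba[4 * i + 0] / 255.0f;
        float g = rgba[4 * i + 1] / 255.0f;
        float b = rgba[4 * i + 2] / 255.0f;
        y_plane[i] = to_u8(0.257f * r + 0.504f * g + 0.098f * b + 16.0f / 255.0f);
    }

    /* UV pass: average each 2x2 block, then write one interleaved U,V pair. */
    for (int cy = 0; cy < height / 2; cy++) {
        for (int cx = 0; cx < width / 2; cx++) {
            float r = 0.0f, g = 0.0f, b = 0.0f;
            for (int dy = 0; dy < 2; dy++) {
                for (int dx = 0; dx < 2; dx++) {
                    const uint8_t *p =
                        rgba + 4 * ((2 * cy + dy) * width + (2 * cx + dx));
                    r += p[0];
                    g += p[1];
                    b += p[2];
                }
            }
            r /= 4.0f * 255.0f;
            g /= 4.0f * 255.0f;
            b /= 4.0f * 255.0f;
            uint8_t *uv = uv_plane + 2 * (cy * (width / 2) + cx);
            /* Assumed standard BT.601 limited-range chroma coefficients. */
            uv[0] = to_u8(-0.148f * r - 0.291f * g + 0.439f * b + 128.0f / 255.0f);
            uv[1] = to_u8(0.439f * r - 0.368f * g - 0.071f * b + 128.0f / 255.0f);
        }
    }
}
```

A reference like this could be used to spot-check a few decoded pixels against the GPU output; full white should land at Y=235 and black at Y=16, with neutral chroma at 128.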

`ReadPixels` then pulls each plane straight into the GstVideoFrame's NV12 plane data with `GL_PACK_ROW_LENGTH` accounting for stride alignment.
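
The stride handling can be sketched as below. `GL_PACK_ROW_LENGTH` is specified in pixels of the format being read, while GstVideoFrame plane strides are in bytes, so the stride has to be divided by the bytes per pixel of each readback. The helper name `pack_row_length_pixels` is hypothetical, and the assumption that the Y plane is read as one byte per pixel and the UV plane as two bytes per pixel follows from the R8 and RG8 targets described above rather than from the actual code.

```c
#include <assert.h>

/* Convert a plane stride in bytes to a GL_PACK_ROW_LENGTH value in pixels.
 * bytes_per_pixel is 1 for an R8 readback and 2 for an RG8 readback. */
static int pack_row_length_pixels(int stride_bytes, int bytes_per_pixel) {
    assert(bytes_per_pixel > 0);
    assert(stride_bytes % bytes_per_pixel == 0);
    return stride_bytes / bytes_per_pixel;
}
```

For a plane stride of 1920 bytes this gives a row length of 1920 for the Y plane and 960 for the interleaved UV plane.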

Caps

```
video/x-raw, format = { ABGR, NV12 }
```

ABGR path is unchanged for any consumer that doesn't accept NV12. `vtenc_h264` advertises NV12 in its sink caps, so the new path is auto-selected when vtenc is downstream.

Implementation gotchas

  • VAO is mandatory under GL 3.2 core (macOS Cocoa). Without one, every draw fails with `GL_INVALID_OPERATION`. We bake the vertex attribute setup into the VAO once and rebind per-pass.
  • Shader uses `#version 150 core` — works on the GL 3.2 core profile macOS exposes via Tauri/Cocoa. GLES2 not implemented since this targets the local Mac renderer; production GPU pods use a separate `convert_cuda.sh` pipeline.
  • PBO async readback is bypassed in NV12 mode — it was a workaround for the slow CPU conversion, which no longer exists.

Testing

  • Standalone gst-launch with `audiotestsrc → projectm format=NV12 → vtenc_h264 → mp4mux` produces valid 4:2:0 yuv420p H.264 output (verified with ffprobe).
  • Pixel sample of decoded frame shows reasonable color values (not clipped, not all black/white).
  • End-to-end bundled-app render with 6 min audio: visually correct output, 2.8x speedup vs ABGR baseline.

🤖 Generated with Claude Code

djj0s3 and others added 30 commits September 22, 2025 20:30
  control (pass=cbr) was ineffective for ProjectM's highly complex visual content
  2. Switching to quality-based encoding: Using quantizer=35 with CRF instead of fixed bitrate
  3. Adding quality constraints: qp-max=50 to prevent quality from degrading too much
  4. Optimizing for speed: speed-preset=ultrafast for faster encoding
- Log stdout/stderr from convert.sh on both success and failure
- Add environment diagnostics to convert.sh (GPU detection, GStreamer plugin check)
- Add pre-flight checks for file permissions and accessibility
- Improve error visibility in Runpod logs

This should help identify why jobs are failing with exit code 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Update OpenGL version to 4.5 for better compatibility
- Add explicit GStreamer plugin paths and scanner location
- Respect Runpod's NVIDIA_VISIBLE_DEVICES setting (don't override)
- Add LD_LIBRARY_PATH to ensure libraries are found
- Improve NVIDIA driver capabilities configuration

These changes should resolve GPU access and library loading issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Added gstreamer1.0-gl package to Dockerfile dependencies
- Provides glcolorconvert and gldownload elements needed for OpenGL texture conversion
- Resolves "no element glcolorconvert/gldownload" pipeline errors
- Built and pushed as v3 and latest tags

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Clean up stale X lock files before starting Xvfb
- Kill existing Xvfb processes on display 99
- Enable GLX extension in Xvfb for better GL compatibility
- Use GLX platform instead of X11 for software rendering
- Improve gpu_accessible() to test nvidia-smi functionality
- Add sleep to ensure Xvfb is ready before use

These changes resolve the "Server is already active for display 99" error
and improve GPU detection when nvidia-smi works but devices aren't exposed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add explicit video/x-raw(memory:GLMemory),format=RGBA caps
- Ensures proper capability negotiation in headless EGL mode
- Resolves "could not link projectm0 to glcolorconvertelement0" error

The pipeline now explicitly specifies RGBA format at each GL stage:
projectm -> RGBA(GLMemory) -> glcolorconvert -> RGBA(GLMemory) -> gldownload -> RGBA

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- ProjectM plugin only supports ABGR format output
- Removed explicit format=RGBA caps that were causing negotiation failure
- Let glcolorconvert and videoconvert handle format conversion automatically
- Resolves "projectm0 can't handle caps format=(string)RGBA" error

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- ProjectM GL context fails in headless EGL mode on Runpod
- Always use Xvfb for ProjectM rendering (works reliably with X11 GL)
- Detect GPU separately for hardware encoding (nvh264enc)
- Maintains best of both: stable rendering + GPU-accelerated encoding

This resolves the persistent "could not link projectm0 to glcolorconvertelement0"
errors caused by GL context initialization failures in headless EGL mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Calculate display number based on PID: DISPLAY_NUM = 99 + (PID % 100)
- Prevents conflicts when multiple jobs run simultaneously
- Each job gets its own X display (range :99 to :198)
- Removes only the specific lock file for this display

Resolves issues with concurrent jobs interfering with each other's Xvfb instances.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
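
The display-number formula in the commit above can be illustrated with a small C sketch. The real logic presumably lives in a shell script; the function name `display_for_pid` is hypothetical and only the arithmetic is taken from the commit message.

```c
#include <assert.h>

/* Map a process id onto an X display number in the range 99..198. */
static int display_for_pid(long pid) {
    assert(pid >= 0);
    return 99 + (int)(pid % 100);
}
```

Note that two jobs whose PIDs differ by a multiple of 100 map to the same display, so this scheme reduces collisions rather than ruling them out.
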
- Change GST_GL_API from opengl to opengl3
- Resolves GL context creation error with Xvfb/Mesa
- Mesa provides opengl3 API, not legacy opengl
- Fixes: "Cannot create context with user requested api (opengl)"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
djj0s3 and others added 23 commits January 27, 2026 21:50
Method 1: Use Xorg with modesetting driver + glamor acceleration
- Works through DRM/KMS with nvidia-container-runtime
- Uses xorg-nvidia.conf which enables GPU acceleration
- More reliable than Xvfb + NVIDIA GLX which requires server-side support

Method 2: Xvfb + NVIDIA GLX (kept as fallback)
- Only works when NVIDIA GLX server modules are available

Both methods test with glxinfo before proceeding.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… frames

Root cause: projectm_opengl_render_frame() renders to ProjectM's internal
buffer, not our external FBO. This caused all frames to be black.

Fix: Use projectm_opengl_render_frame_fbo(handle, fbo_id) when an FBO is
available. This renders directly to our framebuffer object.

Also improved convert.sh GPU initialization:
- Add GPU environment diagnostics for debugging
- Reject llvmpipe/software rendering (causes black frames with gst-projectm)
- Make Xvfb + NVIDIA GLX the preferred method for Vast.ai
- Remove DRI requirement for Method 2 (GLX works without DRI access)
- Add detailed EGL device enumeration for container environments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Switch base image from ubuntu:24.04 to nvidia/cuda:12.2.0-devel-ubuntu22.04
- Install nv-codec-headers for NVENC/NVDEC support
- Build gst-plugins-bad from source with nvcodec=enabled
- Add libnvidia-encode/decode libraries
- Include 'video' capability in NVIDIA_DRIVER_CAPABILITIES
- Update GST_PLUGIN_PATH to include nvcodec plugin location

This enables hardware H.264 encoding via nvh264enc, which is ~2x faster
than software x264 encoding and offloads work from the CPU to the GPU's
dedicated video encoding hardware (NVENC).

Combined with the mesh optimization (640x480 → 220x140), this should
enable faster-than-realtime rendering for long audio files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use nvidia/cuda:12.2.0-devel-ubuntu22.04 base image
- Install nv-codec-headers for NVENC/NVDEC
- Build nvcodec GStreamer plugin from gstreamer 1.20.7 monorepo
- Add libnvidia-encode/decode libraries
- Include 'video' capability for NVENC access

The nvh264enc plugin enables hardware H.264 encoding, offloading
encoding from CPU to GPU's dedicated NVENC hardware for ~2x faster
video encoding.

Image size: 8.79GB (larger due to CUDA devel libraries)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When running in Docker with -e DISPLAY=:0 -v /tmp/.X11-unix:/tmp/.X11-unix,
the container should use the host's X server instead of starting its own.

This enables:
- NVIDIA GPU rendering via host Xorg with NVIDIA driver
- NVENC hardware encoding (host GPU access)
- Proper FBO rendering (no black frames)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Lambda Labs and other compute-focused cloud instances have CUDA but
not OpenGL by default. This change:

- Attempts to install libnvidia-gl for EGL/GLX support
- Creates /usr/share/glvnd/egl_vendor.d/10_nvidia.json so libglvnd
  can find NVIDIA's EGL implementation

With this, the container can use GPU-accelerated OpenGL rendering
when nvidia-container-toolkit injects the host's NVIDIA libraries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed libnvidia-encode-525 and libnvidia-decode-525 packages.
These caused NVENC to fail with "unsupported device" when the host
runs a different driver version (e.g., 570 vs 525).

Kept libnvidia-gl for ProjectM OpenGL rendering (EGL/GLX).

nvidia-container-toolkit will inject the correct encode/decode
libraries at runtime when NVIDIA_DRIVER_CAPABILITIES=video is set.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The easter-egg property controls a startup logo/feature that shows
the ProjectM W logo. Setting it to 0 disables this.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced projectM's default logo textures with user's custom VJ logo.
Added multiple filename variations to cover all possible projectM
texture references:
- M.tga, m_logo.tga, mlogo.tga
- projectm.tga, project.tga
- headphones.tga
- spiral.tga
- logo.tga
- pM.tga

These will be included in the Docker image and override any default
projectM logos that appear during idle/startup.
- Add vj_studio_logo.png for "Made With VJ Studio" overlay
- Enable faststart=true on mp4mux for better YouTube streaming

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Load first preset immediately on init to avoid showing idle screen
- Add gst_projectm_load_first_timeline_preset() for timeline mode
- Prevent timeline_activate from resetting to index -1 if first preset already loaded
- Add COPY for vj_studio_logo.png in Dockerfile

This fixes the issue where the ProjectM "M" logo would briefly appear
at the start of videos before transitioning to the first real preset.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cropped top portion of logo to remove "Made With" text,
leaving just the VJ character and "STUDIO" for a cleaner
bottom-right watermark appearance.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add elapsed_seconds to timeline switch log message so we can see
  the actual PTS value when each switch occurs
- Add periodic PTS diagnostic (every 600 frames / ~10s) logging both
  audio and video buffer PTS to detect drift between them
- Add render_frame_count to GstProjectMPrivate for frame tracking

This helps diagnose an issue where timeline entries get skipped,
possibly due to video PTS drifting ahead of audio PTS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When CPU encoding is used (x264enc fallback), video PTS runs at
0.5-0.7x of audio time, causing the timeline engine to skip entries.
This resulted in only 90/190 timeline entries being visited for a
53-min DJ set.

Audio PTS advances at the true playback rate regardless of video
encoding speed, ensuring all timeline entries are visited correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tr_array_sort

g_ptr_array_sort() passes each comparison argument as a pointer to the
array slot (GstProjectMTimelineEntry**), not a direct pointer to the
entry. Without the extra dereference, the comparator was interpreting
raw memory addresses as gdouble start_time values, resulting in a
semi-random sort order.

This caused large sections of the timeline to be unreachable — the
fast-path optimization in timeline_find_target_index() would stay stuck
on an early index because the "next" entry in the corrupted sort order
had a much later start_time, making the before_next check always true.

Symptoms: only ~89 of 190 timeline entries visited during a 53-min
DJ set render, with 9-17 minute gaps where the same preset played.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
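
The comparator fix described in the commit above can be sketched with standard C `qsort`, which uses the same convention as `g_ptr_array_sort()` when sorting an array of pointers: each comparator argument points at an array slot, not at the element itself. The `Entry` struct is a hypothetical stand-in for `GstProjectMTimelineEntry`, and `compare_by_start_time` and `sort_entries` are illustrative names, not the plugin's functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for GstProjectMTimelineEntry. */
typedef struct {
    double start_time;
} Entry;

/* Each argument points at an array slot (Entry **), so dereference once
 * to get the entry pointer before reading start_time. */
static int compare_by_start_time(const void *a, const void *b) {
    const Entry *ea = *(const Entry *const *)a;
    const Entry *eb = *(const Entry *const *)b;
    if (ea->start_time < eb->start_time) return -1;
    if (ea->start_time > eb->start_time) return 1;
    return 0;
}

/* Sort an array of entry pointers by ascending start_time. */
static void sort_entries(Entry **entries, size_t count) {
    assert(entries != NULL);
    qsort(entries, count, sizeof *entries, compare_by_start_time);
}
```

Casting the arguments directly to `const Entry *` would compile without complaint but would read the slot's pointer bytes as a `double`, which matches the semi-random ordering described in the commit.
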
Helps verify the sort comparator fix is working by logging
start_time/duration/end_time of the first 20 entries after
g_ptr_array_sort in gst_projectm_load_timeline().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
G_DEFINE_TYPE_WITH_CODE was initializing the debug category as
"gstprojectm" while plugin_init used "projectm". Since the type
init runs AFTER plugin_init (via gst_element_register), it overwrote
the category variable with "gstprojectm" which didn't match the
GST_DEBUG=projectm:4 setting, causing INFO-level diagnostic messages
(PTS tracking, sort order verification) to be suppressed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use GST_WARNING_OBJECT instead of GST_INFO_OBJECT for timeline
diagnostics so they appear regardless of debug category threshold.
Includes "build v62" marker to verify correct binary is running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The build v62 WARNING-level markers were temporary debugging aids to verify
the timeline sort fix on RunPod. Now confirmed working (190/190 entries
visited), downgrade back to INFO level for production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… path detection

- build.sh: Check ~/.local and /opt/homebrew for ProjectM 4 headers before /usr/local;
  add PKG_CONFIG_PATH for Apple Silicon
- setup.sh: Add gst-plugins-base, gst-plugins-good, gst-plugins-bad, ffmpeg to brew list
- convert.sh: Add is_macos() helper; detect macOS and use native CGL/Cocoa OpenGL
  (skip X11/VirtualGL entirely); add vtenc_h264 encoder selection and pipeline;
  auto-detect preset paths for macOS (~/.local, /opt/homebrew); fix stat command
  portability for output monitoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… g_error)

On macOS, some community presets trigger transient FBO errors during shader
compilation. These errors are recoverable — ProjectM continues rendering
on the next frame. Previously, g_error() called abort(), crashing the
entire pipeline. Now it logs a warning and continues.

This fixes the crash when using preset= property on macOS with the full
10k+ community preset library.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When negotiated downstream format is NV12, projectm now does the
RGBA→NV12 conversion on the GPU via two GLSL passes against its
existing RGBA FBO:

  1. Y pass:  full-resolution into an R8 texture using BT.601 luma
              coefficients (0.257*R + 0.504*G + 0.098*B + 16/255)
  2. UV pass: half-resolution into an RG8 texture, linear-filtered
              sampling automatically averages 2x2 blocks for 4:2:0
              chroma subsampling

ReadPixels then pulls each plane straight into the GstVideoFrame's
NV12 plane data — no intermediate buffer copies.

This eliminates the downstream `videoconvert ABGR→NV12` CPU step
that was the dominant cost on Apple Silicon. End-to-end render
went from 2.3x realtime to 0.8x realtime (363s audio: 842s → 300s)
on the same hardware/preset/audio.

Implementation notes:

- Caps: `video/x-raw, format = { ABGR, NV12 }` — caps negotiation
  picks NV12 when downstream advertises it (e.g. vtenc_h264).
- ABGR path remains unchanged for non-NV12 consumers.
- VAO is required by GL 3.2 core (macOS Cocoa context); without one
  every draw fails with GL_INVALID_OPERATION.
- Shader uses `#version 150 core` — compatible with the GL 3.2 core
  profile macOS exposes. GLES2 path not implemented since this
  optimization targets the local Mac renderer; production GPU pods
  use a different convert_cuda.sh pipeline.
- PBO async readback is bypassed in NV12 mode — it was a workaround
  for the slow CPU conversion, which no longer exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
djj0s3 and others added 2 commits April 24, 2026 09:14
From PR #3 review (P1):
- nv12_render now restores projectm's FBO id (not framebuffer 0) so
  headless EGL contexts don't raise GL_INVALID_OPERATION. Matches the
  rule the existing ABGR path already follows.
- nv12_init cleans up allocated textures + FBOs on the completeness-
  check early-return paths (previously leaked if R8/RG8 unsupported).
- GL_PACK_ALIGNMENT now reset to default (4) after ReadPixels so later
  code in the same context gets clean state back.
- GL_PACK_ROW_LENGTH comments clarified: values are in pixels, not bytes.
  Y plane (R8) happens to have identical byte/pixel values but is now
  computed explicitly so the unit is obvious.

P2:
- BT.601 studio-swing (limited-range) coefficients documented as
  intentional; pc-range would desync vtenc_h264's color_range=tv output.
- Removed per-frame TexParameteri mutations on the source FBO texture.
  The 4:2:0 subsampling comes from rendering source-sized quad into a
  half-sized viewport — hardware bilinear on the source texture
  (GL_LINEAR set once at FBO creation) averages 2x2 blocks naturally.
- Dropped the inverted early-exit in nv12_release; per-resource null
  guards already handle partial-init cleanup.

Smoke test post-fixes: same synthetic audiotestsrc → NV12 → vtenc_h264
pipeline still produces valid 4:2:0 H.264 output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per re-review: both nv12_init and nv12_render had `else { bind 0 }`
fallback branches after the `priv->fbo_id != 0` check. nv12_render is
only called from the main frame handler when nv12_mode is on AND
gst_projectm_ensure_render_target returned using_fbo=TRUE — which
guarantees fbo_id is non-zero. The else-branches were dead code that
would, in their one reachable case, do exactly what the comments warn
against: bind framebuffer 0 in headless EGL where it doesn't exist.

Removed the fallback, made the bind unconditional. Comments updated
to explain why no fallback is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

revmischa commented Apr 28, 2026

Member

please don't commit every preset here

djj0s3 commented Apr 28, 2026

Author

Oh sorry! I thought I had forked this work out. Oof, I'll fix. Sorry for the unnecessary noise.

djj0s3 commented Apr 28, 2026

Author

Closing — this PR was opened against the wrong repo by mistake. The change lives in our private fork (djj0s3/gst-projectm) and isn't intended for upstream. Apologies for the noise.

djj0s3 closed this Apr 28, 2026