Add NV12 output: GPU shader-based RGBA→NV12 conversion (2.8x speedup on Apple Silicon) #25
Closed
djj0s3 wants to merge 89 commits into
control (pass=cbr) was ineffective for ProjectM's highly complex visual content
2. Switching to quality-based encoding: using quantizer=35 with CRF instead of a fixed bitrate
3. Adding quality constraints: qp-max=50 to prevent quality from degrading too much
4. Optimizing for speed: speed-preset=ultrafast for faster encoding
- Log stdout/stderr from convert.sh on both success and failure
- Add environment diagnostics to convert.sh (GPU detection, GStreamer plugin check)
- Add pre-flight checks for file permissions and accessibility
- Improve error visibility in Runpod logs
This should help identify why jobs are failing with exit code 1.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Update OpenGL version to 4.5 for better compatibility
- Add explicit GStreamer plugin paths and scanner location
- Respect Runpod's NVIDIA_VISIBLE_DEVICES setting (don't override)
- Add LD_LIBRARY_PATH to ensure libraries are found
- Improve NVIDIA driver capabilities configuration
These changes should resolve GPU access and library loading issues.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Added gstreamer1.0-gl package to Dockerfile dependencies
- Provides glcolorconvert and gldownload elements needed for OpenGL texture conversion
- Resolves "no element glcolorconvert/gldownload" pipeline errors
- Built and pushed as v3 and latest tags
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Clean up stale X lock files before starting Xvfb
- Kill existing Xvfb processes on display 99
- Enable GLX extension in Xvfb for better GL compatibility
- Use GLX platform instead of X11 for software rendering
- Improve gpu_accessible() to test nvidia-smi functionality
- Add sleep to ensure Xvfb is ready before use
These changes resolve the "Server is already active for display 99" error and improve GPU detection when nvidia-smi works but devices aren't exposed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add explicit video/x-raw(memory:GLMemory),format=RGBA caps
- Ensures proper capability negotiation in headless EGL mode
- Resolves "could not link projectm0 to glcolorconvertelement0" error
The pipeline now explicitly specifies RGBA format at each GL stage:
projectm -> RGBA(GLMemory) -> glcolorconvert -> RGBA(GLMemory) -> gldownload -> RGBA
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- ProjectM plugin only supports ABGR format output
- Removed explicit format=RGBA caps that were causing negotiation failure
- Let glcolorconvert and videoconvert handle format conversion automatically
- Resolves "projectm0 can't handle caps format=(string)RGBA" error
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- ProjectM GL context fails in headless EGL mode on Runpod
- Always use Xvfb for ProjectM rendering (works reliably with X11 GL)
- Detect GPU separately for hardware encoding (nvh264enc)
- Maintains best of both: stable rendering + GPU-accelerated encoding
This resolves the persistent "could not link projectm0 to glcolorconvertelement0" errors caused by GL context initialization failures in headless EGL mode.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Calculate display number based on PID: DISPLAY_NUM = 99 + (PID % 100)
- Prevents conflicts when multiple jobs run simultaneously
- Each job gets its own X display (range :99 to :198)
- Removes only the specific lock file for this display
Resolves issues with concurrent jobs interfering with each other's Xvfb instances.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change GST_GL_API from opengl to opengl3
- Resolves GL context creation error with Xvfb/Mesa
- Mesa provides opengl3 API, not legacy opengl
- Fixes: "Cannot create context with user requested api (opengl)"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Method 1: Use Xorg with modesetting driver + glamor acceleration
- Works through DRM/KMS with nvidia-container-runtime
- Uses xorg-nvidia.conf, which enables GPU acceleration
- More reliable than Xvfb + NVIDIA GLX, which requires server-side support
Method 2: Xvfb + NVIDIA GLX (kept as fallback)
- Only works when NVIDIA GLX server modules are available
Both methods test with glxinfo before proceeding.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… frames
Root cause: projectm_opengl_render_frame() renders to ProjectM's internal buffer, not our external FBO. This caused all frames to be black.
Fix: Use projectm_opengl_render_frame_fbo(handle, fbo_id) when an FBO is available. This renders directly to our framebuffer object.
Also improved convert.sh GPU initialization:
- Add GPU environment diagnostics for debugging
- Reject llvmpipe/software rendering (causes black frames with gst-projectm)
- Make Xvfb + NVIDIA GLX the preferred method for Vast.ai
- Remove DRI requirement for Method 2 (GLX works without DRI access)
- Add detailed EGL device enumeration for container environments
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Changes:
- Switch base image from ubuntu:24.04 to nvidia/cuda:12.2.0-devel-ubuntu22.04
- Install nv-codec-headers for NVENC/NVDEC support
- Build gst-plugins-bad from source with nvcodec=enabled
- Add libnvidia-encode/decode libraries
- Include 'video' capability in NVIDIA_DRIVER_CAPABILITIES
- Update GST_PLUGIN_PATH to include nvcodec plugin location
This enables hardware H.264 encoding via nvh264enc, which is ~2x faster than software x264 encoding and offloads work from the CPU to the GPU's dedicated video encoding hardware (NVENC).
Combined with the mesh optimization (640x480 → 220x140), this should enable faster-than-realtime rendering for long audio files.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use nvidia/cuda:12.2.0-devel-ubuntu22.04 base image
- Install nv-codec-headers for NVENC/NVDEC
- Build nvcodec GStreamer plugin from gstreamer 1.20.7 monorepo
- Add libnvidia-encode/decode libraries
- Include 'video' capability for NVENC access
The nvh264enc plugin enables hardware H.264 encoding, offloading encoding from the CPU to the GPU's dedicated NVENC hardware for ~2x faster video encoding.
Image size: 8.79GB (larger due to CUDA devel libraries)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When running in Docker with -e DISPLAY=:0 -v /tmp/.X11-unix:/tmp/.X11-unix, the container should use the host's X server instead of starting its own. This enables:
- NVIDIA GPU rendering via host Xorg with NVIDIA driver
- NVENC hardware encoding (host GPU access)
- Proper FBO rendering (no black frames)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Lambda Labs and other compute-focused cloud instances have CUDA but not OpenGL by default. This change:
- Attempts to install libnvidia-gl for EGL/GLX support
- Creates /usr/share/glvnd/egl_vendor.d/10_nvidia.json so libglvnd can find NVIDIA's EGL implementation
With this, the container can use GPU-accelerated OpenGL rendering when nvidia-container-toolkit injects the host's NVIDIA libraries.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed libnvidia-encode-525 and libnvidia-decode-525 packages. These caused NVENC to fail with "unsupported device" when the host runs a different driver version (e.g., 570 vs 525). Kept libnvidia-gl for ProjectM OpenGL rendering (EGL/GLX). nvidia-container-toolkit will inject the correct encode/decode libraries at runtime when NVIDIA_DRIVER_CAPABILITIES=video is set. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The easter-egg property controls a startup logo/feature that shows the ProjectM W logo. Setting it to 0 disables this. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaced projectM's default logo textures with the user's custom VJ logo.
Added multiple filename variations to cover all possible projectM texture references:
- M.tga, m_logo.tga, mlogo.tga
- projectm.tga, project.tga
- headphones.tga
- spiral.tga
- logo.tga
- pM.tga
These will be included in the Docker image and override any default projectM logos that appear during idle/startup.
- Add vj_studio_logo.png for "Made With VJ Studio" overlay
- Enable faststart=true on mp4mux for better YouTube streaming
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Load first preset immediately on init to avoid showing idle screen
- Add gst_projectm_load_first_timeline_preset() for timeline mode
- Prevent timeline_activate from resetting to index -1 if first preset already loaded
- Add COPY for vj_studio_logo.png in Dockerfile
This fixes the issue where the ProjectM "M" logo would briefly appear at the start of videos before transitioning to the first real preset.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cropped top portion of logo to remove "Made With" text, leaving just the VJ character and "STUDIO" for a cleaner bottom-right watermark appearance. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add elapsed_seconds to timeline switch log message so we can see the actual PTS value when each switch occurs
- Add periodic PTS diagnostic (every 600 frames / ~10s) logging both audio and video buffer PTS to detect drift between them
- Add render_frame_count to GstProjectMPrivate for frame tracking
This helps diagnose an issue where timeline entries get skipped, possibly due to video PTS drifting ahead of audio PTS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When CPU encoding is used (x264enc fallback), video PTS runs at 0.5-0.7x of audio time, causing the timeline engine to skip entries. This resulted in only 90/190 timeline entries being visited for a 53-min DJ set. Audio PTS advances at the true playback rate regardless of video encoding speed, ensuring all timeline entries are visited correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tr_array_sort
g_ptr_array_sort() passes each comparison argument as a pointer to the array slot (GstProjectMTimelineEntry**), not a direct pointer to the entry. Without the extra dereference, the comparator was interpreting raw memory addresses as gdouble start_time values, resulting in a semi-random sort order.
This caused large sections of the timeline to be unreachable: the fast-path optimization in timeline_find_target_index() would stay stuck on an early index because the "next" entry in the corrupted sort order had a much later start_time, making the before_next check always true.
Symptoms: only ~89 of 190 timeline entries visited during a 53-min DJ set render, with 9-17 minute gaps where the same preset played.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
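A minimal sketch of the comparator fix and of the kind of fast-path lookup described above. It uses the C standard library's qsort(), which, like g_ptr_array_sort(), hands the comparator pointers to the array slots, so the same double dereference applies and the sketch compiles without GLib. TimelineEntry, sort_timeline() and find_target_index() are hypothetical stand-ins, not the plugin's actual types or functions.

```c
#include <stdlib.h>

/* Hypothetical stand-in for GstProjectMTimelineEntry (fields assumed). */
typedef struct {
    double start_time;
    double duration;
} TimelineEntry;

/* Each argument points at an array slot (TimelineEntry **), so it must be
 * dereferenced once to reach the entry. The buggy form cast the slot
 * address directly to (const TimelineEntry *) and read garbage. */
static int compare_by_start_time(const void *a, const void *b)
{
    const TimelineEntry *ea = *(const TimelineEntry *const *)a;
    const TimelineEntry *eb = *(const TimelineEntry *const *)b;

    if (ea->start_time < eb->start_time)
        return -1;
    if (ea->start_time > eb->start_time)
        return 1;
    return 0;
}

/* Sorts an array of entry pointers by ascending start_time. */
void sort_timeline(TimelineEntry **entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], compare_by_start_time);
}

/* Hypothetical lookup: returns the index of the last entry whose
 * start_time <= t, or -1 if t precedes every entry. The fast path keeps
 * the current index while t is still before the next entry's start; with
 * a mis-sorted array that "next" start can be far in the future, which is
 * how the lookup got stuck. */
long find_target_index(TimelineEntry *const *entries, size_t n,
                       long current, double t)
{
    if (current >= 0 && (size_t)current < n &&
        entries[current]->start_time <= t) {
        int before_next = (size_t)current + 1 >= n ||
                          t < entries[current + 1]->start_time;
        if (before_next)
            return current;
    }

    /* Slow path: linear scan of the (correctly) sorted array. */
    long idx = -1;
    for (size_t i = 0; i < n && entries[i]->start_time <= t; i++)
        idx = (long)i;
    return idx;
}
```

The lookup only works if the array really is sorted by start_time, which is why a broken comparator showed up as skipped timeline entries rather than as a crash.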
Helps verify the sort comparator fix is working by logging start_time/duration/end_time of the first 20 entries after g_ptr_array_sort in gst_projectm_load_timeline(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
G_DEFINE_TYPE_WITH_CODE was initializing the debug category as "gstprojectm" while plugin_init used "projectm". Since the type init runs AFTER plugin_init (via gst_element_register), it overwrote the category variable with "gstprojectm" which didn't match the GST_DEBUG=projectm:4 setting, causing INFO-level diagnostic messages (PTS tracking, sort order verification) to be suppressed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use GST_WARNING_OBJECT instead of GST_INFO_OBJECT for timeline diagnostics so they appear regardless of debug category threshold. Includes "build v62" marker to verify correct binary is running. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The build v62 WARNING-level markers were temporary debugging aids to verify the timeline sort fix on RunPod. Now confirmed working (190/190 entries visited), downgrade back to INFO level for production. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… path detection
- build.sh: Check ~/.local and /opt/homebrew for ProjectM 4 headers before /usr/local; add PKG_CONFIG_PATH for Apple Silicon
- setup.sh: Add gst-plugins-base, gst-plugins-good, gst-plugins-bad, ffmpeg to brew list
- convert.sh: Add is_macos() helper; detect macOS and use native CGL/Cocoa OpenGL (skip X11/VirtualGL entirely); add vtenc_h264 encoder selection and pipeline; auto-detect preset paths for macOS (~/.local, /opt/homebrew); fix stat command portability for output monitoring
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… g_error)
On macOS, some community presets trigger transient FBO errors during shader compilation. These errors are recoverable: ProjectM continues rendering on the next frame. Previously, g_error() called abort(), crashing the entire pipeline. Now it logs a warning and continues.
This fixes the crash when using the preset= property on macOS with the full 10k+ community preset library.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When negotiated downstream format is NV12, projectm now does the
RGBA→NV12 conversion on the GPU via two GLSL passes against its
existing RGBA FBO:
1. Y pass: full-resolution into an R8 texture using BT.601 luma
coefficients (0.257*R + 0.504*G + 0.098*B + 16/255)
2. UV pass: half-resolution into an RG8 texture, linear-filtered
sampling automatically averages 2x2 blocks for 4:2:0
chroma subsampling
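As a CPU reference for checking GPU output, here is a sketch of the per-pixel math the two passes perform. The luma coefficients are the ones quoted above. The chroma coefficients are the standard BT.601 limited-range values and are an assumption here, since the commit only lists the luma formula. The explicit 2x2 average stands in for what bilinear filtering does on the GPU when a sample lands exactly between four texels. Function names are hypothetical.

```c
#include <stdint.h>

/* Scales a normalized [0, 1] value to 8 bits with round-to-nearest and
 * clamping. */
static uint8_t to_u8(float v)
{
    float scaled = v * 255.0f + 0.5f;
    if (scaled < 0.0f)
        scaled = 0.0f;
    if (scaled > 255.0f)
        scaled = 255.0f;
    return (uint8_t)scaled;
}

/* Y pass: one limited-range luma sample per source pixel (16..235). */
uint8_t bt601_luma(float r, float g, float b)
{
    return to_u8(0.257f * r + 0.504f * g + 0.098f * b + 16.0f / 255.0f);
}

/* UV pass: one Cb/Cr pair per 2x2 block of source pixels (each pixel is
 * three normalized floats). Averaging RGB first and then converting gives
 * the same result as averaging chroma, because the transform is affine. */
void bt601_chroma_2x2(const float rgb[4][3], uint8_t *cb, uint8_t *cr)
{
    float r = 0.0f, g = 0.0f, b = 0.0f;
    for (int i = 0; i < 4; i++) {
        r += rgb[i][0];
        g += rgb[i][1];
        b += rgb[i][2];
    }
    r *= 0.25f;
    g *= 0.25f;
    b *= 0.25f;

    /* NV12 stores Cb first, then Cr, interleaved in plane 1. */
    *cb = to_u8(-0.148f * r - 0.291f * g + 0.439f * b + 128.0f / 255.0f);
    *cr = to_u8(0.439f * r - 0.368f * g - 0.071f * b + 128.0f / 255.0f);
}
```

Black maps to Y=16 and white to Y=235 with neutral chroma (128, 128), which is the studio-swing range that vtenc_h264's color_range=tv output expects.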
ReadPixels then pulls each plane straight into the GstVideoFrame's
NV12 plane data — no intermediate buffer copies.
This eliminates the downstream `videoconvert ABGR→NV12` CPU step
that was the dominant cost on Apple Silicon. End-to-end render time
went from 2.3x to 0.8x the audio duration (363s audio: 842s → 300s
render) on the same hardware/preset/audio.
Implementation notes:
- Caps: `video/x-raw, format = { ABGR, NV12 }` — caps negotiation
picks NV12 when downstream advertises it (e.g. vtenc_h264).
- ABGR path remains unchanged for non-NV12 consumers.
- A VAO is required by the GL 3.2 core profile (macOS Cocoa context);
without one, every draw fails with GL_INVALID_OPERATION.
- Shader uses `#version 150 core` — compatible with the GL 3.2 core
profile macOS exposes. GLES2 path not implemented since this
optimization targets the local Mac renderer; production GPU pods
use a different convert_cuda.sh pipeline.
- PBO async readback is bypassed in NV12 mode — it was a workaround
for the slow CPU conversion, which no longer exists.
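For illustration only, a sketch of what `#version 150 core` shader sources for the two passes could look like, embedded as C strings. The names (u_rgba, v_uv, the nv12_*_src variables and the helper) and the attribute-less full-screen triangle driven by gl_VertexID are assumptions, not the PR's code. Even an attribute-less draw needs a VAO bound in a core profile, which is the requirement the VAO note describes. The same vertex shader would serve both passes; for the UV pass the viewport is half-size, so each fragment samples the source exactly between four texels.

```c
#include <string.h>

/* One oversized triangle covering the viewport, no vertex buffers. Bind a
 * VAO, then glDrawArrays(GL_TRIANGLES, 0, 3). */
static const char *const nv12_vertex_src =
    "#version 150 core\n"
    "out vec2 v_uv;\n"
    "void main() {\n"
    "    vec2 p = vec2((gl_VertexID << 1) & 2, gl_VertexID & 2);\n"
    "    v_uv = p;\n"
    "    gl_Position = vec4(p * 2.0 - 1.0, 0.0, 1.0);\n"
    "}\n";

/* Y pass into an R8 target: limited-range BT.601 luma. */
static const char *const nv12_y_frag_src =
    "#version 150 core\n"
    "uniform sampler2D u_rgba;\n"
    "in vec2 v_uv;\n"
    "out vec4 frag;\n"
    "void main() {\n"
    "    vec3 c = texture(u_rgba, v_uv).rgb;\n"
    "    float y = dot(c, vec3(0.257, 0.504, 0.098)) + 16.0 / 255.0;\n"
    "    frag = vec4(y, 0.0, 0.0, 1.0);\n"
    "}\n";

/* UV pass into an RG8 target: R = Cb, G = Cr (NV12 plane order). */
static const char *const nv12_uv_frag_src =
    "#version 150 core\n"
    "uniform sampler2D u_rgba;\n"
    "in vec2 v_uv;\n"
    "out vec4 frag;\n"
    "void main() {\n"
    "    vec3 c = texture(u_rgba, v_uv).rgb;\n"
    "    float u = dot(c, vec3(-0.148, -0.291, 0.439)) + 128.0 / 255.0;\n"
    "    float v = dot(c, vec3(0.439, -0.368, -0.071)) + 128.0 / 255.0;\n"
    "    frag = vec4(u, v, 0.0, 1.0);\n"
    "}\n";

/* Returns 1 if every stage starts with the GLSL 1.50 core directive. */
int nv12_sources_declare_glsl150(void)
{
    const char *const prefix = "#version 150 core\n";
    const char *const srcs[] = {nv12_vertex_src, nv12_y_frag_src,
                                nv12_uv_frag_src};
    for (size_t i = 0; i < sizeof srcs / sizeof srcs[0]; i++) {
        if (strncmp(srcs[i], prefix, strlen(prefix)) != 0)
            return 0;
    }
    return 1;
}
```

An attribute-less triangle avoids managing a vertex buffer just to cover the screen. A VBO-backed quad would work equally well, as long as a VAO is bound.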
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
From PR #3 review (P1):
- nv12_render now restores projectm's FBO id (not framebuffer 0) so headless EGL contexts don't raise GL_INVALID_OPERATION. Matches the rule the existing ABGR path already follows.
- nv12_init cleans up allocated textures + FBOs on the completeness-check early-return paths (previously leaked if R8/RG8 unsupported).
- GL_PACK_ALIGNMENT now reset to default (4) after ReadPixels so later code in the same context gets clean state back.
- GL_PACK_ROW_LENGTH comments clarified: values are in pixels, not bytes. Y plane (R8) happens to have identical byte/pixel values but is now computed explicitly so the unit is obvious.
P2:
- BT.601 studio-swing (limited-range) coefficients documented as intentional; pc-range would desync vtenc_h264's color_range=tv output.
- Removed per-frame TexParameteri mutations on the source FBO texture. The 4:2:0 subsampling comes from rendering a source-sized quad into a half-sized viewport: hardware bilinear on the source texture (GL_LINEAR set once at FBO creation) averages 2x2 blocks naturally.
- Dropped the inverted early-exit in nv12_release; per-resource null guards already handle partial-init cleanup.
Smoke test post-fixes: the same synthetic audiotestsrc → NV12 → vtenc_h264 pipeline still produces valid 4:2:0 H.264 output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
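Because GL_PACK_ROW_LENGTH is counted in pixels of the readback format, each NV12 plane's byte stride has to be divided by that format's bytes per pixel: 1 for the R8 Y plane, 2 for the RG8 interleaved CbCr plane. A minimal sketch with hypothetical names:

```c
#include <stddef.h>

/* Bytes per pixel of the two NV12 readback formats (R8 and RG8). */
enum { NV12_Y_BPP = 1, NV12_UV_BPP = 2 };

/* Converts a plane's byte stride into a GL_PACK_ROW_LENGTH value (pixels).
 * Returns 0 if the stride is not a whole number of pixels, which a plain
 * row-length readback cannot express. */
int nv12_pack_row_length(size_t stride_bytes, size_t bytes_per_pixel)
{
    if (bytes_per_pixel == 0 || stride_bytes % bytes_per_pixel != 0)
        return 0;
    return (int)(stride_bytes / bytes_per_pixel);
}
```

In a render path this value would be passed to glPixelStorei(GL_PACK_ROW_LENGTH, ...) before reading each plane, using the stride from GST_VIDEO_FRAME_PLANE_STRIDE(frame, plane), and reset afterwards along with GL_PACK_ALIGNMENT. For a 1920-byte stride, the Y plane row length is 1920 pixels and the UV plane row length is 960.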
Per re-review: both nv12_init and nv12_render had `else { bind 0 }`
fallback branches after the `priv->fbo_id != 0` check. nv12_render is
only called from the main frame handler when nv12_mode is on AND
gst_projectm_ensure_render_target returned using_fbo=TRUE — which
guarantees fbo_id is non-zero. The else-branches were dead code that
would, in their one reachable case, do exactly what the comments warn
against: bind framebuffer 0 in headless EGL where it doesn't exist.
Removed the fallback, made the bind unconditional. Comments updated
to explain why no fallback is needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
please don't commit every preset here
Author
Oh sorry! I thought I had forked this work out. Oof, I'll fix. Sorry for the unnecessary noise.
Author
Closing: this PR was opened against the wrong repo by mistake. The change lives in our private fork (djj0s3/gst-projectm) and isn't intended for upstream. Apologies for the noise.
Summary
When the negotiated downstream format is NV12, projectm now does the RGBA→NV12 conversion on the GPU via two GLSL passes against its existing RGBA FBO, then ReadPixels each plane straight into the GstVideoFrame.
This eliminates the downstream `videoconvert ABGR→NV12` CPU step that dominated render time on Apple Silicon. End-to-end render measured against the same audio (6 min track, 1080p30, vtenc_h264):
```
Before: 363s audio → 842s render (2.3x realtime)
After: 363s audio → 300s render (0.8x realtime)
```
2.8x speedup. Render now completes in less time than the audio plays.
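The numbers above are ratios of wall-clock render time to audio duration, so values below 1.0 mean faster than realtime. A trivial sketch of the arithmetic (helper names are hypothetical):

```c
/* Seconds of rendering per second of audio; < 1.0 is faster than realtime. */
double realtime_factor(double render_seconds, double audio_seconds)
{
    return render_seconds / audio_seconds;
}

/* Speedup between two renders of the same audio. */
double speedup(double before_render_seconds, double after_render_seconds)
{
    return before_render_seconds / after_render_seconds;
}
```

With the measured values, 842 / 363 ≈ 2.32 before, 300 / 363 ≈ 0.83 after, and 842 / 300 ≈ 2.81 overall speedup.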
How it works
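Two shader passes after `projectm_opengl_render_frame_fbo`:
1. Y pass: full-resolution into an R8 texture using BT.601 luma coefficients (0.257*R + 0.504*G + 0.098*B + 16/255).
2. UV pass: half-resolution into an RG8 texture; linear-filtered sampling averages each 2x2 block for 4:2:0 chroma subsampling.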
`ReadPixels` then pulls each plane straight into the GstVideoFrame's NV12 plane data with `GL_PACK_ROW_LENGTH` accounting for stride alignment.
Caps
```
video/x-raw, format = { ABGR, NV12 }
```
ABGR path is unchanged for any consumer that doesn't accept NV12. `vtenc_h264` advertises NV12 in its sink caps, so the new path is auto-selected when vtenc is downstream.
Implementation gotchas
Testing
🤖 Generated with Claude Code