feat: Add depth estimation support (CoreML Depth Anything Small) #6

@ebowwa

Feature: Depth Estimation Support

Add support for CoreML Depth Anything Small - a depth estimation model optimized for Apple Silicon.

What It Does

Input: Single image
Output: Per-pixel relative depth map (viewable as a grayscale heatmap where brightness encodes depth)

Model Details:

  • Architecture: DPT (Dense Prediction Transformer) with DINOv2 backbone
  • Size: 24.8M parameters
  • F16 variant: 45.8MB, ~25-34ms inference on Apple Silicon

Why Add This?

1. 3D Scene Understanding

  • Know how far away objects are relative to one another (not just where they sit in the frame)
  • Critical for AR/VR applications
  • Enables proper 3D positioning

2. Better Meta Glasses Integration

  • Current: 2D bounding boxes on flat video
  • With depth: True 3D spatial understanding
  • Objects can be placed correctly in 3D space

3. SAM2 + Depth = Full 3D

  • SAM2: Object segmentation masks
  • Depth: Z-coordinate for each pixel
  • Combined: Complete 3D model of scene
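One way the SAM2 + depth combination could work is to reduce the depth values under a segmentation mask to a single per-object depth. The sketch below assumes the mask and the depth map have already been resampled to the same resolution and flattened to row-major arrays; `meanDepth(of:under:)` is a hypothetical helper, not part of SAM2 or of this project.

```swift
import Foundation

/// Mean depth over the pixels selected by a segmentation mask.
/// Returns nil when the arrays differ in length or the mask selects nothing.
/// Assumption: `depth` and `mask` are row-major and share one resolution.
func meanDepth(of depth: [Float], under mask: [Bool]) -> Float? {
    guard depth.count == mask.count else { return nil }
    var sum: Float = 0
    var count = 0
    for (value, selected) in zip(depth, mask) where selected {
        sum += value
        count += 1
    }
    return count > 0 ? sum / Float(count) : nil
}
```

A median would be more robust to mask bleed at object edges; the mean is used here only to keep the sketch short.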

4. Pre-processing for Training

  • Depth-aware image augmentation
  • Generate synthetic training data
  • Better depth-aware model training

Technical Requirements

Input Handling:

  • Load images (already supported)
  • Pass to the Core ML model (images must be wrapped in an MLFeatureValue)

Output Processing:

  • Parse multi-array depth map output
  • Convert to UIImage/CGImage for display
  • Apply colormap (grayscale or rainbow heatmap)
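The grayscale step above might look roughly like the following. It is a minimal sketch that assumes the depth values have already been copied out of the model's multi-array output into a flat `[Float]` and that all values are finite. `GrayscaleDepth` and `normalizeDepth` are hypothetical names, and wrapping the bytes in a CGImage is left out.

```swift
import Foundation

/// 8-bit grayscale pixels plus the raw depth range they were scaled from.
struct GrayscaleDepth {
    let pixels: [UInt8]   // one byte per pixel, same order as the input
    let minDepth: Float
    let maxDepth: Float
}

/// Min-max normalizes raw depth values into 0...255.
/// Returns nil for empty input; a constant map yields all zeros.
/// Assumption: every value is finite (no NaN or infinity).
func normalizeDepth(_ values: [Float]) -> GrayscaleDepth? {
    guard let lo = values.min(), let hi = values.max() else { return nil }
    let range = hi - lo
    let pixels = values.map { (value: Float) -> UInt8 in
        guard range > 0 else { return 0 }
        let t = (value - lo) / range          // scaled to 0...1
        return UInt8((t * 255).rounded())
    }
    return GrayscaleDepth(pixels: pixels, minDepth: lo, maxDepth: hi)
}
```

The returned `minDepth` and `maxDepth` are the same quantities the proposed `DetectedObject` fields would carry, so the raw range is still available after the map has been quantized for display.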

Visualization:

  • Overlay depth heatmap on video/image
  • Add depth slider for filtering
  • Show depth value on hover/click
  • Color map options (viridis, plasma, rainbow, etc.)
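The depth slider and the hover/click readout both reduce to simple queries on the depth buffer. The sketch below assumes a row-major buffer of normalized depth values; `DepthBuffer` and its methods are hypothetical names for illustration, not existing types in this project.

```swift
import Foundation

/// Row-major depth buffer; `values.count` is expected to equal `width * height`.
struct DepthBuffer {
    let width: Int
    let height: Int
    let values: [Float]

    /// Depth at a pixel (for hover/click), or nil when out of bounds.
    func depth(x: Int, y: Int) -> Float? {
        guard x >= 0, x < width, y >= 0, y < height else { return nil }
        return values[y * width + x]
    }

    /// Per-pixel mask of values inside `range` (for a depth slider).
    func mask(in range: ClosedRange<Float>) -> [Bool] {
        values.map { range.contains($0) }
    }
}
```

A slider with two handles would map directly onto the `ClosedRange` passed to `mask(in:)`, and the resulting mask could drive the overlay's alpha channel.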

Model Loading:

  • Download from HuggingFace: apple/coreml-depth-anything-small
  • Supports .mlpackage format
  • Already optimized for Apple Neural Engine
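Loading a downloaded package could follow the usual Core ML pattern of compiling the .mlpackage and then instantiating the model. This is a sketch under assumptions: the package is already on disk (the download step is omitted), `isModelPackage` and `loadDepthModel` are hypothetical helpers, and the Core ML portion is guarded so the file still compiles on platforms without Core ML.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// Hypothetical error type for this sketch.
enum DepthModelError: Error {
    case notAModelPackage
}

/// True when the URL has the .mlpackage extension.
func isModelPackage(_ url: URL) -> Bool {
    url.pathExtension.lowercased() == "mlpackage"
}

#if canImport(CoreML)
/// Compiles an .mlpackage already on disk and loads it.
/// `.all` lets Core ML choose among CPU, GPU, and Neural Engine.
func loadDepthModel(at packageURL: URL) throws -> MLModel {
    guard isModelPackage(packageURL) else {
        throw DepthModelError.notAModelPackage
    }
    let compiledURL = try MLModel.compileModel(at: packageURL)
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .all
    return try MLModel(contentsOf: compiledURL, configuration: configuration)
}
#endif
```

Compilation writes the compiled model to a temporary location, so an app would likely want to move it somewhere persistent and reuse it instead of recompiling on every launch.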

Integration Points

New Model Type:

enum ModelKind {
    case detector
    case classifier
    case embedding
    case depthEstimation  // NEW
}

Extended DetectedObject:

struct DetectedObject {
    // Existing fields for detection/classification
    // NEW: depth map data
    var depthMap: CGImage?
    var minDepth: Float?
    var maxDepth: Float?
}

Performance Considerations

  • Speed: ~25-34ms per inference (roughly 29-40 inferences per second, fast enough for real-time)
  • Size: 45.8MB (F16), reasonable for an on-device model
  • Hardware: Runs on the Neural Engine (doesn't block the GPU)

Use Cases

  1. AR on Meta Glasses - Proper 3D object placement
  2. Scene understanding - Know room dimensions, object distances
  3. Training data - Capture depth-aware datasets
  4. 3D reconstruction - Build 3D models from 2D images

🤖 Generated with Claude Code
