A React + FastAPI image annotation tool for Databricks Apps. Supports classification and bounding-box detection labeling, with project versioning, dataset export, and a Databricks dark-themed UI.
Built on Lakebase (managed PostgreSQL) for persistent storage with automatic Lakehouse Sync to Delta tables. Images are served from Unity Catalog Volumes.
- Two labeling modes: Classification (single-click) and Bounding Box Detection (draw + assign class)
- Project-based workflow: create projects from UC Volumes with custom class lists
- Keyboard-driven labeling: number keys for class selection with visual flash feedback, arrow keys for navigation
- Sample scrubber: navigate forward/backward through images, revisit and re-label previous samples
- Gallery view: thumbnail grid with status filters (all/unlabeled/labeled/skipped)
- Project versioning: clone projects to create new versions for iterative labeling
- Dataset export: one-click export to UC Volume in COCO JSON (detection) or CSV (classification) format
- Lakebase integration: auto-provisioned by default (opt out for DAB-managed deployments) with token refresh and Lakehouse Sync to Delta
- Multi-user support: user identity via Databricks SSO, per-user labeling stats
Browse UC Volumes, select a task type, define your class list, and create a labeling project.
Single-click labeling with keyboard shortcuts — press a number key to assign a class and auto-advance.
Draw bounding boxes on images and assign classes. Navigate between samples to review and re-label.
Browser ──> React SPA (Vite) ──> FastAPI backend ──> Lakebase (PostgreSQL)
                                        │
                                        ├──> UC Volumes (images)
                                        └──> Databricks SDK (workspace client)
The FastAPI backend serves the React SPA as static files and provides the `/api/` endpoints. On startup it auto-provisions a Lakebase project by default (see `LAKEBASE_AUTO_PROVISION`) and starts a background thread that refreshes database tokens every 20 minutes.
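The periodic token refresh described above can be sketched with a daemon thread and a stop event. This is a minimal illustration, not the code in `backend/lakebase.py`: the `TokenRefresher` class and the `fetch_token` callable are hypothetical names, and the real app would obtain tokens through the Databricks SDK.

```python
import threading


class TokenRefresher:
    """Keeps a credential fresh by calling `fetch_token` on a fixed interval."""

    def __init__(self, fetch_token, interval_seconds=20 * 60):
        # `fetch_token` is a hypothetical callable that returns a new token string.
        self._fetch_token = fetch_token
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._token = fetch_token()  # fetch once so a token is available immediately
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    @property
    def token(self):
        with self._lock:
            return self._token

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._interval):
            new_token = self._fetch_token()
            with self._lock:
                self._token = new_token
```

Using an `Event` for the wait, rather than `time.sleep`, lets shutdown interrupt the 20-minute pause immediately.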
| Page | Route | Description |
|---|---|---|
| Projects | `/` | List all projects, create new ones |
| Create Project | `/projects/new` | Pick UC Volume, set task type + class list |
| Project Dashboard | `/projects/:id` | Stats, gallery grid, export, start labeling |
| Labeling View | `/projects/:id/label` | Annotate images: classify or draw bounding boxes |
| Browse Volumes | `/browse` | Navigate Catalog > Schema > Volume to preview images |
| Admin | `/admin` | Lakebase status, DB connection info |
Recommended: use the Databricks Asset Bundle in this repo (`databricks bundle deploy --target dev`). It deploys the pre-annotate job and a bundle-defined app (`resources/cv_explorer_app.yml`) with declarative App resources (Lakebase postgres, UC volume, model serving endpoint, job) so they show in the Apps UI and `app.yml` can use `valueFrom`.
- Adjust `databricks.yml` → `variables` for your workspace (especially `demo_volume_full_name`, `lakebase_postgres_branch`, `lakebase_postgres_database`, `serving_endpoint_name`, `app_name`).
- Run `databricks bundle deploy --target dev` (with a configured CLI profile for that workspace).
- The app runs `app.yml` → `python start.py` → FastAPI + Uvicorn. With a Lakebase postgres App resource, the platform injects `PGHOST` / `PGUSER` / etc.; the app uses those instead of SDK auto-provision (`LAKEBASE_AUTO_PROVISION=false` in `app.yml`).
Alternatively, create an app manually from Git and copy the app.yml pattern — you must still attach matching resources in the UI and use the same valueFrom keys.
The tile on the Databricks Apps overview is an app thumbnail on the workspace object, not a static file served by FastAPI.
- Store the image in this repo as `assets/databricks-app-thumbnail.jpg` (or `.jpeg` / `.png` with that basename). Any reasonable resolution and modest file size is fine.
- Upload it once per app (after the app exists in the workspace):

      python scripts/upload_app_thumbnail.py <your-app-name>

  Or with a custom path:

      python scripts/upload_app_thumbnail.py <your-app-name> --image /path/to/icon.png

  This wraps `databricks apps update-app-thumbnail` with the JSON shape `{"app_thumbnail":{"thumbnail":"<base64>"}}` required by the Apps API.
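To make the JSON shape concrete, here is a small sketch that base64-encodes an image file and wraps it in that structure. The function name `build_thumbnail_payload` is hypothetical and this is not the contents of `scripts/upload_app_thumbnail.py`; it only shows how such a payload could be assembled before it is handed to the CLI.

```python
import base64
import json
from pathlib import Path


def build_thumbnail_payload(image_path):
    """Return the JSON string {"app_thumbnail": {"thumbnail": "<base64>"}}."""
    raw = Path(image_path).read_bytes()
    # b64encode returns bytes; decode to ASCII text so it is JSON-serializable.
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({"app_thumbnail": {"thumbnail": encoded}})
```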
With bundle deploy, resources are declared in YAML and appear on the app’s Resources tab. The service principal is the app identity; permissions are set on each resource (e.g. READ_VOLUME on the demo volume, CAN_QUERY on the serving endpoint, CAN_MANAGE_RUN on the pre-annotate job).
Export to additional volumes still requires WRITE_VOLUME on those paths (add another uc_securable resource or grant the app SP in Unity Catalog).
| Variable | Default | Description |
|---|---|---|
| `DATABRICKS_APP_PORT` | `8000` | Port for the FastAPI server |
| `DEMO_VOLUME_PATH` | (from `valueFrom: demo-volume`) | UC volume path injected from the `demo-volume` App resource (`app.yml`). |
| `SERVING_ENDPOINT` | (from `valueFrom: serving-endpoint`) | Serving endpoint name from the `serving-endpoint` resource. |
| `PRE_ANNOTATE_DATABRICKS_JOB_ID` | (from `valueFrom: app-preannotate-job`) | Numeric job id from the `app-preannotate-job` resource (bundle-bound job). |
| `LAKEBASE_AUTO_PROVISION` | `false` (bundle app) | With a postgres App resource, use `false` and connect via injected `PG*` variables (no SDK project creation on app startup). |
| `LAKEBASE_PROJECT_ID` | `cv-explorer` | Still used by async pre-annotate job clusters (SDK Lakebase path on the job). |
| `PGHOST`, `PGUSER`, … | (platform) | Set automatically when a Lakebase postgres database resource is attached. |
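As an illustration of how these variables might be read at startup, the sketch below collects them into one settings object. The `AppSettings` dataclass and `load_settings` function are hypothetical names, not the app's actual configuration code. One assumption to note: an unset `LAKEBASE_AUTO_PROVISION` is treated as enabled, following the "auto-provisioned by default" behaviour described earlier.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppSettings:
    port: int
    demo_volume_path: Optional[str]
    serving_endpoint: Optional[str]
    preannotate_job_id: Optional[int]
    lakebase_auto_provision: bool
    lakebase_project_id: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> AppSettings:
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    job_id = env.get("PRE_ANNOTATE_DATABRICKS_JOB_ID")
    # Assumption: only the literal string "false" (any case) disables provisioning.
    auto = env.get("LAKEBASE_AUTO_PROVISION", "true").strip().lower() != "false"
    return AppSettings(
        port=int(env.get("DATABRICKS_APP_PORT", "8000")),
        demo_volume_path=env.get("DEMO_VOLUME_PATH"),
        serving_endpoint=env.get("SERVING_ENDPOINT"),
        preannotate_job_id=int(job_id) if job_id else None,
        lakebase_auto_provision=auto,
        lakebase_project_id=env.get("LAKEBASE_PROJECT_ID", "cv-explorer"),
    )
```

Passing the mapping explicitly keeps the function easy to exercise in tests without touching the real environment.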
When deploying this app as part of a larger Databricks Asset Bundle (DAB), the bundle typically owns the Lakebase project lifecycle (create / grant roles / enable Lakehouse Sync / destroy). To prevent the app from trying to create a duplicate project at startup, set:

# app.yml
env:
  - name: LAKEBASE_AUTO_PROVISION
    value: "false"
  - name: LAKEBASE_PROJECT_ID
    value: "<project-id-created-by-your-bundle>"

With `LAKEBASE_AUTO_PROVISION=false`, the app:
- Looks up the Lakebase project named by `LAKEBASE_PROJECT_ID`
- Connects if it exists
- Exits with a clear error if it does not (instead of creating one)
See the Databricks Apps → Lakebase resources documentation for how DAB declares Postgres project ownership and permissions.
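The lookup-or-fail behaviour can be sketched as a small decision function. Everything here is hypothetical (`resolve_project`, `LakebaseProjectNotFound`, and the dictionary standing in for the workspace's project list); the real code would query Lakebase through the Databricks SDK, and the caller would turn the exception into a process exit.

```python
class LakebaseProjectNotFound(RuntimeError):
    """Raised when the project is missing and auto-provisioning is disabled."""


def resolve_project(project_id, existing, auto_provision, create):
    """Return the project for `project_id`, creating it only when allowed.

    `existing` maps project ids to project objects (a stand-in for a lookup
    call) and `create` is a callable that provisions a new project.
    """
    project = existing.get(project_id)
    if project is not None:
        return project
    if auto_provision:
        return create(project_id)
    raise LakebaseProjectNotFound(
        f"Lakebase project '{project_id}' not found and "
        "LAKEBASE_AUTO_PROVISION=false; create it in your bundle first."
    )
```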
This repo includes a bundle job (resources/preannotate_job.job.yml) and API routes so large pre-label batches run on a cluster instead of inside the app HTTP worker.
- From the repo root: `databricks bundle deploy` (with a configured profile).
- With `resources/cv_explorer_app.yml`, the app receives `PRE_ANNOTATE_DATABRICKS_JOB_ID` via `valueFrom: app-preannotate-job` (no manual copy). The job cluster still needs Lakebase SDK env (`LAKEBASE_AUTO_PROVISION=false`, `LAKEBASE_PROJECT_ID`) set in `resources/preannotate_job.job.yml`.
The dashboard shows Pre-label (job) when the job id resolves. The job runs scripts/preannotate_job.py with the preannotate_runs row id; it reuses the same Lakebase/SQLite and model-serving code paths as the app.
In addition to UI-driven labeling, projects can be bulk-populated via
POST /api/projects/{project_id}/import. The endpoint reads a label
file from a UC Volume by reference (so payload size is unbounded from
the HTTP side), validates every row, and commits in a single
transaction.
{
"volume_path": "/Volumes/<catalog>/<schema>/<vol>/labels.jsonl",
"format": "jsonl",
"on_missing_sample": "error",
"on_existing_annotations": "replace",
"dry_run": false
}

| Field | Default | Values |
|---|---|---|
| `volume_path` | (none) | UC Volume path readable by the app |
| `format` | (none) | `coco`, `jsonl` |
| `on_missing_sample` | `error` | `error`, `skip`, `create` |
| `on_existing_annotations` | `replace` | `replace`, `append`, `skip` |
| `dry_run` | `false` | boolean |
JSONL — one JSON object per line, blank lines ignored:
{"filename": "cat_001.jpg", "annotations": [{"label": "cat", "ann_type": "classification"}]}
{"filename": "dog_042.jpg", "annotations": [{"label": "dog", "ann_type": "bbox", "bbox_json": {"x": 0.1, "y": 0.2, "w": 0.3, "h": 0.4}}]}

Bbox coordinates are normalized (0-1). Filenames are matched against `project_samples.filename` (basename).
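A validation pass over such a file might look like the sketch below. It is an illustration only, with hypothetical helper names (`parse_jsonl`, `_annotation_problem`); the endpoint's real validation rules may be stricter. It skips blank lines, collects per-line errors instead of stopping at the first one, and checks that bbox values fall in the normalized range.

```python
import json

BBOX_KEYS = ("x", "y", "w", "h")


def _annotation_problem(ann):
    """Return a description of what is wrong with one annotation, or None."""
    if not isinstance(ann, dict) or not isinstance(ann.get("label"), str):
        return "annotation needs a string 'label'"
    if ann.get("ann_type") == "bbox":
        box = ann.get("bbox_json")
        if not isinstance(box, dict):
            return "bbox annotation needs 'bbox_json'"
        for key in BBOX_KEYS:
            value = box.get(key)
            if not isinstance(value, (int, float)) or not 0 <= value <= 1:
                return f"bbox_json['{key}'] must be a number in [0, 1]"
    return None


def parse_jsonl(lines):
    """Return (records, errors); a record is kept only if its line is clean."""
    records, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict) or not isinstance(record.get("filename"), str):
            errors.append(f"line {lineno}: missing string 'filename'")
            continue
        annotations = record.get("annotations")
        if not isinstance(annotations, list):
            errors.append(f"line {lineno}: 'annotations' must be a list")
            continue
        problems = [p for p in map(_annotation_problem, annotations) if p]
        if problems:
            errors.extend(f"line {lineno}: {p}" for p in problems)
            continue
        records.append(record)
    return records, errors
```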
COCO — standard COCO JSON. The adapter converts pixel bboxes
[x, y, w, h] to normalized coordinates using the image width /
height from the images[] section. iscrowd and segmentation are
ignored.
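The pixel-to-normalized conversion is a division by the image dimensions. The sketch below shows it under the assumption that each COCO annotation references its image through `image_id` and that `images[]` entries carry `id`, `width`, and `height`, as in standard COCO files. The function names are hypothetical and not taken from the adapter itself.

```python
def coco_bbox_to_normalized(bbox, image_width, image_height):
    """Convert a COCO pixel bbox [x, y, w, h] to normalized 0-1 coordinates."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image width and height must be positive")
    x, y, w, h = bbox
    return {
        "x": x / image_width,
        "y": y / image_height,
        "w": w / image_width,
        "h": h / image_height,
    }


def normalize_coco_annotations(coco):
    """Yield (image_id, category_id, bbox_json) for every annotation in `coco`."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    for ann in coco["annotations"]:
        width, height = sizes[ann["image_id"]]
        yield (
            ann["image_id"],
            ann["category_id"],
            coco_bbox_to_normalized(ann["bbox"], width, height),
        )
```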
- `on_missing_sample`
  - `error`: unknown filename fails the whole import
  - `skip`: unknown filenames are silently skipped
  - `create`: creates a new `ProjectSample` row if the file exists under `source_volume`; otherwise errors
- `on_existing_annotations`
  - `replace`: deletes existing annotations for the sample before inserting (emits `AnnotationHistory` delete rows)
  - `append`: adds to existing annotations (natural for multi-bbox detection)
  - `skip`: leaves samples that already have annotations untouched
- `dry_run`: runs pass 1 only and returns the counters that would result. Note: dry-run counters are conservative. They assume every import succeeds as `annotations_created`. Actual `annotations_replaced` or `samples_skipped` may differ when `on_existing_annotations` is `replace` or `skip`. A future PR will prefetch existing-annotation counts per sample for exact dry-run parity.
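The three `on_existing_annotations` policies can be modelled on plain lists to see how they differ. This is a simplified in-memory sketch with a hypothetical function name (`merge_annotations`); the real endpoint works on database rows and also records history.

```python
def merge_annotations(existing, incoming, policy):
    """Return the annotations a sample would hold after applying `policy`."""
    if policy == "replace":
        # Existing annotations are dropped, then the imported ones are inserted.
        return list(incoming)
    if policy == "append":
        # Imported annotations are added on top of whatever is already there.
        return list(existing) + list(incoming)
    if policy == "skip":
        # Samples that already have annotations are left untouched.
        return list(existing) if existing else list(incoming)
    raise ValueError(f"unknown on_existing_annotations value: {policy!r}")
```

Applying `replace` twice with the same input yields the same list, while applying `append` twice doubles the imported entries.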
- `200`: success, body has counters (`samples_touched`, `annotations_created`, `annotations_replaced`, `samples_skipped`, `samples_created`)
- `400`: bad format, unreadable `volume_path`, invalid flag value
- `404`: project not found
- `422`: validation failed, body has `errors[]` (capped at 100) and `error_count`
- `500`: commit failed, transaction rolled back
- Soft cap: 500,000 items per request. Split larger imports.
- Hard cap: 200 MB per file. Requests that reference a larger file return `400` before any parsing.
- Hard cap: 2,000,000 annotations per import (protects against pathological COCO files with one image and millions of annotations).
- Filenames must be basenames: no path separators, no `..`, no empty segments. COCO `file_name` values that include subdirectories are rejected; this keeps `project_samples.filename` consistent with `scan_volume_for_samples`, which stores basenames.
- `volume_path` must start with `/Volumes/` and contain no `..` segments. Local-filesystem paths are rejected (except in tests via a dedicated `X-Test-Allow-Local-Path` header that production never sets).
- No per-project ACLs: anyone who can call the app can import.
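The filename and `volume_path` rules above can be expressed as two small checks. The helper names (`check_filename`, `check_volume_path`) are hypothetical, and the sketch covers only the rules stated in this list, not whatever additional checks the backend performs.

```python
def check_filename(name):
    """Accept only basenames: no separators, no '..', no empty value."""
    if not name or name in {".", ".."}:
        raise ValueError(f"invalid filename: {name!r}")
    if "/" in name or "\\" in name:
        raise ValueError(f"filename must be a basename, got: {name!r}")
    return name


def check_volume_path(path):
    """Accept only paths under /Volumes/ that contain no '..' segment."""
    if not path.startswith("/Volumes/"):
        raise ValueError(f"volume_path must start with /Volumes/: {path!r}")
    if ".." in path.split("/"):
        raise ValueError(f"volume_path must not contain '..': {path!r}")
    return path
```

Splitting on `/` before testing for `..` avoids rejecting legitimate names that merely contain two dots, such as `labels..v2.jsonl`.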
- `replace` is content-idempotent; re-running produces the same state. Replacing with zero annotations transitions the sample back to `status='unlabeled'`.
- `append` is not idempotent: re-running duplicates annotations.
- Duplicate filenames within a single import are rejected in pass 1 (no partial inserts).
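import requests

# Placeholder: supply a valid Databricks access token for the app.
token = "<databricks-token>"

r = requests.post(
    "https://<app>.databricksapps.com/api/projects/42/import",
    json={
        "volume_path": "/Volumes/my_catalog/my_schema/imports/labels.jsonl",
        "format": "jsonl",
        "on_missing_sample": "error",
        "on_existing_annotations": "replace",
    },
    headers={"Authorization": f"Bearer {token}"},
)
r.raise_for_status()  # raises on 4xx/5xx responses
print(r.json())  # counters such as samples_touched and annotations_created

cv-explorer/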
├── app.yml                          # Databricks App manifest (Git deploy expects app.yml)
├── start.py                         # Uvicorn entrypoint
├── requirements.txt                 # Python dependencies
├── backend/
│   ├── main.py                      # FastAPI app entry point + startup
│   ├── models.py                    # SQLAlchemy models (Project, Sample, Annotation)
│   ├── schemas.py                   # Pydantic request/response schemas
│   ├── deps.py                      # Shared dependencies (DB session, workspace client)
│   ├── lakebase.py                  # Lakebase auto-provisioning + token refresh
│   ├── volumes.py                   # UC Volume helper functions
│   └── routes/
│       ├── projects.py              # Project CRUD endpoints
│       ├── labeling.py              # Annotation + sample endpoints
│       ├── export.py                # Dataset export to UC Volumes
│       ├── browse.py                # Volume browsing endpoints
│       └── admin.py                 # Admin + Lakebase status endpoints
├── frontend/
│   ├── src/
│   │   ├── api/client.js            # Axios API client
│   │   ├── components/
│   │   │   ├── BBoxCanvas.jsx       # Canvas for drawing bounding boxes
│   │   │   ├── AnnotationCanvas.jsx # Display-only annotation overlay
│   │   │   ├── Layout.jsx           # App shell with sidebar navigation
│   │   │   ├── FilterableSelect.jsx # Searchable dropdown component
│   │   │   └── Spinner.jsx          # Loading indicator
│   │   └── pages/
│   │       ├── ProjectsPage.jsx     # Project listing
│   │       ├── CreateProject.jsx    # New project form
│   │       ├── ProjectDashboard.jsx # Stats, gallery, export
│   │       ├── LabelingView.jsx     # Annotation interface
│   │       ├── BrowseVolumes.jsx    # UC Volume browser
│   │       └── AdminPage.jsx        # Lakebase admin panel
│   └── vite.config.js               # Vite config with /api proxy
└── docs/
    ├── phase1-design.md             # Phase 1 architecture design
    └── plans/                       # Implementation plans and design docs

Three tables, auto-created on startup with REPLICA IDENTITY FULL for Lakehouse Sync:
- labeling_projects — name, task_type (classification/detection), class_list (JSON), source_volume, version, parent_project_id
- project_samples — filepath, filename, status (unlabeled/labeled/skipped), locked_by/locked_at
- annotations — label, ann_type (classification/bbox), bbox_json (normalised 0-1 coords)
Bounding boxes are stored as normalised [0, 1] coordinates: {"x": float, "y": float, "w": float, "h": float}.
From the Project Dashboard, click Export Dataset to export labeled data to a UC Volume:
- Detection projects: COCO JSON format (`annotations.json` + `images/` directory)
- Classification projects: CSV format (`labels.csv` + `images/` directory)
- Both include a `metadata.json` with project info, class list, and export stats
Bounding box coordinates are converted from normalised (0-1) to absolute pixels in the COCO output.
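As a sketch of that conversion, the function below scales a stored box by the image size and wraps it in a COCO-style annotation entry. The name `to_coco_annotation` is hypothetical, and the `area` and `iscrowd` fields follow the general COCO convention rather than anything confirmed about `backend/routes/export.py`.

```python
def to_coco_annotation(ann_id, image_id, category_id, box, image_width, image_height):
    """Build a COCO annotation dict from a normalized {"x", "y", "w", "h"} box."""
    x = box["x"] * image_width
    y = box["y"] * image_height
    w = box["w"] * image_width
    h = box["h"] * image_height
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],  # absolute pixels, top-left origin
        "area": w * h,
        "iscrowd": 0,
    }
```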
- Frontend: React 19, Vite, React Router
- Backend: FastAPI, SQLAlchemy 2, Pydantic 2
- Database: Lakebase (managed PostgreSQL on Databricks)
- Image storage: Unity Catalog Volumes
- Auth: Databricks SSO (via Databricks Apps)
- SDK: databricks-sdk for Lakebase, Volumes, and workspace APIs


