Earth Species Project's bioacoustics benchmark library for audio language models.
The core package (beans_next) is dependency-light: no torch, transformers, or vLLM. Models are always reached over HTTP via the predictions_v1 contract. Heavy inference lives in per-launcher virtual environments under examples/servers/.
| Document | What it covers |
|---|---|
| Full evaluation pipeline | Complete end-to-end pipeline: rsync, key setup, weight download, serving, SLURM submit scripts, results retrieval, rescoring |
| Evaluation guide | Per-model serving reference — hardware, local and SLURM instructions, metrics |
| LLM-as-judge guide | All three judge modes (rubric, YES/NO, extractor); built-in templates; retroactive vs inline judging |
| Gemma 4 judge serving | Serving cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit as a predictions_v1 judge via vLLM |
| Launcher serving kit | All launchers, quick-start per model, conformance checks |
| SLURM scripts | Two-job pattern (serving + inference), multi-model side-by-side |
| HTTP contract | predictions_v1 wire schema, batching rules, endpoint spec |
| Paper workflow | NatureLM side-by-side reproduction workflow |
Qwen3-Omni (Slurm) runbook: The most up-to-date operational notes live in
examples/servers/af3/README.md under “Qwen3-Omni serving notes” (single-stage vLLM-Omni YAML, known-bad nodes, $USER path expansion gotcha, /tmp/voice_samples workaround, and recommended 10-second audio cap).
Requirements: Python ≥ 3.11, uv
Install uv if you don't have it:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install the library and dev dependencies:
git clone <repo>
cd beans-next
uv sync
This creates .venv/ and installs beans_next plus all dev deps. Use uv run to execute anything inside that environment.
BEANS-Next supports two backends for loading evaluation data:
| Backend | Flag | Notes |
|---|---|---|
| huggingface | --backend huggingface | Loads from HuggingFace Hub Parquet files. No private credentials needed for public datasets. Set HF_TOKEN for private repos. |
| esp_data | --backend esp_data | Loads from GCS via the esp_data library. Requires GCS credentials and the esp dependency group. |
When no --backend flag is given, the backend is read from the BEANS_NEXT_DATA_SOURCE env var, which itself defaults to esp_data. To use HuggingFace:
uv run beans-next run --backend huggingface --suite beans_zero_core ...
To use esp_data, also install the esp group:
uv sync --group esp
Launchers (model servers under examples/servers/) have their own isolated venvs and are set up separately; see the launcher guide.
Many launchers and dataset backends require credentials. Store them in protected config files so they are auto-loaded without appearing in command history:
# HuggingFace token (gated model weights, private HF datasets)
mkdir -p ~/.config/huggingface && chmod 700 ~/.config/huggingface
printf 'hf_...\n' > ~/.config/huggingface/hf_token && chmod 600 ~/.config/huggingface/hf_token
# OpenAI API key (GPT-4o-audio-preview and compatible APIs)
mkdir -p ~/.config/openai && chmod 700 ~/.config/openai
printf 'OPENAI_API_KEY=sk-...\n' > ~/.config/openai/cfg && chmod 600 ~/.config/openai/cfg
# Google AI Studio API key (Gemini models)
mkdir -p ~/.config/gemini && chmod 700 ~/.config/gemini
printf 'AIza...\n' > ~/.config/gemini/cfg && chmod 600 ~/.config/gemini/cfg
The launchers auto-read from these files. You can also pass tokens as environment variables (HF_TOKEN, OPENAI_API_KEY, GEMINI_API_KEY) if you prefer.
See docs/full_evaluation_pipeline.md for full credential setup including GCS auth.
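For readers writing their own launcher, a credential lookup along these lines might look like the sketch below. It is not the launchers' actual code: load_secret is a hypothetical helper, and the precedence (environment variable before file) is an assumption, since the text above does not say which source wins.

```python
import os
from pathlib import Path


def load_secret(env_var, file_path, environ=None):
    """Return a secret from the environment, else from a protected file, else None."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "").strip()
    if value:
        return value  # assumption: an explicit env var overrides the file
    path = Path(file_path).expanduser()
    if not path.is_file():
        return None
    text = path.read_text(encoding="utf-8").strip()
    # Accept both a bare token and a KEY=value line, since both shapes
    # appear in the setup commands above.
    prefix = env_var + "="
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text or None


hf_token = load_secret("HF_TOKEN", "~/.config/huggingface/hf_token")
```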
Verify the full pipeline works on CPU using the deterministic dummy launcher:
uv run bash scripts/smoke_test.sh
This starts the dummy server, checks contract conformance, runs a small capped suite, and writes artifacts under results/. It takes about 30 seconds.
Pick a model and start its launcher. For full per-model instructions (weights download, SLURM scripts, GPU requirements) see the Evaluation guide.
For NatureLM-audio v1.1 real inference, see examples/servers/naturelm-v1.1/README.md for full setup instructions. Weights are gated, so HF_TOKEN is required.
GPT-4o-audio-preview (no GPU, OpenAI API key):
cd examples/servers/openai_compatible_proxy
uv venv && uv pip install -r requirements.txt && . .venv/bin/activate
# Recommended: store your key in a protected file:
# mkdir -p ~/.config/openai && chmod 700 ~/.config/openai
# printf "sk-...\n" > ~/.config/openai/cfg && chmod 600 ~/.config/openai/cfg
# For Gemini via the same proxy:
# mkdir -p ~/.config/gemini && chmod 700 ~/.config/gemini
# printf "AIza...\n" > ~/.config/gemini/cfg && chmod 600 ~/.config/gemini/cfg
OPENAI_PROXY_STUB=0 \
OPENAI_BASE_URL=https://api.openai.com \
OPENAI_MODEL=gpt-4o-audio-preview \
PORT=8000 ./serve.sh
NatureLM-audio v1.0 / Audio Flamingo Next / Qwen3-Omni (GPU required):
# NatureLM v1.0 — start server
cd examples/servers/naturelm-v1.0
uv venv && uv pip install -r requirements.txt && . .venv/bin/activate
PORT=8000 ./serve.sh
With any launcher running at http://127.0.0.1:8000:
# Quick smoke check (3 tasks, 5 examples each):
uv run beans-next run \
--predict-url http://127.0.0.1:8000/predict \
--suite beans_zero_smoke \
--limit 5
# Full evaluation (22 tasks, all examples):
uv run beans-next run \
--predict-url http://127.0.0.1:8000/predict \
--suite beans_zero_core \
--output-dir results/my_run
Results are written to --output-dir (default results/<run_id>/):
| File | Content |
|---|---|
| predictions.jsonl | Raw launcher responses |
| processed_predictions.jsonl | Post-processed predictions with ground truth |
| scored_predictions.jsonl | Post-processed predictions with computed scores |
# my_run.yaml
model: gpt4o_audio_openai_api # registry preset from beans_next/registry/model/
suite: beans_zero_core
limit: 50
out_dir: results/my_run
uv run beans-next run --config my_run.yaml
You can benchmark different NatureLM v1.1 checkpoints by creating a run config that points the launcher at your checkpoint URI. Start from the generic template:
configs/benchmarks/beans_zero_core_naturelm_v1_1_checkpoint_template.yaml
Customize these fields in the YAML:
- models[0].inline.name: unique identifier in outputs
- models[0].inline.description: human-readable checkpoint note
Then start the launcher with your checkpoint URI:
NATURELM_GCS_CHECKPOINT_URI=gs://<your-bucket>/<path-to-checkpoint>/ \
sbatch examples/slurm/serve_naturelm_v1_1.sh
And run with the config:
uv run beans-next run --config configs/benchmarks/beans_zero_core_naturelm_v1_1_checkpoint_template.yaml \
-o results/naturelm_v1_1_custom_$(date +%Y%m%d)
Available registry model presets:
| Preset | Model |
|---|---|
| dummy_local_8000 | Deterministic stub (CPU) |
| naturelm_v1_0_local_8000 | NatureLM-audio v1.0 |
| naturelm_v1_1_local_8001 | NatureLM-audio v1.1 |
| gpt4o_audio_openai_api | GPT-4o-audio-preview |
| gemini_openai_api | Gemini (any version) |
| qwen3_omni_vllm_local_8000 | Qwen3-Omni-7B via vLLM |
| af_next_local_8000 | Audio Flamingo Next |
Every run saves three JSONL files to --output-dir:
| File | Content |
|---|---|
| predictions.jsonl | Raw model responses: unprocessed predictions[0] strings |
| processed_predictions.jsonl | After post-processing (fuzzy match, comma split, etc.) |
| scored_predictions.jsonl | After scoring (accuracy, F1, mAP, CIDEr, …) |
Use predictions.jsonl as input to score-from-file — it contains the original raw text so all three rescoring paths (normal pipeline, YES/NO judge, extractor judge) can be applied retroactively without re-running inference.
# Normal pipeline (post-process + metrics)
uv run beans-next score-from-file results/my_run/predictions.jsonl \
-o results/my_run_rescored
# Add YES/NO judge pass
uv run beans-next score-from-file results/my_run/predictions.jsonl \
--judge-url http://127.0.0.1:8010/predict \
-o results/my_run_rescored
# Add extractor judge pass (judge converts raw output → clean label → full metrics)
uv run beans-next score-from-file results/my_run/predictions.jsonl \
--judge-extract-url http://127.0.0.1:8010/predict \
--task-type classification \
-o results/my_run_rescored
# All three at once — output files never overlap (judge_* and judge_extracted_* prefixes)
uv run beans-next score-from-file results/my_run/predictions.jsonl \
--judge-url http://127.0.0.1:8010/predict \
--judge-extract-url http://127.0.0.1:8010/predict \
--task-type classification \
-o results/my_run_rescored
See examples/slurm/README.md for the two-job pattern (GPU serving job + CPU inference job) and per-model SLURM scripts.
Quick example:
# Submit serving job (GPU node), then inference job (CPU node)
SERVE_JOB=$(sbatch --parsable examples/slurm/serve_af3.sh)
BEANS_NEXT_URL_FILE=$HOME/beans-next-launchers/$SERVE_JOB.url \
BEANS_NEXT_SUITE=beans_zero_core \
sbatch --dependency=after:$SERVE_JOB examples/slurm/run_inference.sh
BEANS-Next applies deterministic scorers automatically based on task type. All scorers are in beans_next.metrics.
| Task type | Post-processing | Scored by |
|---|---|---|
| classification | comma-split + Levenshtein fuzzy match to label vocab | top1_accuracy, accuracy, precision, recall, f1 |
| detection | comma-split + fuzzy match | average_precision, precision, recall, f1 |
| captioning | whitespace normalise | corpus cider in summary.json (per-sample scores empty); optional spider (CIDEr + SPICE) needs Java |
| open_ended, counting, qa | whitespace normalise | LLM judge (see below) |
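As a rough illustration of the post-processing column, the sketch below dispatches on task type and applies either a comma split or whitespace normalisation. The function names are hypothetical, and the fuzzy-match step that follows the comma split is deliberately left out.

```python
def comma_split(raw):
    """Split a raw prediction on commas and drop empty pieces."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def normalise_whitespace(raw):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(raw.split())


def post_process(task_type, raw):
    """Apply the post-processing step listed for each task type above."""
    if task_type in ("classification", "detection"):
        return comma_split(raw)  # fuzzy matching to the label vocab would follow
    # captioning, open_ended, counting, qa
    return normalise_whitespace(raw)


print(post_process("classification", " cat , dog,, "))    # ['cat', 'dog']
print(post_process("captioning", "  a  bird\n sings "))   # a bird sings
```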
| Scorer | Description |
|---|---|
| top1_accuracy | Correct if prediction matches any option in a comma-separated target string ("cat, feline" → both "cat" and "feline" are correct) |
| accuracy | Exact-string match or multilabel exact-row match |
| precision, recall, f1 | Support average= in {"macro", "micro", "weighted", "binary"} |
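The top1_accuracy rule can be pictured with the short sketch below. It is not the implementation in beans_next.metrics; the helper names are made up for illustration, and the case-insensitive comparison is an assumption.

```python
def top1_correct(prediction, target):
    """True if the prediction equals any comma-separated option in the target."""
    options = {opt.strip().lower() for opt in target.split(",") if opt.strip()}
    # Assumption: comparison ignores case and surrounding whitespace.
    return prediction.strip().lower() in options


def top1_accuracy(predictions, targets):
    """Fraction of predictions that match one of their target options."""
    if not predictions:
        return 0.0
    hits = sum(top1_correct(p, t) for p, t in zip(predictions, targets))
    return hits / len(predictions)


print(top1_correct("feline", "cat, feline"))                      # True
print(top1_accuracy(["cat", "dog"], ["cat, feline", "bird"]))     # 0.5
```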
Label matching uses Levenshtein fuzzy matching (max_distance=5): a predicted string within 5 edit operations of a known label is snapped to that label before scoring.
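To make the snapping step concrete, here is a minimal sketch of Levenshtein matching with a distance budget. The function names, the lowercase comparison, the first-match tie-breaking, and the behaviour when no label is close enough (return the prediction unchanged) are all assumptions rather than documented library behaviour.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def snap_to_label(prediction, labels, max_distance=5):
    """Return the closest label within max_distance, else the prediction unchanged."""
    best_label, best_dist = None, max_distance + 1
    for label in labels:
        dist = levenshtein(prediction.lower(), label.lower())
        if dist < best_dist:  # strict '<' keeps the first label on ties
            best_label, best_dist = label, dist
    return best_label if best_label is not None else prediction


print(snap_to_label("dog barkk", ["dog bark", "rain"]))
print(snap_to_label("helicopter", ["dog", "rain"]))
```

One consequence worth keeping in mind: with a budget of 5 edits, very short strings can fall within range of unrelated short labels ("cat" is 3 edits from "dog"), so the vocabulary and tie-breaking matter.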
Per-dataset fixed label vocabularies are loaded from beans_next/registry/beans_zero_labels.json (21 datasets, e.g. ESC-50 → 50 labels, unseen-species-cmn → 202). Inline labels in an eval-task YAML take priority over the registry.
| Scorer | Description |
|---|---|
| average_precision | Per-label PR-curve integral, averaged across labels (macro) or pooled (micro) |
| precision, recall, f1 | Multi-label, all average= modes supported |
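The difference between macro and micro average precision can be sketched as below. This is a generic step-wise PR-curve integral, not necessarily how beans_next.metrics computes it: the input shape (a dict mapping each label to a pair of score and truth lists) is hypothetical, and tied scores are not handled specially.

```python
def average_precision(scores, truths):
    """Step-wise area under the precision-recall curve for one label."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(truths)
    if total_pos == 0:
        return 0.0  # assumption: a label with no positives scores 0
    hits, ap = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if truths[idx]:
            hits += 1
            ap += hits / rank  # precision at each new recall step
    return ap / total_pos


def macro_average_precision(per_label):
    """Mean of per-label AP values."""
    values = [average_precision(s, t) for s, t in per_label.values()]
    return sum(values) / len(values) if values else 0.0


def micro_average_precision(per_label):
    """AP over all (score, truth) pairs pooled across labels."""
    scores = [s for sc, _ in per_label.values() for s in sc]
    truths = [t for _, tr in per_label.values() for t in tr]
    return average_precision(scores, truths)


data = {
    "frog": ([0.9, 0.1], [1, 0]),
    "owl": ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
}
print(macro_average_precision(data), micro_average_precision(data))
```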
CIDEr — TF-IDF n-gram (1–4) cosine similarity with Gaussian length penalty, computed once over the full test split (corpus IDF). Pure Python / NumPy. Reported as metrics.mean.cider in summary.json (normalized to [0, 1] from the internal ×10 scale).
SPIDEr = (CIDEr / 10 + SPICE) / 2
SPICE — scene-graph F1 via Java subprocess. Requires Java ≥ 8 and Stanford CoreNLP 3.6.0 JARs. Download once:
uv run beans-next setup-spice
JARs are cached to ~/.cache/beans-next/spice/lib/. The registered spider() scorer uses SPICE when available; otherwise SPICE is treated as 0.0 and a warning is logged.
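The SPIDEr formula and the SPICE fallback can be worked through in a few lines. This is a sketch of the arithmetic only; spider_score is a hypothetical name and does not mirror the signature of the registered spider() scorer.

```python
import logging

logger = logging.getLogger("spider_sketch")  # hypothetical logger name


def spider_score(cider_x10, spice=None):
    """SPIDEr = (CIDEr / 10 + SPICE) / 2, with CIDEr on its internal x10 scale.

    When SPICE is unavailable (spice is None), it is treated as 0.0 and a
    warning is logged, matching the fallback described above.
    """
    if spice is None:
        logger.warning("SPICE unavailable; treating it as 0.0")
        spice = 0.0
    return (cider_x10 / 10.0 + spice) / 2.0


print(spider_score(5.0, 0.5))  # 0.5
print(spider_score(5.0))       # 0.25
```

Without Java, the score is therefore capped at half the normalized CIDEr value, which is worth remembering when comparing runs from machines with and without SPICE installed.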
For open-ended tasks (descriptions, counting, QA), deterministic metrics are insufficient because the same correct answer can be phrased many different ways — and models may refuse, hedge, or give structured vs. prose responses unpredictably.
beans-next supports three judge modes. Pick based on your task:
| Mode | Class | Flag | Output |
|---|---|---|---|
| 1: Rubric judge | JudgeScorer | --judge-url | Structured score (0–1) via judge_scores_v1 endpoint |
| 2: YES/NO judge | PredictV1Judge | --judge-url | Binary judge_accuracy via predictions_v1 endpoint |
| 3: Extractor judge | PredictV1Extractor | --judge-extract-url | Structured prediction → full metrics via predictions_v1 endpoint |
Mode 1 — dedicated judge_scores_v1 endpoint (existing rubric-based judge):
uv run beans-next run \
--predict-url http://127.0.0.1:8000/predict \
--judge-url http://127.0.0.1:8010/judge \
--suite beans_zero_core
Built-in rubric templates:
| Template id | Best for |
|---|---|
| bioacoustic_open_qa_v1 | Soundscape descriptions, call-type explanations (default) |
| bioacoustic_counting_v1 | "Count vocalizations per species" tasks; handles refusals, common names, flexible formatting |
Per-task template override: add judge: bioacoustic_counting_v1 to the eval-task YAML.
Mode 2 — YES/NO binary scoring via any predictions_v1 model (e.g. Gemma 4):
# Run inline — judge fires after inference completes
uv run beans-next run \
--predict-url http://127.0.0.1:8000/predict \
--judge-url http://127.0.0.1:8010/predict \
--suite beans_zero_core
# Or retroactively on saved predictions
uv run beans-next score-from-file results/my_run/processed_predictions.jsonl \
--judge-url http://127.0.0.1:8010/predict \
-o results/my_run_judge
Output artifacts: judge_outputs.jsonl, judge_scored_predictions.jsonl, judge_summary.json.
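Conceptually, the YES/NO mode reduces each judge reply to a binary verdict and averages them. The sketch below shows one way that could work; it is not PredictV1Judge itself, the helper names are invented, and dropping unparseable replies from the average is an assumption.

```python
def parse_yes_no(reply):
    """Map a judge reply to 1 (YES), 0 (NO), or None when neither leads the text."""
    words = reply.strip().upper().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "YES":
        return 1
    if words[0] == "NO":
        return 0
    return None


def judge_accuracy(replies):
    """Mean verdict over parseable replies (unparseable ones are skipped here)."""
    verdicts = [parse_yes_no(r) for r in replies]
    scored = [v for v in verdicts if v is not None]
    return sum(scored) / len(scored) if scored else 0.0


print(judge_accuracy(["YES", "No, the species differs", "yes."]))
```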
Mode 3 — structured extraction: judge converts verbose model output to a clean label/description, then full metrics run:
uv run beans-next score-from-file results/my_run/processed_predictions.jsonl \
--judge-extract-url http://127.0.0.1:8010/predict \
--task-type classification \
-o results/my_run_extracted
--task-type selects the extraction template (classification, detection, captioning). Output: judge_extracted_scored_predictions.jsonl, judge_extracted_summary.json.
Full details: docs/llm_judge.md · Gemma 4 serving guide: docs/judge_model_gemma4.md
A launcher is a self-contained FastAPI server. Minimal contract:
POST /predict — accepts predictions_v1 request, returns predictions_v1 response
GET /info — capability document (name, model, audio_payload_types, …)
GET /health — readiness probe
Start from examples/servers/hf_transformers/ (generic Tier-2 template).
Full contract spec: docs/http_contract.md.
Key rules:
- One response item per request sample_id; match by id, not array position
- HTTP 413 when len(requests) > max_batch_size
- Per-item errors go in responses[i].error; HTTP status stays 200
- Isolated venv: do not import beans_next
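The key rules above can be sketched as framework-free handler logic that a FastAPI route could wrap. This is a simplified illustration, not the predictions_v1 schema: only sample_id, responses, error, and predictions appear in this README, while handle_predict, predict_one, index_by_sample_id, and the shape of the 413 body are hypothetical. See docs/http_contract.md for the real wire format.

```python
def handle_predict(requests, predict_one, max_batch_size=8):
    """Return (http_status, body) for a batch, following the key rules above."""
    if len(requests) > max_batch_size:
        # Oversized batch: reject the whole request with HTTP 413.
        message = f"batch of {len(requests)} exceeds max_batch_size={max_batch_size}"
        return 413, {"error": message}
    responses = []
    for item in requests:
        entry = {"sample_id": item["sample_id"]}  # one response per sample_id
        try:
            entry["predictions"] = [predict_one(item)]
        except Exception as exc:  # per-item failure must not fail the batch
            entry["error"] = str(exc)
        responses.append(entry)
    return 200, {"responses": responses}


def index_by_sample_id(body):
    """Client side: look responses up by id rather than by array position."""
    return {r["sample_id"]: r for r in body["responses"]}


def demo_model(item):
    """Stand-in model used only for this example."""
    if item["sample_id"] == "b":
        raise ValueError("bad audio")
    return "cat"


status, body = handle_predict([{"sample_id": "a"}, {"sample_id": "b"}], demo_model)
print(status, index_by_sample_id(body)["b"]["error"])  # 200 bad audio
```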
uv run bash scripts/check_launcher.sh http://127.0.0.1:<port>
uv run ruff check --fix .
uv run python -c "import beans_next"
uv run pytest -q
- HTTP-only inference: no in-process model execution in core.
- Wire schema: predictions_v1 only.
- No heavy deps in core: beans_next does not depend on torch, transformers, or vLLM.
Full spec: DESIGN.md, AGENT_SPEC.md, INCREMENTS.md.
