Skip to content

earthspecies/beans-next

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BEANS-Next

Hugging Face: BEANS-Next Hugging Face: ROOTS Website CI status Pre-commit status

Earth Species Project's bioacoustics benchmark library for audio language models.

The core package (beans_next) is dependency-light: no torch, transformers, or vLLM. Models are always reached over HTTP via the predictions_v1 contract. Heavy inference lives in per-launcher virtual environments under examples/servers/.

Overview diagram

Documentation

Document What it covers
Full evaluation pipeline Complete end-to-end pipeline: rsync, key setup, weight download, serving, SLURM submit scripts, results retrieval, rescoring
Evaluation guide Per-model serving reference — hardware, local and SLURM instructions, metrics
LLM-as-judge guide All three judge modes (rubric, YES/NO, extractor); built-in templates; retroactive vs inline judging
Gemma 4 judge serving Serving cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit as a predictions_v1 judge via vLLM
Launcher serving kit All launchers, quick-start per model, conformance checks
SLURM scripts Two-job pattern (serving + inference), multi-model side-by-side
HTTP contract predictions_v1 wire schema, batching rules, endpoint spec
Paper workflow NatureLM side-by-side reproduction workflow

Qwen3-Omni (Slurm) runbook: The most up-to-date operational notes live in examples/servers/af3/README.md under “Qwen3-Omni serving notes” (single-stage vLLM-Omni YAML, known-bad nodes, $USER path expansion gotcha, /tmp/voice_samples workaround, and recommended 10-second audio cap).


Installation

Requirements: Python ≥ 3.11, uv

Install uv if you don't have it:

curl -LsSf https://astral.sh/uv/install.sh | sh

Install the library and dev dependencies:

git clone <repo>
cd beans-next
uv sync

This creates .venv/ and installs beans_next plus all dev deps. Use uv run to execute anything inside that environment.

Dataset backends

BEANS-Next supports two backends for loading evaluation data:

Backend Flag Notes
huggingface --backend huggingface Loads from HuggingFace Hub Parquet files. No private credentials needed for public datasets. Set HF_TOKEN for private repos.
esp_data --backend esp_data Loads from GCS via the esp_data library. Requires GCS credentials and the esp dependency group.

The default backend when no flag is provided is esp_data (falls back to the BEANS_NEXT_DATA_SOURCE env var, defaulting to esp_data). To use HuggingFace:

uv run beans-next run --backend huggingface --suite beans_zero_core ...

To use esp_data, also install the esp group:

uv sync --group esp

Launchers (model servers under examples/servers/) have their own isolated venvs and are set up separately — see the launcher guide.


Credentials and tokens

Many launchers and dataset backends require credentials. Store them in protected config files so they are auto-loaded without appearing in command history:

# HuggingFace token (gated model weights, private HF datasets)
mkdir -p ~/.config/huggingface && chmod 700 ~/.config/huggingface
printf 'hf_...\n' > ~/.config/huggingface/hf_token && chmod 600 ~/.config/huggingface/hf_token

# OpenAI API key (GPT-4o-audio-preview and compatible APIs)
mkdir -p ~/.config/openai && chmod 700 ~/.config/openai
printf 'OPENAI_API_KEY=sk-...\n' > ~/.config/openai/cfg && chmod 600 ~/.config/openai/cfg

# Google AI Studio API key (Gemini models)
mkdir -p ~/.config/gemini && chmod 700 ~/.config/gemini
printf 'AIza...\n' > ~/.config/gemini/cfg && chmod 600 ~/.config/gemini/cfg

The launchers auto-read from these files. You can also pass tokens as environment variables (HF_TOKEN, OPENAI_API_KEY, GEMINI_API_KEY) if you prefer.

See docs/full_evaluation_pipeline.md for full credential setup including GCS auth.


Quick start

1. Smoke test — no GPU, no API key

Verify the full pipeline works on CPU using the deterministic dummy launcher:

uv run bash scripts/smoke_test.sh

This starts the dummy server, checks contract conformance, runs a small capped suite, and writes artifacts under results/. Takes ~30 seconds.

2. Run against a real model

Pick a model and start its launcher. For full per-model instructions (weights download, SLURM scripts, GPU requirements) see the Evaluation guide.

For NatureLM-audio v1.1 real inference, see examples/servers/naturelm-v1.1/README.md for full setup instructions. Weights are gated — HF_TOKEN required.

GPT-4o-audio-preview (no GPU, OpenAI API key):

cd examples/servers/openai_compatible_proxy
uv venv && uv pip install -r requirements.txt && . .venv/bin/activate

# Recommended: store your key in a protected file:
#   mkdir -p ~/.config/openai && chmod 700 ~/.config/openai
#   printf "sk-...\n" > ~/.config/openai/cfg && chmod 600 ~/.config/openai/cfg
# For Gemini via the same proxy:
#   mkdir -p ~/.config/gemini && chmod 700 ~/.config/gemini
#   printf "AIza...\n" > ~/.config/gemini/cfg && chmod 600 ~/.config/gemini/cfg

OPENAI_PROXY_STUB=0 \
  OPENAI_BASE_URL=https://api.openai.com \
  OPENAI_MODEL=gpt-4o-audio-preview \
  PORT=8000 ./serve.sh

NatureLM-audio v1.0 / Audio Flamingo Next / Qwen3-Omni (GPU required):

# NatureLM v1.0 — start server
cd examples/servers/naturelm-v1.0
uv venv && uv pip install -r requirements.txt && . .venv/bin/activate
PORT=8000 ./serve.sh

3. Run the benchmark

With any launcher running at http://127.0.0.1:8000:

# Quick smoke check (3 tasks, 5 examples each):
uv run beans-next run \
  --predict-url http://127.0.0.1:8000/predict \
  --suite beans_zero_smoke \
  --limit 5

# Full evaluation (22 tasks, all examples):
uv run beans-next run \
  --predict-url http://127.0.0.1:8000/predict \
  --suite beans_zero_core \
  --output-dir results/my_run

Results are written to --output-dir (default results/<run_id>/):

File Content
predictions.jsonl Raw launcher responses
processed_predictions.jsonl Post-processed predictions with ground truth
scored_predictions.jsonl Post-processed predictions with computed scores

4. YAML run config (reproducible runs)

# my_run.yaml
model: gpt4o_audio_openai_api   # registry preset from beans_next/registry/model/
suite: beans_zero_core
limit: 50
out_dir: results/my_run
uv run beans-next run --config my_run.yaml

NatureLM 1.1 checkpoint-specific configs

You can benchmark different NatureLM v1.1 checkpoints by creating a run config that points the launcher at your checkpoint URI. Start from the generic template:

  • configs/benchmarks/beans_zero_core_naturelm_v1_1_checkpoint_template.yaml

Customize these fields in the YAML:

  • models[0].inline.name — unique identifier in outputs
  • models[0].inline.description — human-readable checkpoint note

Then start the launcher with your checkpoint URI:

NATURELM_GCS_CHECKPOINT_URI=gs://<your-bucket>/<path-to-checkpoint>/ \
  sbatch examples/slurm/serve_naturelm_v1_1.sh

And run with the config:

uv run beans-next run --config configs/benchmarks/beans_zero_core_naturelm_v1_1_checkpoint_template.yaml \
  -o results/naturelm_v1_1_custom_$(date +%Y%m%d)

Available registry model presets:

Preset Model
dummy_local_8000 Deterministic stub (CPU)
naturelm_v1_0_local_8000 NatureLM-audio v1.0
naturelm_v1_1_local_8001 NatureLM-audio v1.1
gpt4o_audio_openai_api GPT-4o-audio-preview
gemini_openai_api Gemini (any version)
qwen3_omni_vllm_local_8000 Qwen3-Omni-7B via vLLM
af_next_local_8000 Audio Flamingo Next

5. Re-score without re-running inference

Every run saves three JSONL files to --output-dir:

File Content
predictions.jsonl Raw model responses — unprocessed predictions[0] strings
processed_predictions.jsonl After post-processing (fuzzy match, comma split, etc.)
scored_predictions.jsonl After scoring (accuracy, F1, mAP, CIDEr, …)

Use predictions.jsonl as input to score-from-file — it contains the original raw text so all three rescoring paths (normal pipeline, YES/NO judge, extractor judge) can be applied retroactively without re-running inference.

# Normal pipeline (post-process + metrics)
uv run beans-next score-from-file results/my_run/predictions.jsonl \
  -o results/my_run_rescored

# Add YES/NO judge pass
uv run beans-next score-from-file results/my_run/predictions.jsonl \
  --judge-url http://127.0.0.1:8010/predict \
  -o results/my_run_rescored

# Add extractor judge pass (judge converts raw output → clean label → full metrics)
uv run beans-next score-from-file results/my_run/predictions.jsonl \
  --judge-extract-url http://127.0.0.1:8010/predict \
  --task-type classification \
  -o results/my_run_rescored

# All three at once — output files never overlap (judge_* and judge_extracted_* prefixes)
uv run beans-next score-from-file results/my_run/predictions.jsonl \
  --judge-url http://127.0.0.1:8010/predict \
  --judge-extract-url http://127.0.0.1:8010/predict \
  --task-type classification \
  -o results/my_run_rescored

Running on a SLURM cluster

See examples/slurm/README.md for the two-job pattern (GPU serving job + CPU inference job) and per-model SLURM scripts.

Quick example:

# Submit serving job (GPU node), then inference job (CPU node)
SERVE_JOB=$(sbatch --parsable examples/slurm/serve_af3.sh)

BEANS_NEXT_URL_FILE=$HOME/beans-next-launchers/$SERVE_JOB.url \
BEANS_NEXT_SUITE=beans_zero_core \
sbatch --dependency=after:$SERVE_JOB examples/slurm/run_inference.sh

Metrics

BEANS-Next applies deterministic scorers automatically based on task type. All scorers are in beans_next.metrics.

Scoring by task type

Task type Post-processing Scored by
classification comma-split + Levenshtein fuzzy match to label vocab top1_accuracy, accuracy, precision, recall, f1
detection comma-split + fuzzy match average_precision, precision, recall, f1
captioning whitespace normalise corpus cider in summary.json (per-sample scores empty); optional spider (CIDEr + SPICE) needs Java
open_ended, counting, qa whitespace normalise LLM judge (see below)

Classification

Scorer Description
top1_accuracy Correct if prediction matches any option in a comma-separated target string ("cat, feline" → both "cat" and "feline" are correct)
accuracy Exact-string match or multilabel exact-row match
precision, recall, f1 Support average= in {"macro", "micro", "weighted", "binary"}

Label matching uses Levenshtein fuzzy matching (max_distance=5): a predicted string within 5 edit operations of a known label is snapped to that label before scoring.

Per-dataset fixed label vocabularies are loaded from beans_next/registry/beans_zero_labels.json (21 datasets, e.g. ESC-50 → 50 labels, unseen-species-cmn → 202). Inline labels in an eval-task YAML takes priority over the registry.

Detection / multi-label

Scorer Description
average_precision Per-label PR-curve integral, averaged across labels (macro) or pooled (micro)
precision, recall, f1 Multi-label, all average= modes supported

Captioning — CIDEr (default)

CIDEr — TF-IDF n-gram (1–4) cosine similarity with Gaussian length penalty, computed once over the full test split (corpus IDF). Pure Python / NumPy. Reported as metrics.mean.cider in summary.json (normalized to [0, 1] from the internal ×10 scale).

Captioning — SPIDEr (optional, Java)

SPIDEr = (CIDEr / 10 + SPICE) / 2

SPICE — scene-graph F1 via Java subprocess. Requires Java ≥ 8 and Stanford CoreNLP 3.6.0 JARs. Download once:

uv run beans-next setup-spice

JARs are cached to ~/.cache/beans-next/spice/lib/. The registered spider() scorer uses SPICE when available; otherwise SPICE is treated as 0.0 and a warning is logged.

LLM-as-judge

For open-ended tasks (descriptions, counting, QA), deterministic metrics are insufficient because the same correct answer can be phrased many different ways — and models may refuse, hedge, or give structured vs. prose responses unpredictably.

beans-next supports three judge modes. Pick based on your task:

Mode Class Flag Output
1 — Rubric judge JudgeScorer --judge-url Structured score (0–1) via judge_scores_v1 endpoint
2 — YES/NO judge PredictV1Judge --judge-url Binary judge_accuracy via predictions_v1 endpoint
3 — Extractor judge PredictV1Extractor --judge-extract-url Structured prediction → full metrics via predictions_v1 endpoint

Mode 1 — dedicated judge_scores_v1 endpoint (existing rubric-based judge):

uv run beans-next run \
  --predict-url http://127.0.0.1:8000/predict \
  --judge-url   http://127.0.0.1:8010/judge \
  --suite beans_zero_core

Built-in rubric templates:

Template id Best for
bioacoustic_open_qa_v1 Soundscape descriptions, call-type explanations (default)
bioacoustic_counting_v1 "Count vocalizations per species" tasks; handles refusals, common names, flexible formatting

Per-task template override: add judge: bioacoustic_counting_v1 to the eval-task YAML.

Mode 2 — YES/NO binary scoring via any predictions_v1 model (e.g. Gemma 4):

# Run inline — judge fires after inference completes
uv run beans-next run \
  --predict-url http://127.0.0.1:8000/predict \
  --judge-url   http://127.0.0.1:8010/predict \
  --suite beans_zero_core

# Or retroactively on saved predictions
uv run beans-next score-from-file results/my_run/processed_predictions.jsonl \
  --judge-url http://127.0.0.1:8010/predict \
  -o results/my_run_judge

Output artifacts: judge_outputs.jsonl, judge_scored_predictions.jsonl, judge_summary.json.

Mode 3 — structured extraction: judge converts verbose model output to a clean label/description, then full metrics run:

uv run beans-next score-from-file results/my_run/processed_predictions.jsonl \
  --judge-extract-url http://127.0.0.1:8010/predict \
  --task-type classification \
  -o results/my_run_extracted

--task-type selects the extraction template (classification, detection, captioning). Output: judge_extracted_scored_predictions.jsonl, judge_extracted_summary.json.

Full details: docs/llm_judge.md · Gemma 4 serving guide: docs/judge_model_gemma4.md


Adding a new launcher

A launcher is a self-contained FastAPI server. Minimal contract:

POST /predict  — accepts predictions_v1 request, returns predictions_v1 response
GET  /info     — capability document (name, model, audio_payload_types, …)
GET  /health   — readiness probe

Start from examples/servers/hf_transformers/ (generic Tier-2 template). Full contract spec: docs/http_contract.md.

Key rules:

  • One response item per request sample_id; match by id, not array position
  • HTTP 413 when len(requests) > max_batch_size
  • Per-item errors go in responses[i].error; HTTP status stays 200
  • Isolated venv — do not import beans_next

Launcher conformance check

uv run bash scripts/check_launcher.sh http://127.0.0.1:<port>

Development

uv run ruff check --fix .
uv run python -c "import beans_next"
uv run pytest -q

Architecture

  • HTTP-only inference — no in-process model execution in core.
  • Wire schemapredictions_v1 only.
  • No heavy deps in corebeans_next does not depend on torch, transformers, or vLLM.

Full spec: DESIGN.md, AGENT_SPEC.md, INCREMENTS.md.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors