Burla Examples

Plain Python in. Remote hardware out. These examples are copyable scripts that scale onto CPUs and A100 GPUs, run in custom Docker images, and set explicit concurrency limits, all without turning the project into a distributed-systems rewrite.

21 example folders: from one-file fan-out to full pipelines
9 live demos: with published findings and artifacts
CPU, GPU, Docker: changed per function call
One Python API: `remote_parallel_map`

Live gallery · Burla docs · Pick a collection · What Burla shows off

Pick a collection

Collection	Start here if you want to see...	Examples
Data stories with live sites	finished, explorable outputs built from large public datasets	Airbnb, Kentucky Derby, Amazon Reviews, NYC Taxi, arXiv, The Met, World Photo Index, GitHub READMEs, Hospital Prices
ML, embeddings, and vision	model-heavy jobs where runtime and hardware choice matter	A100 embeddings, batch inference
Production data jobs	the scripts data teams actually need to make fast and reliable	image resize, Parquet, pandas, ETL, APIs, scraping
Native tools and simulations	binaries, geospatial dependencies, and massive independent compute	BWA-MEM, GDAL, Monte Carlo

Data stories with live sites

These are the showpieces: real corpora, real scale, and static sites you can open before reading a single line of code.

Airbnb at continental scale

119 cities, 1.7M photos, 50.7M reviews.

CLIP scores images, A100s embed review shortlists, Claude validates visual finds, and bootstrap CIs test whether the weird stuff affects demand.

Live demo · Source
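The bootstrap step can be illustrated on its own. What follows is a minimal sketch of a percentile bootstrap confidence interval for a difference in means, using only the standard library. The function name `bootstrap_diff_ci`, the sample numbers, and the choice of the percentile method are assumptions made for illustration; the demo's own statistics code may differ.

```python
import random
from statistics import mean


def bootstrap_diff_ci(treated, control, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treated) - mean(control)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical data: reviews per month for listings with and without an unusual photo.
with_feature = [4.1, 3.8, 5.0, 4.6, 4.9, 5.3, 4.4, 4.7]
without_feature = [3.9, 4.0, 3.7, 4.2, 3.6, 4.1, 3.8, 4.0]
low, high = bootstrap_diff_ci(with_feature, without_feature)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```

If the interval excludes zero, the data are at least consistent with a real difference. Each resample is independent of the others, which is what makes this kind of test easy to fan out.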

Kentucky Derby prediction and audit

1T Monte Carlo sims in 18.3 minutes, then 2,000 audit permutations in 13.8 seconds.

The same Burla cluster runs the prediction and then stress-tests the model hard enough to publish where the method is fragile.

Live demo · Prediction source · Demo source

Amazon Review Distiller

571M reviews, 275GB JSONL, 500+ parallel CPUs.

Score every public Amazon review deterministically, keep tiny heaps per shard, and reduce them into searchable findings.

Live demo · Source
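The "tiny heaps per shard" pattern can be sketched in a few lines: each shard keeps only its top `k` scored items, and a final step merges those small heaps. The scoring rule, the function names, and the sample reviews below are hypothetical stand-ins, not the distiller's actual code.

```python
import heapq


def score(review: str) -> int:
    """Hypothetical deterministic score: number of exclamation marks."""
    return review.count("!")


def top_k_for_shard(reviews, k=2):
    """Keep only the k highest-scoring reviews seen in one shard."""
    heap = []  # min-heap of (score, review); the weakest kept item is heap[0]
    for review in reviews:
        item = (score(review), review)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the weakest kept item
    return heap


def reduce_heaps(heaps, k=2):
    """Merge per-shard heaps into one global top-k, best first."""
    return heapq.nlargest(k, (item for heap in heaps for item in heap))


shards = [
    ["ok", "great!!", "wow!"],
    ["meh", "amazing!!!", "fine"],
]
# With Burla, top_k_for_shard would be the function handed to remote_parallel_map.
per_shard = [top_k_for_shard(shard) for shard in shards]
best = reduce_heaps(per_shard)
print(best)
```

Because each worker returns at most `k` items, the reduce step stays small no matter how large the shards are.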

NYC Ghost Neighborhoods

2.76B taxi and FHV trips in about 15 seconds.

Scan every monthly public trip file to find zones that faded, recovered, or became newly important after the pandemic.

Live demo · Source

Fossils of the arXiv

2.71M abstracts embedded and clustered.

Embed the full arXiv metadata corpus to find extinct topics, emerging clusters, and isolated papers.

Live demo · Source

The Met's Hidden Twins

192K public-domain artwork images.

Fetch Open Access museum images, embed them with CLIP, search with FAISS, and surface visual near-duplicates across centuries.

Live demo · Source

World Photo Index

9.49M geotagged Flickr photos, 967 workers, about 8 minutes.

Reverse-geocode public photos and build country-level signatures from user-written tags.

Live demo · Source

One Million GitHub READMEs

1.2M READMEs, 2.3B upstream file rows.

Shard deterministic summarizers, write per-shard JSON to shared storage, and reduce category stats without calling an LLM.

Live demo · Source
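A rough sketch of the shard, write, reduce flow described above, using a temporary directory in place of shared storage. The categorizer rules, file naming, and function names are assumptions made for illustration only.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def categorize(readme: str) -> str:
    """Hypothetical rule-based categorizer; no LLM involved."""
    text = readme.lower()
    if "pip install" in text:
        return "python-package"
    if "npm install" in text:
        return "node-package"
    return "other"


def summarize_shard(shard_id: int, readmes: list, out_dir: Path) -> Path:
    """Count categories in one shard and write them to shard-<id>.json."""
    counts = Counter(categorize(r) for r in readmes)
    path = out_dir / f"shard-{shard_id:05d}.json"
    path.write_text(json.dumps(counts, sort_keys=True))
    return path


def reduce_shards(out_dir: Path) -> Counter:
    """Sum the per-shard counts into one set of category totals."""
    total = Counter()
    for path in sorted(out_dir.glob("shard-*.json")):
        total.update(json.loads(path.read_text()))
    return total


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # stands in for a shared filesystem or bucket mount
    summarize_shard(0, ["pip install foo", "npm install bar"], out_dir)
    summarize_shard(1, ["pip install baz", "just notes"], out_dir)
    totals = reduce_shards(out_dir)

print(dict(totals))
```

Writing one file per shard means a failed shard can be rerun by itself, and the reduce step only reads small JSON files.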

Hospital Price Reality Check

5,162 US hospital MRFs, 1.3M priced line items, 1,040 parallel CPUs in ~19 minutes.

Pull every hospital's machine-readable file, parse 5 different formats (CMS v3 JSON, tall CSV, wide CSV, XLSX, ZIP), and build a chargemaster comparison site for 361 standard codes.

Live demo · Source

ML, embeddings, and vision

These examples are about changing the machine under a Python function: CPUs for download and preprocessing, GPUs for inference, and custom images when the runtime actually matters.

GPU embeddings on A100s

50K Wikipedia articles across CPU and A100 stages.

Download text on CPU workers, embed with a custom CUDA image, write vector shards, and search locally.

Batch inference without serving

10M text rows scored as a batch job.

Load a Hugging Face model once per worker and score Parquet batches without standing up an endpoint.
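One common way to get "load once per worker" behavior is to cache the loader, so the expensive setup runs on the first call in a process and is reused afterwards. The sketch below uses a fake model so it runs anywhere; `FakeSentimentModel`, `get_model`, and `score_batch` are hypothetical names, and whether a given worker reuses its Python process between inputs is an assumption to verify against the Burla docs.

```python
from functools import lru_cache

LOAD_COUNT = 0  # only here to show how often the loader actually runs


class FakeSentimentModel:
    """Stand-in for an expensive model such as a Hugging Face pipeline."""

    def predict(self, texts):
        return [1.0 if "good" in t.lower() else 0.0 for t in texts]


@lru_cache(maxsize=1)
def get_model():
    """Build the model on first use, then reuse it within this process."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return FakeSentimentModel()


def score_batch(texts):
    """The function that would be mapped over batches; the model loads lazily."""
    return get_model().predict(texts)


batches = [["Good value", "Broke in a week"], ["good fit"], ["Too small"]]
scores = [score_batch(batch) for batch in batches]
print(scores, "loads:", LOAD_COUNT)
```

The same shape applies to a real model: replace the body of `get_model` with the actual loading call and keep `score_batch` unchanged.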

Production data jobs

The practical middle of the repo: common data work that usually becomes slow, fragile, or over-orchestrated when it leaves one laptop.

Millions of image resizes

Chunk S3 image keys, resize with Pillow, write outputs back to S3, and stream progress.

One Parquet file per worker

Compute QA stats across thousands of files without starting Spark for a file-parallel job.

Pandas apply in parallel

Keep the row-wise Python function and scale the partitioned dataset around it.

ETL without Airflow

Transform 10,000 gzipped JSON drops while protecting Postgres with `max_parallelism`.

Rate-limited API jobs

Run millions of requests while keeping provider limits explicit in chunking, sleeps, and concurrency.

Parallel web scraping

Scrape large static archives with retries, error rows, connection reuse, and a global cap.
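The rate-limit and scraping patterns share a shape: split the work into chunks, pause between calls inside a chunk, retry failures, and return an error row instead of raising so one bad input does not sink the job. The sketch below shows that shape with a fake endpoint; every name in it (`chunked`, `call_with_retries`, `process_chunk`, `fake_fetch`) is hypothetical, and the pause and backoff values are set to zero only so the example runs instantly.

```python
import time


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def call_with_retries(fetch, request, retries=3, backoff_s=0.0):
    """Return a result row, or an error row once retries are exhausted."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return {"request": request, "ok": True, "value": fetch(request)}
        except Exception as exc:  # broad on purpose: failures become data
            last_error = repr(exc)
            if attempt < retries:
                time.sleep(backoff_s * attempt)  # linear backoff between attempts
    return {"request": request, "ok": False, "error": last_error}


def process_chunk(chunk, fetch, pause_s=0.0):
    """Handle one chunk sequentially, pausing between requests."""
    rows = []
    for request in chunk:
        rows.append(call_with_retries(fetch, request))
        time.sleep(pause_s)
    return rows


def fake_fetch(request):
    """Hypothetical endpoint that rejects one specific input."""
    if request == 3:
        raise ValueError("bad request")
    return request * 10


request_batches = chunked([1, 2, 3, 4, 5], size=2)
rows = [row for chunk in request_batches for row in process_chunk(chunk, fake_fetch)]
failed = [row for row in rows if not row["ok"]]
print(len(rows), "rows,", len(failed), "failed")
```

When each chunk is one input to `remote_parallel_map`, `max_parallelism` bounds how many chunks run at once, so the number of in-flight requests is capped in the script itself.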

Native tools and simulations

Examples for workloads that do not fit neatly into dataframe systems: native binaries, geospatial stacks, and embarrassingly parallel simulation.

What Burla Is Showing Off Here

Capability	Where it shows up
Change hardware per call	CPU photo scoring in Airbnb, A100 embedding in Wikipedia/arXiv/The Met, CPU-only simulation in Kentucky Derby
Change runtime per function	CUDA image for embeddings, GDAL image for raster jobs, BWA/samtools image for genomics
Keep plain Python control flow	scripts call `remote_parallel_map` directly instead of rewriting into Spark, Ray, Airflow, or Kubernetes objects
Put concurrency in the code	API limits, Postgres protection, website politeness, and cluster quota control live next to the workload
Stream useful artifacts back	generated sites, Parquet shards, vector indexes, JSON outputs, and progress from `generator=True`

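```python
from burla import remote_parallel_map

# parse_one_file, embed_one_shard, call_one_endpoint and their inputs
# (files, shards, request_batches) are placeholders for your own code.

# CPU stage: 2 CPUs and 8 GB of RAM per call, results streamed back.
cpu_results = remote_parallel_map(
    parse_one_file,
    files,
    func_cpu=2,
    func_ram=8,
    max_parallelism=1000,
    generator=True,
)

# GPU stage: A100 workers running a custom CUDA image, at most 8 at once.
gpu_vectors = remote_parallel_map(
    embed_one_shard,
    shards,
    func_gpu="A100",
    image="my-cuda-worker:latest",
    max_parallelism=8,
)

# API stage: concurrency capped at 64 to respect provider limits.
api_rows = remote_parallel_map(
    call_one_endpoint,
    request_batches,
    max_parallelism=64,
    generator=True,
)
```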
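To show what consuming streamed results looks like without a cluster, here is a sketch built on a local stand-in. `local_parallel_map` is hypothetical: it imitates only the `generator=True` calling pattern with a thread pool, ignores hardware options, and is not Burla's implementation. `parse_one_file` and the file names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def local_parallel_map(function_, inputs, max_parallelism=4, generator=False):
    """Hypothetical local stand-in; not Burla's implementation."""
    def results():
        with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
            futures = [pool.submit(function_, x) for x in inputs]
            for future in as_completed(futures):
                yield future.result()  # completion order, not input order
    return results() if generator else list(results())


def parse_one_file(name):
    """Placeholder workload: pretend the row count is the name length."""
    return {"file": name, "rows": len(name)}


files = ["a.csv", "bb.csv", "ccc.csv"]
done = 0
seen = []
for result in local_parallel_map(parse_one_file, files, generator=True):
    done += 1
    seen.append(result["file"])
    print(f"{done}/{len(files)} finished: {result['file']}")
```

Because results arrive as they complete, the loop body is a natural place for progress reporting or incremental writes, and each result should carry its own identifier rather than rely on ordering.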

Links

Burla docs: https://burla.dev
Live examples gallery: https://burla-cloud.github.io/examples/
Burla GitHub: https://github.com/Burla-Cloud

