Burla-Cloud/examples

Burla Examples

Plain Python in. Remote hardware out. These examples are copyable scripts that scale onto CPUs and A100 GPUs, run in custom Docker images, and set explicit concurrency limits, all without turning the project into a distributed-systems rewrite.

21 example folders: from one-file fan-out to full pipelines
9 live demos: with published findings and artifacts
CPU, GPU, Docker: changed per function call
One Python API: remote_parallel_map

Live gallery · Burla docs · Pick a collection · What Burla shows off

Pick a collection

| Collection | Start here if you want to see... | Examples |
| --- | --- | --- |
| Data stories with live sites | finished, explorable outputs built from large public datasets | Airbnb, Kentucky Derby, Amazon Reviews, NYC Taxi, arXiv, The Met, World Photo Index, GitHub READMEs, Hospital Prices |
| ML, embeddings, and vision | model-heavy jobs where runtime and hardware choice matter | A100 embeddings, batch inference |
| Production data jobs | the scripts data teams actually need to make fast and reliable | image resize, Parquet, pandas, ETL, APIs, scraping |
| Native tools and simulations | binaries, geospatial dependencies, and massive independent compute | BWA-MEM, GDAL, Monte Carlo |

Data stories with live sites

These are the showpieces: real corpora, real scale, and static sites you can open before reading a single line of code.

Airbnb at continental scale

119 cities, 1.7M photos, 50.7M reviews.

CLIP scores images, A100s embed review shortlists, Claude validates visual finds, and bootstrap CIs test whether the weird stuff affects demand.

Live demo · Source
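
As a rough illustration of the bootstrap step, the sketch below resamples two groups of listings with replacement and reports a percentile confidence interval for the difference in mean demand. It is a generic percentile bootstrap using only the standard library; the `bootstrap_diff_ci` helper and the sample values are hypothetical and are not taken from the Airbnb pipeline.

```python
import random
import statistics


def bootstrap_diff_ci(group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        sample_a = rng.choices(group_a, k=len(group_a))
        sample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical reviews-per-month for listings with and without an unusual photo.
with_feature = [4.1, 3.8, 5.0, 4.6, 4.9, 5.3, 4.4, 4.7]
without_feature = [3.9, 4.0, 3.5, 4.2, 3.7, 4.1, 3.6, 3.8]

observed = statistics.fmean(with_feature) - statistics.fmean(without_feature)
low, high = bootstrap_diff_ci(with_feature, without_feature)
print(f"observed diff={observed:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

If the interval excludes zero, the difference is unlikely to be resampling noise alone. Each resample is independent, so the loop body is the kind of work that could be split across workers.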

Kentucky Derby prediction and audit

1T Monte Carlo sims in 18.3 minutes, then 2,000 audit permutations in 13.8 seconds.

The same Burla cluster runs the prediction and then stress-tests the model hard enough to publish where the method is fragile.

Live demo · Prediction source · Demo source

Amazon Review Distiller

571M reviews, 275GB JSONL, 500+ parallel CPUs.

Score every public Amazon review deterministically, keep tiny heaps per shard, and reduce them into searchable findings.

Live demo · Source
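
The "tiny heaps per shard" idea can be sketched as a bounded min-heap per shard followed by a merge. This is a minimal sketch under stated assumptions: `score_review` is a made-up deterministic score, the shards are toy lists, and the per-shard function is run locally here where the real job would presumably map it over shards with `remote_parallel_map`.

```python
import heapq


def score_review(text):
    """Deterministic stand-in score: the same text always gets the same number."""
    return text.count("!")


def top_k_for_shard(reviews, k):
    """Scan one shard and keep only its k best (score, review_id) pairs."""
    heap = []  # min-heap, so heap[0] is the weakest review currently kept
    for review_id, text in reviews:
        item = (score_review(text), review_id)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the weakest, keep the new one
    return heap


def reduce_heaps(heaps, k):
    """Merge the small per-shard heaps into one global top-k, best first."""
    return heapq.nlargest(k, (item for heap in heaps for item in heap))


shard_a = [("a1", "ok"), ("a2", "wow!!!"), ("a3", "nice!")]
shard_b = [("b1", "great!!"), ("b2", "meh"), ("b3", "amazing!!!!")]

per_shard = [top_k_for_shard(shard, k=2) for shard in (shard_a, shard_b)]
top_reviews = reduce_heaps(per_shard, k=2)
print(top_reviews)
```

Each shard returns at most `k` items regardless of its size, so the reduce step stays small even when the input is hundreds of gigabytes.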

NYC Ghost Neighborhoods

2.76B taxi and FHV trips in about 15 seconds.

Scan every monthly public trip file to find zones that faded, recovered, or became newly important after the pandemic.

Live demo · Source

Fossils of the arXiv

2.71M abstracts embedded and clustered.

Embed the full arXiv metadata corpus to find extinct topics, emerging clusters, and isolated papers.

Live demo · Source

The Met's Hidden Twins

192K public-domain artwork images.

Fetch Open Access museum images, embed them with CLIP, search with FAISS, and surface visual near-duplicates across centuries.

Live demo · Source

World Photo Index

9.49M geotagged Flickr photos, 967 workers, about 8 minutes.

Reverse-geocode public photos and build country-level signatures from user-written tags.

Live demo · Source

One Million GitHub READMEs

1.2M READMEs, 2.3B upstream file rows.

Shard deterministic summarizers, write per-shard JSON to shared storage, and reduce category stats without calling an LLM.

Live demo · Source
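
A minimal sketch of the shard, summarize, and reduce pattern described above, using only the standard library. The keyword rules, repository names, and helper names are hypothetical, and the per-shard JSON is returned as a string here rather than written to shared storage.

```python
import hashlib
import json
from collections import Counter


def shard_for(repo_name, n_shards):
    """Stable shard id: hashlib gives the same answer in every process."""
    digest = hashlib.sha256(repo_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def summarize_shard(readmes):
    """Rule-based (no LLM) summary of one shard, returned as a JSON string."""
    counts = Counter()
    for _name, text in readmes:
        lowered = text.lower()
        if "pip install" in lowered:
            counts["python"] += 1
        elif "npm install" in lowered:
            counts["javascript"] += 1
        else:
            counts["other"] += 1
    return json.dumps(counts, sort_keys=True)


def reduce_category_stats(shard_json):
    """Add up the per-shard JSON payloads into one category table."""
    total = Counter()
    for payload in shard_json:
        total.update(json.loads(payload))
    return dict(total)


readmes = [
    ("org/a", "Run pip install a"),
    ("org/b", "Run npm install b"),
    ("org/c", "See the wiki"),
    ("org/d", "pip install d, then import d"),
]
n_shards = 3
shards = {}
for name, text in readmes:
    shards.setdefault(shard_for(name, n_shards), []).append((name, text))

shard_outputs = [summarize_shard(rows) for rows in shards.values()]
category_stats = reduce_category_stats(shard_outputs)
print(category_stats)
```

Hashing with `hashlib` rather than the built-in `hash()` matters because Python randomizes string hashes per process, which would send the same README to different shards on different workers.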

Hospital Price Reality Check

5,162 US hospital MRFs, 1.3M priced line items, 1,040 parallel CPUs in ~19 minutes.

Pull every hospital's machine-readable file, parse 5 different formats (CMS v3 JSON, tall CSV, wide CSV, XLSX, ZIP), and build a chargemaster comparison site for 361 standard codes.

Live demo · Source

ML, embeddings, and vision

These examples are about changing the machine under a Python function: CPUs for download and preprocessing, GPUs for inference, and custom images when the runtime actually matters.

GPU embeddings on A100s

50K Wikipedia articles across CPU and A100 stages.

Download text on CPU workers, embed with a custom CUDA image, write vector shards, and search locally.

Batch inference without serving

10M text rows scored as a batch job.

Load a Hugging Face model once per worker and score Parquet batches without standing up an endpoint.
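
One common way to load a model once per worker is to cache the loader at module level, so every batch handled by the same process reuses the loaded object. The sketch below assumes the worker process is reused across inputs and swaps in a hypothetical `FakeSentimentModel` so it runs without downloading anything; a real job would load a Hugging Face model inside `get_model` instead.

```python
import functools

LOAD_COUNT = 0  # counts how many times the expensive load actually runs


class FakeSentimentModel:
    """Stand-in for a real model so the sketch runs without downloads."""

    def predict(self, texts):
        return [1.0 if "good" in text.lower() else 0.0 for text in texts]


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use; later calls in this process reuse it."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return FakeSentimentModel()


def score_batch(texts):
    """Function that would be mapped over batches; it never reloads the model."""
    model = get_model()
    return model.predict(texts)


batches = [["Good value", "Broke in a week"], ["good fit"], ["Not for me"]]
scores = [score_batch(batch) for batch in batches]
print(scores, "loads:", LOAD_COUNT)
```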

Production data jobs

The practical middle of the repo: common data work that usually becomes slow, fragile, or over-orchestrated when it leaves one laptop.

Image dataset resize

Chunk S3 image keys, resize with Pillow, write outputs back to S3, and stream progress.

Parquet fan-out

Compute QA stats across thousands of files without starting Spark for a file-parallel job.

Pandas apply parallel

Keep the row-wise Python function and scale the partitioned dataset around it.

Python ETL without Airflow

Transform 10,000 gzipped JSON drops while capping concurrent writers with max_parallelism so Postgres is not overwhelmed.

Rate-limited API requests

Run millions of requests while keeping provider limits explicit in chunking, sleeps, and concurrency.
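
To make "chunking, sleeps, and concurrency" concrete, here is a small sketch of how the three numbers relate: if a provider allows `limit_per_minute` requests and `max_parallelism` workers run at once, each worker can wait `60 * max_parallelism / limit_per_minute` seconds between its own requests. The limit, the helper names, and the `send` callable are assumptions for illustration only.

```python
import time


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def per_worker_delay(limit_per_minute, max_parallelism):
    """Seconds each worker waits between requests so the fleet stays under the limit."""
    return 60.0 * max_parallelism / limit_per_minute


def call_batch(batch, delay, send, sleep=time.sleep):
    """Send one batch sequentially, pausing `delay` seconds after each request."""
    rows = []
    for request in batch:
        rows.append(send(request))
        sleep(delay)
    return rows


# Hypothetical provider limit: 600 requests/minute shared by 4 concurrent workers.
delay = per_worker_delay(limit_per_minute=600, max_parallelism=4)
request_batches = chunked(list(range(10)), size=4)

waits = []  # record sleeps instead of really sleeping in this demo
rows = [
    call_batch(batch, delay, send=lambda r: {"id": r}, sleep=waits.append)
    for batch in request_batches
]
print(delay, request_batches, len(waits))
```

In a real run, `call_batch` would be the function handed to `remote_parallel_map`, with `max_parallelism` set to the same value used to compute the delay.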

Parallel web scraping

Scrape large static archives with retries, error rows, connection reuse, and a global cap.
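
The retry and error-row parts of this pattern can be sketched as a function that always returns a row and never raises, so one bad URL cannot fail the whole map. The `fetch` callable and the example URLs are hypothetical; connection reuse and the global cap are not shown here.

```python
import time


def scrape_one(url, fetch, max_attempts=3, backoff=0.5, sleep=time.sleep):
    """Fetch one URL with retries; always return a row, never raise."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            body = fetch(url)
            return {"url": url, "ok": True, "attempts": attempt, "bytes": len(body)}
        except OSError as exc:  # network-style failures; urllib's URLError is an OSError
            last_error = repr(exc)
            if attempt < max_attempts:
                sleep(backoff * 2 ** (attempt - 1))  # exponential backoff
    return {"url": url, "ok": False, "attempts": max_attempts, "error": last_error}


calls = {}


def flaky_fetch(url):
    """Fake fetcher: one URL always fails, one succeeds on the third try."""
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/dead"):
        raise ConnectionError("refused")
    if url.endswith("/flaky") and calls[url] < 3:
        raise TimeoutError("timed out")
    return b"<html>ok</html>"


urls = ["https://example.org/a", "https://example.org/flaky", "https://example.org/dead"]
rows = [scrape_one(u, flaky_fetch, sleep=lambda seconds: None) for u in urls]
for row in rows:
    print(row)
```

Failed URLs come back as ordinary rows with `ok` set to `False`, so they can be stored next to the successes and retried later.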

Native tools and simulations

Examples for workloads that do not fit neatly into dataframe systems: native binaries, geospatial stacks, and embarrassingly parallel simulation.

Bioinformatics alignment

Run BWA-MEM and samtools in a custom image with one paired-end FASTQ sample per worker.

GDAL raster processing

Compute NDVI, clip, or reproject one Sentinel tile per worker with geospatial dependencies ready.
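
For reference, NDVI is computed per pixel as (NIR - Red) / (NIR + Red). The sketch below applies that formula to small nested lists so it runs without GDAL; a real tile job would read the two bands as arrays with a geospatial library. The zero-denominator handling and the sample values are assumptions.

```python
def ndvi(red, nir):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: report 0 for empty pixels instead of dividing by zero
    return (nir - red) / total


def ndvi_tile(red_band, nir_band):
    """Apply NDVI pixel by pixel to two equally shaped 2-D bands."""
    return [
        [ndvi(r, n) for r, n in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red_band, nir_band)
    ]


# Hypothetical reflectance values for a 2x2 tile.
red_band = [[0.10, 0.30], [0.0, 0.25]]
nir_band = [[0.50, 0.30], [0.0, 0.75]]
tile = ndvi_tile(red_band, nir_band)
print(tile)
```

For non-negative reflectances the result falls between -1 and 1, with higher values indicating denser green vegetation.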

Monte Carlo simulation

Run independent simulations across thousands of workers and return tiny aggregate summaries.
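
A minimal sketch of the "independent simulations, tiny summaries" shape: each chunk gets its own seed, runs its trials, and returns two integers that are summed afterwards. The pi estimate is only a placeholder workload, and the list comprehension is a local stand-in for mapping `simulate_chunk` over the chunks remotely.

```python
import random


def simulate_chunk(args):
    """Run one independent chunk of trials and return a tiny summary."""
    seed, n_trials = args
    rng = random.Random(seed)  # per-chunk seed keeps each chunk distinct and reproducible
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point landed inside the quarter circle
            hits += 1
    return {"hits": hits, "trials": n_trials}


def combine(summaries):
    """Reduce the per-chunk summaries into one estimate of pi."""
    hits = sum(s["hits"] for s in summaries)
    trials = sum(s["trials"] for s in summaries)
    return 4.0 * hits / trials


chunks = [(seed, 20_000) for seed in range(8)]
summaries = [simulate_chunk(chunk) for chunk in chunks]  # local stand-in for the remote map
pi_estimate = combine(summaries)
print(pi_estimate)
```

Only the small dictionaries travel back from each chunk, so the cost of collecting results stays flat as the number of trials per chunk grows.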

What Burla Is Showing Off Here

| Capability | Where it shows up |
| --- | --- |
| Change hardware per call | CPU photo scoring in Airbnb, A100 embedding in Wikipedia/arXiv/The Met, CPU-only simulation in Kentucky Derby |
| Change runtime per function | CUDA image for embeddings, GDAL image for raster jobs, BWA/samtools image for genomics |
| Keep plain Python control flow | scripts call remote_parallel_map directly instead of rewriting into Spark, Ray, Airflow, or Kubernetes objects |
| Put concurrency in the code | API limits, Postgres protection, website politeness, and cluster quota control live next to the workload |
| Stream useful artifacts back | generated sites, Parquet shards, vector indexes, JSON outputs, and progress from generator=True |
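
```python
from burla import remote_parallel_map

# parse_one_file, embed_one_shard, call_one_endpoint and their input lists
# are placeholders for your own functions and data.

# CPU fan-out: per-call CPU and RAM sizing, a high concurrency cap,
# and results streamed back through a generator.
cpu_results = remote_parallel_map(
    parse_one_file,
    files,
    func_cpu=2,
    func_ram=8,
    max_parallelism=1000,
    generator=True,
)

# GPU stage: same API, different hardware and a custom image.
gpu_vectors = remote_parallel_map(
    embed_one_shard,
    shards,
    func_gpu="A100",
    image="my-cuda-worker:latest",
    max_parallelism=8,
)

# API calls: the concurrency cap sits next to the workload it protects.
api_rows = remote_parallel_map(
    call_one_endpoint,
    request_batches,
    max_parallelism=64,
    generator=True,
)
```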
