A low-latency streaming data pipeline that ingests live Bitcoin trade data from Binance, processes it with Apache Spark Structured Streaming, stores aggregated metrics in ClickHouse, and visualises them in Grafana — all running locally via Docker.

    Binance WebSocket ──► ingest_and_producer.py ──► Redpanda (Kafka)
                                                           │
                                               consumer.ipynb (PySpark)
                                                           │
                                                      ClickHouse
                                                           │
                                                        Grafana

| Layer | Technology | Role |
|---|---|---|
| Data Source | Binance Futures WebSocket | Live BTC/USDT aggregate trade stream |
| Message Broker | Redpanda (Kafka-compatible) | Durable, low-latency message queue |
| Stream Processing | Apache Spark Structured Streaming | Per-second windowed aggregations |
| Storage | ClickHouse | Columnar OLAP database for time-series metrics |
| Visualisation | Grafana | Real-time dashboards |
| Infrastructure | Docker Compose | One-command local environment |
- Connects to the Binance Futures WebSocket (`wss://fstream.binance.com/ws/btcusdt@aggTrade`)
- Transforms raw tick data: extracts `symbol`, `price`, `quantity`, `timestamp` (IST), and `side` (BUY/SELL)
- Implements backpressure via a bounded `asyncio.Queue` (max 10,000 items); when the queue is full, the oldest record is dropped to preserve low latency
- Publishes clean JSON messages to the `trades_btc` Redpanda topic
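
The transform and drop-oldest steps listed above can be sketched as follows. This is a minimal illustration, not the code in `ingest_and_producer.py`: the function names `transform` and `put_drop_oldest` are hypothetical, the ISO 8601 timestamp format is an assumption, and the field keys (`s`, `p`, `q`, `T`, `m`) follow the Binance aggregate trade payload.

```python
import asyncio
import json
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # India Standard Time, UTC+05:30
MAX_QUEUE_SIZE = 10_000  # bound stated in this README


def transform(raw: str) -> dict:
    """Map a raw aggTrade JSON message to the fields published downstream."""
    msg = json.loads(raw)
    return {
        "symbol": msg["s"],
        "price": float(msg["p"]),
        "quantity": float(msg["q"]),
        # "T" is the trade time in epoch milliseconds.
        # The ISO 8601 output format is an assumption.
        "timestamp": datetime.fromtimestamp(msg["T"] / 1000, tz=IST).isoformat(),
        # "m" is true when the buyer is the market maker, so the taker sold.
        "side": "SELL" if msg["m"] else "BUY",
    }


def put_drop_oldest(queue: asyncio.Queue, item: dict) -> bool:
    """Enqueue without blocking; evict the oldest item when the queue is full.

    Returns True if an older item was dropped to make room.
    """
    dropped = False
    if queue.full():
        queue.get_nowait()  # discard the oldest record
        dropped = True
    queue.put_nowait(item)
    return dropped
```

Because both queue calls are the non-blocking variants, the WebSocket reader never waits on a slow Kafka publisher; it loses the stalest tick instead.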
PySpark reads from Redpanda and computes 1-second tumbling window aggregations:
| Metric | Description |
|---|---|
| `vwap` | Volume-Weighted Average Price |
| `net_delta` | Net signed volume (BUY qty − SELL qty) |
| `buy_pressure_pct` | % of volume that was buying |
| `total_volume` | Total BTC traded in the window |
| `trade_count` | Number of individual trades |
| `avg_price_1_sec` | Simple average price |
| `signal` | Rule-based signal: STRONG BUY / STRONG SELL / ACCUMULATING / DISTRIBUTING / NEUTRAL |
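
To make the metric definitions concrete, here is a plain-Python sketch that computes them for the trades of a single window. It is an illustration only: the real pipeline computes these in Spark, and the `Trade` class and `window_metrics` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    price: float
    quantity: float
    side: str  # "BUY" or "SELL"


def window_metrics(trades: list[Trade]) -> dict:
    """Compute the per-window metrics for a non-empty list of trades."""
    if not trades:
        raise ValueError("a window needs at least one trade")
    total_volume = sum(t.quantity for t in trades)
    buy_volume = sum(t.quantity for t in trades if t.side == "BUY")
    sell_volume = total_volume - buy_volume
    return {
        # Price weighted by traded quantity.
        "vwap": sum(t.price * t.quantity for t in trades) / total_volume,
        "net_delta": buy_volume - sell_volume,
        "buy_pressure_pct": 100.0 * buy_volume / total_volume,
        "total_volume": total_volume,
        "trade_count": len(trades),
        # Unweighted mean of trade prices.
        "avg_price_1_sec": sum(t.price for t in trades) / len(trades),
    }
```

For example, one BUY of 1 BTC at 100 and one SELL of 3 BTC at 102 give a VWAP of 101.5 but a simple average of 101, because the larger trade pulls the VWAP toward its price.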
Signal logic:
avg_price > VWAP AND net_delta > 0 → STRONG BUY
avg_price < VWAP AND net_delta < 0 → STRONG SELL
avg_price < VWAP AND net_delta > 0 → ACCUMULATING (Potential Buy)
avg_price > VWAP AND net_delta < 0 → DISTRIBUTING (Potential Sell)
otherwise → NEUTRAL
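
The rules above translate directly into a small function. This is a sketch with a hypothetical name, `classify_signal`; it returns the short labels from the metrics table, which is an assumption about how the notebook stores them.

```python
def classify_signal(avg_price: float, vwap: float, net_delta: float) -> str:
    """Apply the rule-based signal logic to one window's metrics."""
    if avg_price > vwap and net_delta > 0:
        return "STRONG BUY"
    if avg_price < vwap and net_delta < 0:
        return "STRONG SELL"
    if avg_price < vwap and net_delta > 0:
        return "ACCUMULATING"  # potential buy
    if avg_price > vwap and net_delta < 0:
        return "DISTRIBUTING"  # potential sell
    # Covers avg_price == vwap or net_delta == 0.
    return "NEUTRAL"
```

Note that a window where the average price equals the VWAP, or where buy and sell volume cancel exactly, always falls through to NEUTRAL.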
Results are written to the pulse_metrics table in ClickHouse via the native ClickHouse-Spark connector.
Grafana connects to ClickHouse using the grafana-clickhouse-datasource plugin and renders real-time dashboards from the pulse_metrics table.
Add your dashboard screenshots to the `docs/` folder and reference them below.
- Docker & Docker Compose
- Python 3.10+
- Apache Spark 3.5 (with `SPARK_HOME` set)
- Java 11+

Start the stack:

    docker compose up -d

Services:
| Service | URL |
|---|---|
| Redpanda Console | http://localhost:8080 |
| ClickHouse HTTP | http://localhost:8123 |
| Grafana | http://localhost:3000 |
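
Once the containers are up, a quick way to confirm that the ports in the table above accept connections is a small TCP check. This is an optional sketch using only the standard library; `port_open` and the `SERVICES` mapping are hypothetical helpers, not part of the project.

```python
import socket

# Ports taken from the services table in this README.
SERVICES = {"Redpanda Console": 8080, "ClickHouse HTTP": 8123, "Grafana": 3000}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if port_open("localhost", port) else "down"
        print(f"{name:<18} localhost:{port} {status}")
```

An open port only shows that something is listening; it does not prove the service has finished initialising.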
Open the ClickHouse client or the Play UI at http://localhost:8123/play and run:

    CREATE TABLE IF NOT EXISTS default.pulse_metrics (
        window_start      DateTime,
        window_end        DateTime,
        vwap              Float64,
        net_delta         Float64,
        buy_pressure_pct  Float64,
        total_volume      Float64,
        trade_count       Int64,
        avg_price_1_sec   Float64,
        signal            String
    ) ENGINE = MergeTree()
    ORDER BY window_start;

Install the producer dependencies and start the producer:

    pip install websockets aiokafka
    python ingest_and_producer.py

Open and run all cells in `consumer.ipynb` (Jupyter or VS Code).
- Navigate to `http://localhost:3000` (default credentials: `admin / admin`)
- Add ClickHouse as a data source (host: `clickhouse`, port: `8123`)
- Import the dashboard JSON from `grafana/dashboards/` (see below)
ProjectCTE/
├── docker-compose.yml # Redpanda, ClickHouse, Grafana services
├── ingest_and_producer.py # Binance WebSocket → Redpanda producer
├── consumer.ipynb # PySpark Structured Streaming consumer
├── grafana/
│ └── dashboards/ # Exported Grafana dashboard JSON files
└── docs/ # Screenshots and diagrams
- Redpanda over Kafka: Kafka-compatible but single-binary, no ZooKeeper — ideal for local dev
- Bounded queue with drop-oldest backpressure: Ensures the producer never blocks on slow consumers; latency is preferred over completeness for a trading feed
- ClickHouse for aggregations: Columnar storage with sub-second query times on time-series data; native Spark connector avoids JDBC overhead
- 1-second windows with a 2-second watermark: The watermark tolerates events arriving up to 2 seconds late while keeping Grafana panels near real-time
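
The effect of a 2-second watermark on 1-second tumbling windows can be illustrated with a simplified, per-event model in plain Python. This is not how the notebook is written, and Spark advances its watermark once per micro-batch rather than per event; the names below are hypothetical.

```python
import math

WINDOW_SECONDS = 1.0   # tumbling window size
WATERMARK_DELAY = 2.0  # how far the watermark trails the latest event time


def window_start(event_time: float) -> float:
    """Start of the tumbling window that contains event_time."""
    return math.floor(event_time / WINDOW_SECONDS) * WINDOW_SECONDS


def split_late_events(event_times: list[float]) -> tuple[list[float], list[float]]:
    """Return (accepted, dropped) event times under the watermark rule."""
    max_seen = float("-inf")
    accepted, dropped = [], []
    for t in event_times:
        watermark = max_seen - WATERMARK_DELAY
        window_end = window_start(t) + WINDOW_SECONDS
        # A window that ended at or before the watermark is already final.
        if window_end <= watermark:
            dropped.append(t)
        else:
            accepted.append(t)
        max_seen = max(max_seen, t)
    return accepted, dropped
```

With events arriving in the order 10.2, 10.7, 13.5, 10.9, 12.1, the event at 10.9 is dropped: after 13.5 has been seen the watermark sits at 11.5, past the end of the [10, 11) window. The event at 12.1 is still accepted because its window ends at 13.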
| Package | Version |
|---|---|
| `websockets` | ≥ 12 |
| `aiokafka` | ≥ 0.10 |
| `pyspark` | 3.5.x |
| `clickhouse-spark-runtime` | 0.10.0 (Spark 3.5, Scala 2.12) |
| `spark-sql-kafka` | 3.5.0 |