satyaaman97/ProjectCTE

CryptoPulse — Real-Time BTC Trading Analytics Pipeline

A low-latency streaming data pipeline that ingests live Bitcoin trade data from Binance, processes it with Apache Spark Structured Streaming, stores aggregated metrics in ClickHouse, and visualises them in Grafana — all running locally via Docker.


Architecture

Binance WebSocket  ──►  ingest_and_producer.py  ──►  Redpanda (Kafka)
                                                           │
                                              consumer.ipynb (PySpark)
                                                           │
                                                      ClickHouse
                                                           │
                                                        Grafana
| Layer | Technology | Role |
| --- | --- | --- |
| Data Source | Binance Futures WebSocket | Live BTC/USDT aggregate trade stream |
| Message Broker | Redpanda (Kafka-compatible) | Durable, low-latency message queue |
| Stream Processing | Apache Spark Structured Streaming | Per-second windowed aggregations |
| Storage | ClickHouse | Columnar OLAP database for time-series metrics |
| Visualisation | Grafana | Real-time dashboards |
| Infrastructure | Docker Compose | One-command local environment |

Pipeline Details

Stage 1 — Ingest & Produce (ingest_and_producer.py)

  • Connects to the Binance Futures WebSocket (wss://fstream.binance.com/ws/btcusdt@aggTrade)
  • Transforms raw tick data: extracts symbol, price, quantity, timestamp (IST), and side (BUY/SELL)
  • Applies backpressure with a bounded asyncio.Queue (max 10,000 items); when the queue is full, the oldest record is dropped to preserve low latency
  • Publishes clean JSON messages to the trades_btc Redpanda topic
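
The transform step above can be sketched roughly as follows. This is a minimal illustration, not the code in ingest_and_producer.py: the function name `transform_agg_trade`, the output key names, and the ISO timestamp format are assumptions. The input keys (`s`, `p`, `q`, `T`, `m`) follow the Binance aggTrade payload, where `m` is true when the buyer was the market maker, so the aggressor was a seller.

```python
import json
from datetime import datetime, timedelta, timezone

# IST is UTC+05:30; a fixed offset avoids depending on a tz database.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def transform_agg_trade(raw: str) -> dict:
    """Map a raw aggTrade JSON payload to a flat trade record (hypothetical schema)."""
    msg = json.loads(raw)
    # "T" is the trade time in epoch milliseconds.
    ts = datetime.fromtimestamp(msg["T"] / 1000, tz=IST)
    return {
        "symbol": msg["s"],
        "price": float(msg["p"]),       # Binance sends prices as strings
        "quantity": float(msg["q"]),    # quantities are strings too
        "timestamp": ts.isoformat(timespec="milliseconds"),
        # m == True: the buyer was the maker, so the taker sold.
        "side": "SELL" if msg["m"] else "BUY",
    }
```

The returned dict can then be serialised with `json.dumps` before being published to the trades_btc topic.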

Stage 2 — Consume & Process (consumer.ipynb)

PySpark reads from Redpanda and computes 1-second tumbling window aggregations:

| Metric | Description |
| --- | --- |
| vwap | Volume-Weighted Average Price |
| net_delta | Net signed volume (BUY qty − SELL qty) |
| buy_pressure_pct | % of volume that was buying |
| total_volume | Total BTC traded in the window |
| trade_count | Number of individual trades |
| avg_price_1_sec | Simple average price |
| signal | Rule-based signal: STRONG BUY / STRONG SELL / ACCUMULATING / DISTRIBUTING / NEUTRAL |
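
To make the metric definitions concrete, here is a plain-Python sketch of the same arithmetic. It is not the PySpark code in consumer.ipynb; the function name `window_metrics` and the input keys (`ts` as epoch seconds, `price`, `quantity`, `side`) are assumptions made for illustration. Flooring the timestamp to the second stands in for the 1-second tumbling window.

```python
from collections import defaultdict


def window_metrics(trades):
    """Aggregate trades into 1-second tumbling windows (illustrative only)."""
    buckets = defaultdict(list)
    for t in trades:
        # int() floors a positive epoch timestamp to its 1-second window start.
        buckets[int(t["ts"])].append(t)

    rows = []
    for start in sorted(buckets):
        batch = buckets[start]
        total_volume = sum(t["quantity"] for t in batch)
        if total_volume == 0:
            continue  # nothing traded; avoids dividing by zero below
        buy_volume = sum(t["quantity"] for t in batch if t["side"] == "BUY")
        sell_volume = total_volume - buy_volume
        notional = sum(t["price"] * t["quantity"] for t in batch)
        rows.append({
            "window_start": start,
            "window_end": start + 1,
            "vwap": notional / total_volume,              # sum(p*q) / sum(q)
            "net_delta": buy_volume - sell_volume,        # BUY qty - SELL qty
            "buy_pressure_pct": 100.0 * buy_volume / total_volume,
            "total_volume": total_volume,
            "trade_count": len(batch),
            "avg_price_1_sec": sum(t["price"] for t in batch) / len(batch),
        })
    return rows
```

Note that vwap weights each price by its quantity while avg_price_1_sec does not, which is why the two can diverge within a window.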

Signal logic:

avg_price > VWAP AND net_delta > 0  →  STRONG BUY
avg_price < VWAP AND net_delta < 0  →  STRONG SELL
avg_price < VWAP AND net_delta > 0  →  ACCUMULATING (Potential Buy)
avg_price > VWAP AND net_delta < 0  →  DISTRIBUTING (Potential Sell)
otherwise                            →  NEUTRAL
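
The rules above translate directly into a small function. This is a sketch with a hypothetical name (`classify_signal`); it returns the short labels listed in the metrics table and falls through to NEUTRAL whenever avg_price equals VWAP or net_delta is zero.

```python
def classify_signal(avg_price: float, vwap: float, net_delta: float) -> str:
    """Apply the rule table: price position vs. VWAP combined with net flow."""
    if avg_price > vwap and net_delta > 0:
        return "STRONG BUY"
    if avg_price < vwap and net_delta < 0:
        return "STRONG SELL"
    if avg_price < vwap and net_delta > 0:
        return "ACCUMULATING"   # potential buy
    if avg_price > vwap and net_delta < 0:
        return "DISTRIBUTING"   # potential sell
    # Any equality (avg_price == vwap or net_delta == 0) lands here.
    return "NEUTRAL"
```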

Results are written to the pulse_metrics table in ClickHouse via the native ClickHouse-Spark connector.

Stage 3 — Visualise (Grafana)

Grafana connects to ClickHouse using the grafana-clickhouse-datasource plugin and renders real-time dashboards from the pulse_metrics table.


Grafana Dashboards

Add your dashboard screenshots to the docs/ folder and reference them below.


Quick Start

Prerequisites

  • Docker & Docker Compose
  • Python 3.10+
  • Apache Spark 3.5 (with SPARK_HOME set)
  • Java 11+

1. Start infrastructure

docker compose up -d

Services:

| Service | URL |
| --- | --- |
| Redpanda Console | http://localhost:8080 |
| ClickHouse HTTP | http://localhost:8123 |
| Grafana | http://localhost:3000 |

2. Create the ClickHouse table

Open the ClickHouse client or the Play UI at http://localhost:8123/play and run:

CREATE TABLE IF NOT EXISTS default.pulse_metrics (
    window_start DateTime,
    window_end   DateTime,
    vwap         Float64,
    net_delta    Float64,
    buy_pressure_pct Float64,
    total_volume Float64,
    trade_count  Int64,
    avg_price_1_sec Float64,
    signal       String
) ENGINE = MergeTree()
ORDER BY window_start;
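
To confirm the table exists without opening the Play UI, the ClickHouse HTTP interface can also be queried from Python. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes the default user with no password, which may not match your docker-compose.yml.

```python
import urllib.request

# HTTP endpoint from the services table above.
CLICKHOUSE_URL = "http://localhost:8123/"


def build_query_request(sql: str, url: str = CLICKHOUSE_URL) -> urllib.request.Request:
    """Build a POST request; ClickHouse reads the query from the request body."""
    return urllib.request.Request(url, data=sql.encode("utf-8"), method="POST")


def run_query(sql: str) -> str:
    """Send the query and return the raw text response (requires the stack to be up)."""
    with urllib.request.urlopen(build_query_request(sql), timeout=5) as resp:
        return resp.read().decode("utf-8")
```

With the containers running, `run_query("SELECT count() FROM default.pulse_metrics")` should return the current row count as text.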

3. Start the producer

pip install websockets aiokafka
python ingest_and_producer.py

4. Start the Spark consumer

Open and run all cells in consumer.ipynb (Jupyter or VS Code).

5. Open Grafana

  • Navigate to http://localhost:3000 (default credentials: admin / admin)
  • Add ClickHouse as a data source (host: clickhouse, port: 8123, protocol: HTTP)
  • Import the dashboard JSON from grafana/dashboards/ (see below)

Project Structure

ProjectCTE/
├── docker-compose.yml          # Redpanda, ClickHouse, Grafana services
├── ingest_and_producer.py      # Binance WebSocket → Redpanda producer
├── consumer.ipynb              # PySpark Structured Streaming consumer
├── grafana/
│   └── dashboards/             # Exported Grafana dashboard JSON files
└── docs/                       # Screenshots and diagrams

Key Design Decisions

  • Redpanda over Kafka: Kafka-compatible but single-binary, no ZooKeeper — ideal for local dev
  • Bounded queue with drop-oldest backpressure: Ensures the producer never blocks on slow consumers; latency is preferred over completeness for a trading feed
  • ClickHouse for aggregations: Columnar storage with sub-second query times on time-series data; native Spark connector avoids JDBC overhead
  • 1-second windows with a 2-second watermark: the watermark admits slightly late trades while keeping Grafana panels near real-time
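
The drop-oldest policy mentioned above can be sketched in a few lines. This is an illustration under assumptions, not the code in ingest_and_producer.py: the helper name `put_drop_oldest` is hypothetical, and the demo uses a queue of 3 items instead of 10,000 to keep it readable.

```python
import asyncio


def put_drop_oldest(queue: asyncio.Queue, item) -> bool:
    """Enqueue without blocking; evict the oldest item if the queue is full.

    Returns True when a record was dropped to make room.
    """
    dropped = False
    if queue.full():
        queue.get_nowait()   # discard the oldest record
        dropped = True
    # No await between the check and the put, so no other task can interleave.
    queue.put_nowait(item)
    return dropped


async def demo() -> list:
    queue = asyncio.Queue(maxsize=3)
    for i in range(5):
        put_drop_oldest(queue, i)
    # Drain what is left: only the newest three items survive.
    return [queue.get_nowait() for _ in range(queue.qsize())]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the WebSocket reader never awaits on a full queue, a stalled publisher costs stale records, not added latency.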

Dependencies

| Package | Version |
| --- | --- |
| websockets | ≥ 12 |
| aiokafka | ≥ 0.10 |
| pyspark | 3.5.x |
| clickhouse-spark-runtime | 0.10.0 (Spark 3.5, Scala 2.12) |
| spark-sql-kafka | 3.5.0 |
