172 changes: 69 additions & 103 deletions nasdaq/README.md
@@ -1,155 +1,121 @@
# Replaying the NASDAQ order book

This is an example project live-replaying the complete NASDAQ exchange orders from January 30 2020 with CedarDB.
For an overview of the dataset, take a look at [our example dataset docs](https://cedardb.com/docs/example_datasets/nasdaq/).
This example live-replays the complete NASDAQ order stream from January 30, 2020, with CedarDB. For dataset background, see [the NASDAQ example dataset docs](https://cedardb.com/docs/example_datasets/nasdaq/).

What's especially noteworthy here is that CedarDB is not only running the **transactional query workload**,
inserting thousands of events every 100 ms, but also the **complex analytical queries** which feed the various
views in the Grafana dashboard. It's an excellent illustration of the power of Hybrid Transactional/Analytical
Processing (HTAP).

This example consists of separate applications:
![Grafana](./grafana.png)

1. A parser written in Python that parses NASDAQ's proprietary ITCHv5 protocol into human-readable CSV files.
2. A C++ client connecting to CedarDB and live-replaying all orders.
3. A Grafana Dashboard displaying live analytics (pictured below).
The setup is fully dockerized. The demo stack contains:

1. `parser`: downloads the NASDAQ ITCH dump and converts it into CSV files.
2. `cedar`: runs CedarDB and stores the parsed data on a Docker volume.
3. `client`: creates the schema, loads reference and pre-market data, and replays the live market stream in 100 ms batches.
4. `grafana`: shows live analytics on top of the replay.
5. `aichat`: optional web UI for natural-language questions over the same database.

![Grafana](./grafana.png)
In comparison mode, the stack also starts PostgreSQL and replays the same workload into both databases.

In addition to Grafana, you can also issue queries yourself to get insight into the market state.
This guide will show you how to do both using `docker compose`.
## Getting started

Prerequisites:

## Getting started
1. Docker with Compose support.
2. A stable internet connection to pull the required Docker images and download the NASDAQ dataset on first run.

This guide assumes you already have a CedarDB Docker image, i.e., you have completed [this guide](https://cedardb.com/docs/getting_started/running_docker_image/) up to step two.
Optional:

### 1. Prepare the data
Execute the `prepare.sh` script:
```shell
./prepare.sh
```
It downloads the raw binary packet capture that NASDAQ provides, extracts it, and transforms it into CSV files.
This downloads about 3.3 GB and writes about 16 GB of CSV files.
1. A CedarDB license at `db-config/cedar/license.env`. You can sign up for a trial at https://console.cedardb.com/signup.

The license is needed to create the dedicated `grafana` database user and grant the required user permissions cleanly. It also enables database statistics in comparison mode.

If no license is present, `demo.sh` falls back to using the `postgres` admin user for Grafana access because the dedicated `grafana` user cannot be granted the required read permissions.
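
The fallback described above amounts to a file check. The sketch below is illustrative only: `grafana_db_user` is a hypothetical helper, not a function taken from `demo.sh`, and the script's actual logic may differ.

```shell
# Hypothetical helper (not part of demo.sh): choose the database user for
# Grafana depending on whether a license file exists.
grafana_db_user() {
  license_file="${1:-db-config/cedar/license.env}"
  if [ -f "$license_file" ]; then
    echo "grafana"   # dedicated user, only usable with a license
  else
    echo "postgres"  # fallback: admin user
  fi
}

# Report which user would be used from the current directory.
grafana_db_user
```

Running it from the repository root before `./demo.sh start` gives a quick indication of which Grafana user to expect.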

## Run the demo

You should now have a set of files in the data directory containing the stock exchange events:
Use `demo.sh` as the entrypoint for the stack:

```shell
du -h data/*.csv
./demo.sh start
```

```
5,3G data/cancellations.csv
181M data/cancellationsPreMarket.csv
337M data/executions.csv
2,7M data/executionsPreMarket.csv
7,5M data/marketMakers.csv
9,8G data/orders.csv
279M data/ordersPreMarket.csv
516K data/stocks.csv
```
This starts the normal stack in the background with `docker compose up -d --build`. On the first run, the parser container:

1. downloads the NASDAQ archive, about 3.3 GB compressed,
2. extracts it,
3. parses it into roughly 16 GB of CSV data,
4. stores everything in the Docker volume `data`.

Depending on your connection and machine, the initial download and parsing step can take around 10 to 15 minutes.

After the parser finishes, the client loads the schema and pre-market data, then begins the timed replay. The replay starts 10 minutes after market open, so the initial database state corresponds to 9:40 AM market time. If it has been running for 20 minutes, the database state represents 10:00 AM market time.
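
To relate wall-clock runtime to market time, the arithmetic above can be sketched as a small helper. `market_time` is a hypothetical name used for illustration, and the sketch assumes the replay advances in real time from 9:40 AM as described.

```shell
# Hypothetical helper: convert minutes of replay runtime into market time,
# assuming the replay begins at 9:40 AM and advances in real time.
market_time() {
  elapsed_min="$1"
  total_min=$((9 * 60 + 40 + elapsed_min))
  printf '%02d:%02d\n' $((total_min / 60)) $((total_min % 60))
}

market_time 0    # prints 09:40
market_time 20   # prints 10:00
```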

Useful lifecycle commands:

### 2. Run the application
```shell
docker compose build client
docker compose up
./demo.sh stop # Stop and remove all containers
./demo.sh clean # Stop and remove all containers, then remove Docker volumes, including the parsed dataset
./demo.sh pull # Pull the latest database images
```

While the client is running, it replays the live exchange data in 100 ms batches, treating the point in time the program was started as 9:30 AM, i.e., the exact instant the market opens.
In the first minute, the client catches up to the live transaction stream and starts inserting many events.
Afterward, you should get batches of a couple of thousand events per 100 ms.
So, if you run the client for 30 minutes, the database state will represent the state of the NASDAQ exchange 30 minutes after market open, i.e., 10:00 AM.
## Access the Dashboard

Grafana is exposed on http://localhost:3000.

You can stop the application via `CTRL+C` followed by `docker compose down`

### 3. Connect to Grafana
You can now browse to Grafana at http://localhost:3000, log in with username `admin` and password `admin`, and view the NASDAQ dashboard.
Authentication is disabled for the UI, so opening the page is enough. The dashboard is provisioned automatically.

![Grafana Instructions](./grafana_instructions.png)

## Access the AI Chat

The AI chat UI is exposed on http://localhost:8080.

### 4. Query the data
Alternatively, you can run your own queries. This requires installation of the `psql` PostgreSQL command line interface.
Note that, for the `Time:` values to appear, you need to either run `\timing on` from within the session or
have a `$HOME/.psqlrc` file containing at least the following line: `\timing on`.
By default, the container starts with:

```shell
PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
OPENROUTER_API_KEY={your_api_key_here}
LLM_MODEL=anthropic/claude-sonnet-4.5
```

Here are some example queries to get you started:
Set `OPENROUTER_API_KEY` before `./demo.sh start` if you want the chat UI to be functional.

```sql
postgres=#
select count(*) from orders;
count
----------
11019259
(1 row)
## Query the data

Time: 5.316 ms
```
The best way to run ad hoc SQL in this setup is through Grafana Explore.

Open http://localhost:3000/explore, select the provisioned PostgreSQL-compatible data source, and run SQL directly there.

Example queries:

```sql
postgres=#
select count(*) from orders;
select avg(price) from executions;
avg
-----------------------------
140.21785151844912886904428
(1 row)

Time: 15.681 ms
```

The following query calculates the new orders created per second averaged over the last 10 seconds.
The following query calculates new orders per second averaged over the last 10 seconds:

```sql
client=#
select count(*) / 10 as new -- averaged over 10 seconds
from orders o
where prevOrder is null -- == new order
and o.timestamp > (select max(e.timestamp) from executions e) - 10::bigint * 1000 * 1000 * 1000; -- averaged over 10 seconds
new
------
8285
(1 row)

Time: 32.514 ms
select count(*) / 10 as new
from orders o
where prevOrder is null
and o.timestamp > (
select max(e.timestamp) from executions e
) - 10::bigint * 1000 * 1000 * 1000;
```
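
The factor `10::bigint * 1000 * 1000 * 1000` in the query above is a 10-second window, assuming the timestamps are stored in nanoseconds (an assumption inferred from the multiplication, not stated in this README). The following shell arithmetic sketches why the `bigint` cast matters: the window does not fit in a 32-bit integer.

```shell
# Assumption: timestamps are in nanoseconds, as the query's factor suggests.
window_s=10
window_ns=$((window_s * 1000 * 1000 * 1000))
int32_max=2147483647

echo "window: ${window_ns} ns"
if [ "$window_ns" -gt "$int32_max" ]; then
  echo "exceeds the 32-bit integer range, hence the bigint cast in the query"
fi
```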

You can find some more complex queries in the `sql` subdirectory.
More analytical queries are available in [`sql/`](./sql).

## Load everything
If you prefer other database tools like psql or DBeaver, expose the database port (5432) to localhost in `compose.yml`/`comparison.compose.yml`.

Start the Docker image, mounting the `./data` directory containing the CSV data:
## Comparison mode

```shell
docker run --rm -p 5432:5432 -e CEDAR_PASSWORD=postgres -v ./data:/data --name cedardb cedardb
```
Comparison mode starts CedarDB and PostgreSQL with the same CPU and memory limits, then replays the same workload into both systems.

Connect to CedarDB via the `psql` CLI:
It requires `DB_CPU_LIMIT` and `DB_MEM_LIMIT`:

```shell
PGPASSWORD=postgres psql -h localhost -U postgres -d postgres
```

Using the `psql` client, run the DDL and then directly copy the CSV data:

```sql
\i client/schema.sql
copy stocks from '/data/stocks.csv' with(format text, delimiter ';', null '', header true);
copy marketmakers from '/data/marketMakers.csv' with(format text, delimiter ';', null '', header true);
copy orders from '/data/ordersPreMarket.csv' with(format text, delimiter ';', null '', header true);
copy orders from '/data/orders.csv' with(format text, delimiter ';', null '', header true);
copy executions from '/data/executionsPreMarket.csv' with(format text, delimiter ';', null '', header true);
copy executions from '/data/executions.csv' with(format text, delimiter ';', null '', header true);
copy cancellations from '/data/cancellationsPreMarket.csv' with(format text, delimiter ';', null '', header true);
copy cancellations from '/data/cancellations.csv' with(format text, delimiter ';', null '', header true);
DB_CPU_LIMIT=4 DB_MEM_LIMIT=8g ./demo.sh --comparison start
```
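
Because comparison mode requires both variables, a pre-flight check can catch a missing limit before any containers start. This is a minimal sketch: `require_limits` is a hypothetical helper and not part of `demo.sh`, which may already perform its own validation.

```shell
# Hypothetical pre-flight check (not part of demo.sh): fail early when a
# required limit is unset or empty.
require_limits() {
  for name in DB_CPU_LIMIT DB_MEM_LIMIT; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: $name must be set for comparison mode" >&2
      return 1
    fi
  done
}

# Usage: start comparison mode only when both limits are present.
# require_limits && ./demo.sh --comparison start
```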

Try running some ad hoc SQL queries.

Please note that this does not maintain the orderbook, which would be maintained by the client.