A low-latency order book and matching engine in C++17. Processes 7.2 million orders/sec at 138ns average latency.
This is the core data structure inside every stock exchange — it takes buy/sell orders, matches them by price-time priority, and outputs trades.
Orders In ──> Feed Handler ──> Matching Engine ──> Order Book ──> Trades Out
(stdin/binary) (strategy) (bids + asks) (async logger)
| Workload | Throughput | Avg Latency |
|---|---|---|
| 1.29M orders | 7.22M orders/sec | 138 ns |
| 2.57M orders | 4.85M/sec | 206 ns |
| 3.86M orders | 4.33M/sec | 230 ns |
| 5.14M orders | 3.56M/sec | 280 ns |
Binary order loading: ~170x faster than text parsing (15ms vs 2.4s for 1.29M orders).
All benchmarks on Apple Silicon, macOS, single-threaded matching.
- Order types: LIMIT, MARKET, CANCEL
- Price-time priority: best price first, FIFO within each price level
- Partial fills: large orders match across multiple price levels
- Pluggable matching strategies: compile-time strategy pattern (CRTP) — swap in pro-rata, etc.
- Async trade logging: producer-consumer pattern with background writer thread (mutex + condition_variable)
- Binary protocol: orders loaded as raw structs, zero parsing overhead
Requires a C++17 compiler with pthreads support.
make # builds to build/microexchange
make clean # removes build artifactsCompiler flags: -O3 -flto -pthread (-flto inlines Order accessors across translation units — measured ~2x gain).
# Type orders manually
./build/microexchange
# Pipe from a file
./build/microexchange < data/order.txtText input format (space-separated):
orderId amount type companyId quantity side
1 150.0 LIMIT 17 100 SELL
2 148.0 LIMIT 17 50 SELL
3 151.0 LIMIT 17 120 BUY
For benchmarking, orders can be loaded as raw binary (32-byte structs read directly into memory — no parsing).
python3 benchmark/generate_orders.py 1000000 > data/order_1m.txtmicroexchange/
├── Makefile
├── PROGRESS.md # detailed optimization journal with all benchmarks
├── SPEC.md # original design spec
├── src/
│ ├── main.cpp # entry point, stdin parsing, benchmarking
│ ├── Order.h / .cpp # Order struct (24B, packed, trivially copyable)
│ ├── Orderbook.h / .cpp # order book (sorted vector of PriceLevels)
│ ├── PriceTimeStrategy.h / .cpp # FIFO price-time matching
│ ├── MatchingStrategy.h # strategy interface
│ ├── MatchingEngine.h # routes orders, delegates to strategy
│ ├── Trade.h # trade output struct
│ ├── TradeLogger.h / .cpp # async file writer (background thread)
├── benchmark/
│ ├── generate_orders.py # random order generator
│ └── results/ # all benchmark runs preserved
└── data/ # test order files (text + binary)
class Order {
double amount; // 8B — limit price
int orderId; // 4B
int quantity; // 4B
uint16_t companyId; // 2B — symbol as integer ID (not string)
Ordertype type; // 1B — uint8_t enum
Side side; // 1B — uint8_t enum
}; // 20B + 4B padding = 24BFields ordered by alignment to eliminate internal padding. std::string replaced with uint16_t (24B → 2B). Enums use uint8_t storage (4B → 1B each).
Each side (bids/asks) is a sorted vector of PriceLevels. Each PriceLevel holds a contiguous vector<Order> with a head index — front removal is O(1) (increment index, no data movement). Best price is at back() so removing an empty level is pop_back() — also O(1).
Compile-time strategy pattern. PriceTimeStrategy walks from best price inward, matching FIFO within each level. Trades are emitted via an inlined sink — no heap allocation per order.
Background thread with mutex + condition_variable. The matching thread pushes trades into a shared deque; the logger thread wakes, swaps the queue in O(1), releases the lock, then writes to disk. The matching thread never blocks on I/O.
Started at 431K orders/sec with cout in the hot path. Reached 7.2M orders/sec through 9 measured iterations. Key wins:
| What worked | Why |
|---|---|
| Remove I/O from hot path | cout costs 1-5μs per call |
| Shrink struct (64→24B) | 2-3 orders per cache line instead of 1 |
| Contiguous storage (vector+head) | Cache-friendly, no scattered heap chunks |
| Reusable trade buffer | Eliminated per-order malloc |
| Binary protocol | Zero parsing — read() directly into struct |
| What didn't work | Why |
|---|---|
| map → sorted vector (level container) | ~30 active levels, lookup wasn't hot |
| CRTP (drop virtual dispatch) | Virtual call is ~2-5ns — negligible |
| Lock-free logger attempts | Mutex uncontended, already ~20ns |
Full details with all benchmark data in PROGRESS.md.
MIT