aimm

Configuration

Create a configuration file:

cp .env.example .env

Fill in and adjust configuration in .env as needed (e.g., ports, passwords).
Start databases using Docker Compose:

docker compose up -d

Python stuff

Create virtual environment (should be done only once):

python -m venv .venv

Activate virtual environment (needs to be done every new terminal session - python sucks):

source .venv/bin/activate

Install dependencies (whenever they changes). Also needed for editable install:

pip install -e .

Run scripts:

python -m path.to.file

Workflow

Training latency estimation models requires data. For that, there is a whole pipeline of scripts. The general idea is to reuse the former steps as much as possible and repeat them only when necessary (their correspondent code or the previous steps have changed).

Generally, the following scripts work for all databases (postgres, mongo, neo4j). This guide uses postgres as an example, but feel free to replace it with the database of your choice.

Data generation and population

Choose a schema (e.g., art and edbt) and a scale (a number >= 0). Use a small scale (e.g., 0) for testing. Then continue with larger scales for experiments. Generally, the data grows like 2^scale, so be careful.

Generate data:

python -m scripts.pipeline.generate_data art-0
python -m scripts.pipeline.generate_data edbt-0
python -m scripts.pipeline.generate_data tpch-0

Populate databases:

python -m scripts.pipeline.populate_db postgres/art-0
python -m scripts.pipeline.populate_db postgres/edbt-0
python -m scripts.pipeline.populate_db postgres/tpch-0

Query measurement and plan extraction

Run the following to generate queries, measure their latencies and extract their plans:

python -m scripts.pipeline.measure_queries postgres/art-0 --num-queries 1000 --num-runs 10
python -m scripts.pipeline.measure_queries postgres/edbt-0 --num-queries 100 --num-runs 10

Adjust the number of queries and runs as needed.

Dataset preparation

Choose a cool dataset name (e.g., train_dataset) and run the following:

python -m scripts.neural.create_dataset postgres/train_dataset art-0/measured-1000-10.jsonl

Multiple schema-scale combinations should be combined into a single dataset. So, if you also generated data for art-2 with 200 queries and 20 runs, you should add art-2/measured-200-20.jsonl to the command above.
A validation dataset (e.g., val_dataset) should reuse the feature extractor from the train dataset. Use:

python -m scripts.neural.create_dataset postgres/val_dataset --feature-extractor-dataset train_dataset edbt-1/measured-100-10.jsonl

Training and evaluation

Choose a fitting model name (e.g., fitting_model). Then, run the following:

python -m scripts.neural.train postgres/fitting_model train_dataset val_dataset

To eavaluate a model, choose a checkpoint (e.g., best, epoch/10, ...) and a test dataset (e.g., val_dataset) and run:

python -m scripts.neural.test postgres/fitting_model/best val_dataset

PostgreSQL flat-feature tree models

PostgreSQL also has an alternative latency-estimation path based on fixed-length features from EXPLAIN plans and scikit-learn style tree regressors. This path coexists with the QPP-Net workflow above and does not require EXPLAIN ANALYZE at inference time. Table and index names are excluded by default so models can be evaluated across schemas more easily; pass --include-schema-identifiers when you intentionally want schema-specific relation/index features.

Create flat train/validation datasets from measured PostgreSQL queries:

python -m scripts.flat.postgres.create_dataset postgres/tpch-2-flat-train tpch-2/measured-1000-2.jsonl --val-dataset postgres/tpch-2-flat-val --val-ratio 0.2 --split-seed 69

Train a random forest:

python -m scripts.flat.postgres.train postgres/tpch-2-flat-rf tpch-2-flat-train tpch-2-flat-val --model-type random_forest

Or train an XGBoost regressor:

python -m scripts.flat.postgres.train postgres/tpch-2-flat-xgb tpch-2-flat-train tpch-2-flat-val --model-type xgboost

Evaluate on a flat dataset:

python -m scripts.flat.postgres.test postgres/tpch-2-flat-rf edbt-2-flat

Predict latency for a new query using plain EXPLAIN only:

python -m scripts.flat.postgres.predict postgres/tpch-2-flat-rf postgres/tpch-2 "SELECT * FROM orders LIMIT 10"

MongoDB flat-feature tree models

MongoDB flat models use only global_stats and queryPlanner plans for prediction. executionStats and Python query timing are used only by scripts.pipeline.measure_queries when collecting training labels.

# Create train/validation flat datasets from measured MongoDB queries.
python -m scripts.flat.mongo.create_dataset mongo/tpch-2-flat-train tpch-2/measured-1000-5.jsonl --val-dataset mongo/tpch-2-flat-val --val-ratio 0.2 --split-seed 42 --refresh-queryplanner

# Create an EDBT test set with the training feature vocabulary.
python -m scripts.flat.mongo.create_dataset mongo/edbt-3-flat edbt-3/measured-1000-5.jsonl --feature-extractor-dataset tpch-2-flat-train --refresh-queryplanner

# Train and evaluate tree models.
python -m scripts.flat.mongo.train mongo/tpch-2-flat-rf tpch-2-flat-train tpch-2-flat-val --model-type random_forest
python -m scripts.flat.mongo.test mongo/tpch-2-flat-rf edbt-3-flat

# Predict a single query without executing it.
python -m scripts.flat.mongo.predict mongo/tpch-2-flat-rf mongo/tpch-2 '{"find":"orders","filter":{"o_orderkey":1}}'

Neo4j flat-feature tree models

Neo4j flat models use only plain EXPLAIN plan fields for prediction: estimated rows, operator structure, planner/runtime metadata, identifiers, and operator Details patterns. PROFILE and timed query execution are allowed only while collecting training labels; runtime counters such as rows, db hits, page-cache hits, and operator time are not used as flat-model features.

# Create train/validation flat datasets from measured Neo4j queries.
python -m scripts.flat.neo4j.create_dataset neo4j/tpch-2-flat-train tpch-2/measured-1000-5.jsonl --val-dataset neo4j/tpch-2-flat-val --val-ratio 0.2 --split-seed 69

# Create an EDBT test set with the training feature vocabulary.
python -m scripts.flat.neo4j.create_dataset neo4j/edbt-3-flat edbt-3/measured-1000-10.jsonl --feature-extractor-dataset tpch-2-flat-train

# Train and evaluate tree models.
python -m scripts.flat.neo4j.train neo4j/tpch-2-flat-rf tpch-2-flat-train tpch-2-flat-val --model-type random_forest
python -m scripts.flat.neo4j.test neo4j/tpch-2-flat-rf edbt-3-flat

# Predict a single query without executing it.
python -m scripts.flat.neo4j.predict neo4j/tpch-2-flat-rf neo4j/tpch-2 "MATCH (n) RETURN n LIMIT 10"

Experiments

Explain query plan:

python -m scripts.show_plan postgres/tpch-1 "UPDATE orders SET o_totalprice = 0 WHERE o_orderkey = 1"
python -m scripts.show_plan postgres/tpch-1 basic-0

python -m experiments check
python -m experiments test postgres -c data/checkpoints/tpch_postgres_best.pt

Development

Use this to find all available operators and then find the missing ones:

python -m latency_estimation.neo4j train --dry-run
python -m scripts.show_plan neo4j edbt --all-queries

Name		Name	Last commit message	Last commit date
Latest commit History 183 Commits
data		data
dynamic		dynamic
experiments		experiments
scripts		scripts
src		src
.editorconfig		.editorconfig
.env.sample		.env.sample
.gitignore		.gitignore
README.md		README.md
compose.prod.yaml		compose.prod.yaml
compose.yaml		compose.yaml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

aimm

Configuration

Python stuff

Workflow

Data generation and population

Query measurement and plan extraction

Dataset preparation

Training and evaluation

PostgreSQL flat-feature tree models

MongoDB flat-feature tree models

Neo4j flat-feature tree models

Experiments

Development

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

aimm

Configuration

Python stuff

Workflow

Data generation and population

Query measurement and plan extraction

Dataset preparation

Training and evaluation

PostgreSQL flat-feature tree models

MongoDB flat-feature tree models

Neo4j flat-feature tree models

Experiments

Development

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages