Labelled Subgraph Query Benchmark (LSQB)

📄 LSQB: A Large-Scale Subgraph Query Benchmark, GRADES-NDA'21 paper (presentation)

Overview

LSQB is a benchmark for subgraph matching with type information (vertex and edge types). Its primary goal is to test the query optimizer (join ordering, choosing between binary and n-ary joins) and the execution engine (join performance, support for worst-case optimal joins) of graph databases. Features found in more mature database systems and query languages, such as date/string operations, query composition, and complex aggregates/filters, are out of scope for this benchmark.

The benchmark consists of the following 9 queries:

Inspirations and references:

Getting started

Install dependencies

Install Docker on your machine.
(Optional) Change the location of Docker's data directory (instructions).
Install the required dependencies:
```
scripts/install-dependencies.sh
```
(Optional) Install "convenience packages" (e.g. vim, ag, etc.).
```
scripts/install-convenience-packages.sh
```
(Optional) Add the Umbra binaries as described in the umb/README.md file.
(Optional) "Warm up" the system using scripts/benchmark.sh, e.g. run all systems through the smallest example data set. This should fill Docker caches.

(Optional) Copy the data sets to the server. To decompress the archives and delete each one once it has been extracted successfully, run:

for f in social-network-sf*.tar.zst; do echo "${f}"; tar -I zstd -xvf "${f}" && rm "${f}"; done

Revise the benchmark settings, e.g. the number of threads for DuckDB.

Creating the input data

Data sets should be provided in two formats:

data/social-network-sf${SF}-projected-fk: projected foreign keys, the preferred format for most graph database management systems.
data/social-network-sf${SF}-merged-fk: merged foreign keys, the preferred format for most relational database management systems.

An example data set is provided with the substitution SF=example:

data/social-network-sfexample-projected-fk
data/social-network-sfexample-merged-fk
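
The naming scheme above can be checked before a run. The following is a minimal sketch: `lsqb_data_dirs` and `lsqb_check_data` are hypothetical helper names that are not part of the repository, and the paths are assumed to be relative to the repository root.

```shell
# Hypothetical helpers (not part of the LSQB repository).

# Print the two directory names expected for a given scale factor.
lsqb_data_dirs() {
    echo "data/social-network-sf$1-projected-fk"
    echo "data/social-network-sf$1-merged-fk"
}

# Report whether each expected directory exists.
# Returns non-zero if at least one directory is missing.
lsqb_check_data() {
    lsqb_missing=0
    for lsqb_dir in $(lsqb_data_dirs "$1"); do
        if [ -d "${lsqb_dir}" ]; then
            echo "found:   ${lsqb_dir}"
        else
            echo "missing: ${lsqb_dir}"
            lsqb_missing=1
        fi
    done
    return "${lsqb_missing}"
}

# Example: check the bundled example data set (run from the repository root).
lsqb_check_data example || echo "some data sets are missing"
```

Passing a numeric scale factor instead, e.g. `lsqb_check_data 0.3`, checks the directories for that scale factor.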

Pre-generated data sets are available in the SURF/CWI data repository.

To download the data sets, set the MAX_SF environment variable to the size of the maximum scale factor you want to use (at least 1) and run the download script.

For example:

export MAX_SF=3
scripts/download-projected-fk-data-sets.sh
scripts/download-merged-fk-data-sets.sh

For more information, see the download instructions and links.
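
Because the download scripts depend on `MAX_SF`, it can help to validate the variable first. The guard below is a sketch under the assumption that any number of at least 1 is acceptable, as stated above; `check_max_sf` is a hypothetical name, and the actual download calls are left commented out.

```shell
# Hypothetical guard (not part of the LSQB repository): verify that the
# argument is a number of at least 1.
check_max_sf() {
    case "$1" in
        ''|*[!0-9.]*|*.*.*|.) return 1 ;;   # empty, non-numeric, or malformed
    esac
    # awk compares numerically, so integer and decimal values both work
    awk -v sf="$1" 'BEGIN { exit !(sf >= 1) }'
}

if check_max_sf "${MAX_SF:-}"; then
    echo "MAX_SF=${MAX_SF} looks valid"
    # scripts/download-projected-fk-data-sets.sh
    # scripts/download-merged-fk-data-sets.sh
else
    echo "MAX_SF must be set to a number >= 1" >&2
fi
```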

Generating the data sets from scratch

You can generate your own data sets. Note that these may differ in size for different versions of the data generator – for publications, it's recommended to use the pre-generated data sets linked above.

Run the LDBC Spark Datagen using CSV outputs and raw mode (see its README for instructions).

Change to the Datagen output directory and record its location, which the scripts in the converter repository take as input:

cd out/csv/raw/composite_merge_foreign/
export DATAGEN_DATA_DIR=$(pwd)

Go to the data converter repository and run the conversion scripts:

./spark-concat.sh ${DATAGEN_DATA_DIR}
./load.sh ${DATAGEN_DATA_DIR} --no-header
./transform.sh
cat export/snb-export-only-ids-projected-fk.sql | ./duckdb ldbc.duckdb
cat export/snb-export-only-ids-merged-fk.sql    | ./duckdb ldbc.duckdb

Copy the generated files to the LSQB data directory (adjust the target path to the location of your checkout):

export SF=1
cp -r data/csv-only-ids-projected-fk/ ~/git/snb/lsqb/data/social-network-sf${SF}-projected-fk
cp -r data/csv-only-ids-merged-fk/    ~/git/snb/lsqb/data/social-network-sf${SF}-merged-fk

Running the benchmark

The following implementations are provided. The 🐳 symbol denotes that the implementation uses Docker.

Stable implementations:

umb: Umbra [SQL] 🐳
hyp: HyPer [SQL] 🐳
ddb: DuckDB [SQL] (embedded)
pos: PostgreSQL [SQL] 🐳
mys: MySQL [SQL] 🐳
neo: Neo4j Community Edition [Cypher] 🐳
red: RedisGraph [Cypher] 🐳
mem: Memgraph [Cypher] 🐳
vos: Virtuoso Open-Source Edition [SPARQL] 🐳
xgt: Trovares xGT [Cypher] 🐳
rdm: RapidMatch [.graph] (separate process)

WIP implementations:

kuz: Kùzu [Cypher] (embedded)

⚠️ Both Neo4j and Memgraph use the Bolt protocol for communicating with the client. To avoid clashing on port 7687, the Memgraph instance uses port 27687 for its Bolt communication. Note that the two systems use different Bolt versions so different client libraries are necessary.
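
The port assignment described above can be captured in a small lookup. This is only an illustrative sketch: `bolt_port` is a hypothetical helper, and the `localhost` host name assumes that both containers publish their ports on the local machine.

```shell
# Hypothetical helper (not part of the LSQB repository): map a system's
# directory name to the Bolt port described above.
bolt_port() {
    case "$1" in
        neo) echo 7687 ;;    # Neo4j keeps port 7687
        mem) echo 27687 ;;   # Memgraph is moved to 27687 to avoid the clash
        *)   echo "unknown Bolt system: $1" >&2; return 1 ;;
    esac
}

# Assumption: both containers publish their ports on localhost.
for system in neo mem; do
    echo "${system}: bolt://localhost:$(bolt_port "${system}")"
done
```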

Running the benchmark

The benchmark run consists of two key steps: loading the data and running the queries on the database.

Some systems need to be online before loading, while others need to be offline. To handle these differences in a unified way, we use three scripts for loading:

pre-load.sh: steps before loading the data (e.g. starting the DB for systems with online loaders)
load.sh: loads the data
post-load.sh: steps after loading the data (e.g. starting the DB for systems with offline loaders)

The init-and-load.sh script calls these three scripts (pre-load.sh, load.sh, and post-load.sh). Therefore, to run the benchmark and clean up after execution, use the following three scripts:

init-and-load.sh: initialize the database and load the data
run.sh: runs the benchmark
stop.sh: stops the database
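
To illustrate the three loading steps described above, the sketch below chains them in order and stops at the first failure. It is a hypothetical wrapper (`load_phases` is not a name from the repository); the actual init-and-load.sh of each system may contain additional steps.

```shell
# Hypothetical sketch of a loader wrapper; the actual init-and-load.sh in
# each system directory may differ.
load_phases() {
    # $1: system directory containing pre-load.sh, load.sh, and post-load.sh
    for phase in pre-load load post-load; do
        echo "== ${phase} =="
        ( cd "$1" && "./${phase}.sh" ) || {
            echo "${phase}.sh failed" >&2
            return 1
        }
    done
}
```

With the `SF` variable exported, `load_phases neo` would run the three phases in the `neo` directory.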

Example usage that loads scale factor 0.3 into Neo4j, runs the benchmark, and stops the database:

cd neo
export SF=0.3
./init-and-load.sh && ./run.sh && ./stop.sh

Example usage that runs multiple scale factors on DuckDB. Note that the SF environment variable needs to be exported.

cd ddb
export SF
for SF in 0.1 0.3 1; do
   ./init-and-load.sh && ./run.sh && ./stop.sh
done
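
In a chain joined with `&&`, stop.sh is skipped whenever loading or running fails, which can leave a database running. The following sketch shows one way to always perform the cleanup; `run_system` is a hypothetical wrapper, not part of the repository.

```shell
# Hypothetical wrapper (not part of the LSQB repository): run one system at
# one scale factor and call stop.sh even if loading or running fails.
run_system() {
    # $1: system directory (e.g. ddb), $2: scale factor (e.g. 0.1)
    (
        cd "$1" || exit 1
        export SF="$2"
        status=0
        ./init-and-load.sh && ./run.sh || status=$?
        ./stop.sh            # cleanup runs regardless of the outcome above
        exit "${status}"     # report the status of the load/run steps
    )
}
```

For example, `run_system ddb 0.1` runs DuckDB at scale factor 0.1 and returns the exit status of the load or run step, so a surrounding loop can decide whether to continue with the next scale factor.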

Cross-validation

Use the cross-validate.sh script. For example:

scripts/cross-validate.sh --system DuckDB --variant "10 threads" --scale_factor 1
scripts/cross-validate.sh --system Neo4j --scale_factor 0.1
scripts/cross-validate.sh --system PostgreSQL --scale_factor example

Philosophy

This benchmark was inspired by the LDBC SNB and the JOB benchmarks.
First and foremost, this benchmark is designed to be simple. In this spirit, we do not provide auditing guidelines: it is the user's responsibility to ensure that the benchmark setup is meaningful. We also do not provide a common Java/Python driver component, as the functionality required of the driver is very simple and users can implement it, ideally in less than an hour.
