DB v4 progress by ljeub-pometry · Pull Request #2385 · Pometry/Raphtory

ljeub-pometry · 2025-11-19T09:10:21Z

What changes were proposed in this pull request?

Progress towards the new version of the underlying storage

Why are the changes needed?

Does this PR introduce any user-facing change? If yes, is this documented?

How was this patch tested?

Are there any further changes required?

* Updating Parquet* structs to support manually passed export vids, eids, and layer_ids
* Allowed IDs to be passed to parquet serialization. Will allow us to pre-compute new IDs and turn them into RecordBatches
* Changed Parquet encoding to take GraphView instead of GraphStorage. Lock the graph to get parallel iterators over edges. We filter to respect GraphView filtering behaviour.
* Fixed node and edge parallel iterator creation
* Making the parquet encoders generic over the writer (now sink). We still use ArrowWriter<File> for now, but we will add support for loading into a graph
* Changed Parquet writer from ArrowWriter to generic sink for nodes, edges, and graph.
* Fixed possible ParquetDelEdge layer_id and layer_name issues by calling explode_layers() on each EdgeView.
* Fixed path error
* Made all the encode_* functions generic over the sink. A sink factory function can now be passed to these functions to determine how the sinks will be created. This will allow us to pass a sink which is a crossbeam_channel to send RecordBatches elsewhere.
* Adding Receiver side on materialize
* Hid new materialize behind IO feature and added a test to test the new materialize function
* Adding logic to ingest data using load_*_from_df functions
* Fixed deadlock. It had to do with LayerMappers being shared between edge_meta and node_meta.
* Removed unused variable bindings
* Fixed deadlock caused by DictMapper deep_clone not creating a new lock and reusing the old one.
* Working on making materialize stream RecordBatches properly instead of encoding everything and then ingesting everything (which would keep everything in memory at once).
* Changed std::thread::scope for a rayon::scope
* Added a test that times the old and new materialize functions
* Debugging materialize_using_recordbatches to see why it freezes/hangs when run on a big graph.
* Changed to make encoding using its own thread pool and ingestion use another thread pool.
* Switched materialize test to use graph paths and have disk backed storage so that it doesn't run out of memory
* Improved ingestion time on the "load_*_from_df" path by avoiding rescanning each segment for each row. Now using this path in the new materialize_using_recordbatches function.
* Switched assert_graph_equals to be parallel instead of multi-threaded as much as possible
* Rustfmt
* Use graph_equals instead of our custom GraphSummary. Update tests to separate out running materialize and parquet decoding. Test using SF10 for now.
* Set up environment variables to configure database properly before materialize test.
* Added Jemalloc
* Removed some unnecessary #[cfg(feature = "io")] gates. Use constants for parquet encoded column names.
* Added a test to time loading SF10 dumped parquet files using the df_loader functions
* Brought zips back in df_loaders/edges.rs for passing data such as vids, eids, flags, etc... to helper functions
* Removed flushing of graph before ingesting RecordBatch in df_loaders. General cleanup
* Removed unused imports, changed jemalloc to only be used on MacOS, and changed std::thread::scope for rayon::scope.
* Moving df_loaders out of io feature
* Move LOAD_POOL out of "io" feature
* Move ENCODE_POOL out of "io" feature
* Removing some #[cfg(feature = "io")] gates related to materialize_using_recordbatches
* Moved folder from serialise::parquet out of serialise folder (so out of "io" feature). Added serialise::parquet.rs file for everything that couldn't be moved out because it depended on dependencies from io feature.
* Fixed feature gating behind io and progress
* Moved SNB SF1, SF3, SF10 tests to their own separate file
* Added test for a filtered graph
* Renamed parquet folder to parquet_encoder
* Fixed encoders to pass relevant information in NodesT, EdgesC, and EdgesT. This includes Node GIDs and node types. Propagated changes to materialize_using_recordbatches. Filtered test passes.
* Lower channel size
* Fixes after merge
* Fixed test
* Fixed io feature gating
* Added layer creation before creating the temporal graph to ensure empty layers are created.
* Updated edges iteration in parquet encoders so that EIDs get resolved compacted for each layer. This saves a lot of disk space when saved to a directory.
* Clean up after filtered sf1 test
* No need to set the env vars for raphtory settings, they are imported and copied from the graph on disk.
* Added layer names to the parquet files to avoid filename collision when creating the arrow writer for parquet encoding.
* Cleaned up test_materialize.rs imports
* Switched old materialize for the new one to run tests
* Fix bug in resolve_node_and_meta_for_node_col where nodes were not being resolved, only looked up, which was causing failures related to metadata not being added for nodes that haven't already been resolved.
* Materialize edge deletions before edge c props (edge metadata) to fix materialization bug regarding persistent graphs
* Attempting to fix temporal properties not being serialized properly on persistent graphs
* Got rid of layer_n in parquet filenames. They were causing problems with ordering of parquet files when loading data. Instead, we now have atomically incrementing counters for file ids.
* Preserve property mappers in materialize
* Fix bugs in materialize. Switch rayon::scope for std::thread::scope to avoid a deadlock when the scope's num_threads is 1. Removed resizing of segments to the max eid to avoid empty segments when a graph is filtered. This was leading to empty graph errors.
* Remove sf3 paths in test_materialize_sf10.rs,
* Remove channel for producer in materialize
* Added flag to resolve nodes when materializing in load_node_props_from_df, and internalise otherwise
* First try at is_materializing flag in load_node_props_from_df
* Fixed test_materialize_sf10.rs feature gating on imports
* Added t_len for NodeStorageInner
* Clean up imports a lil
* Fix normalise_temporal_map not properly defining a stable deterministic ordering for events at the same timestamp for Prop::List (Vec and Array should be the same) and Prop::Map (ordering of elements should be stable, previously depended on HashMap iteration order which is undefined).
* Added edge.properties().temporal().iter_ids() and used it in the serialization of ParquetTEdge. Cleaned up materialize tests so that they don't try to call an "old" materialize anymore
* Clean up test file
* Get rid of old materialize
* Revert edge endpoint VID parquet column names to "rap_src_id" and "rap_dst_id". GIDS are now "rap_src_id" and "rap_dst_id". This is inconsistent with other column's naming scheme, but it is backwards compatible with already encoded parquet files.
* Changing parquet column names so they're consistent
* Update parquet files
---------
Co-authored-by: Lucas Jeub <lucas.jeub@pometry.com>
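The commit notes above describe making the encode_* functions generic over a sink, with a sink factory deciding how each sink is created so that batches can be sent over a channel instead of written to a file. The following is only a rough sketch of that idea, not the code from this PR: `Batch`, `BatchSink`, `VecSink`, `ChannelSink`, `encode_ids`, and `stream_ids` are hypothetical names, `Batch` stands in for an Arrow RecordBatch, and a bounded `std::sync::mpsc` channel stands in for crossbeam_channel so the example needs no external crates.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

/// Hypothetical stand-in for an Arrow RecordBatch: a named chunk of rows.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    table: String,
    rows: Vec<u64>,
}

/// Anything that can accept encoded batches (a file writer, a channel, ...).
trait BatchSink {
    fn write(&mut self, batch: Batch) -> Result<(), String>;
}

/// Sink that collects batches in memory.
#[derive(Default)]
struct VecSink(Vec<Batch>);

impl BatchSink for VecSink {
    fn write(&mut self, batch: Batch) -> Result<(), String> {
        self.0.push(batch);
        Ok(())
    }
}

/// Sink that forwards batches over a bounded channel
/// (std mpsc here; the PR text mentions crossbeam_channel).
struct ChannelSink(SyncSender<Batch>);

impl BatchSink for ChannelSink {
    fn write(&mut self, batch: Batch) -> Result<(), String> {
        self.0.send(batch).map_err(|e| e.to_string())
    }
}

/// Encode `ids` in chunks of `chunk` rows (must be non-zero).
/// The factory decides which kind of sink is created for the table.
fn encode_ids<S, F>(table: &str, ids: &[u64], chunk: usize, make_sink: F) -> Result<S, String>
where
    S: BatchSink,
    F: FnOnce(&str) -> S,
{
    let mut sink = make_sink(table);
    for rows in ids.chunks(chunk) {
        sink.write(Batch { table: table.to_string(), rows: rows.to_vec() })?;
    }
    Ok(sink)
}

/// Encode on a producer thread and consume batches as they arrive,
/// so only a few batches are in flight at any time.
fn stream_ids(ids: Vec<u64>, chunk: usize) -> Vec<Batch> {
    let (tx, rx) = sync_channel(2); // small bound applies back-pressure
    let producer = thread::spawn(move || {
        // Dropping the returned sink closes the channel and ends the receiver loop.
        encode_ids("nodes", &ids, chunk, |_| ChannelSink(tx)).map(|_| ())
    });
    let received: Vec<Batch> = rx.iter().collect();
    producer.join().expect("producer panicked").expect("encode failed");
    received
}

fn main() {
    let batches = stream_ids((0..5).collect(), 2);
    println!("received {} batches over the channel", batches.len());
    let in_memory: VecSink = encode_ids("edges", &[1, 2, 3], 3, |_| VecSink::default()).unwrap();
    println!("collected {} batch in memory", in_memory.0.len());
}
```

The bounded channel is the design point worth noting: the producer blocks once the buffer is full, which matches the stated goal of not keeping everything in memory at once.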

* make sure we cancel all tasks when the running server is dropped
* update optd
* add domain for NodeOp
* avoid unnecessarily re-filtering the domain when it is correct
* changes to better support Bn edge sized graphs fixing last compile error track count temporal edges
* remove accidental pyo3 import
* small import updates
* should call list_filtered in nodes
* const_value_in_domain should be the same as const_value by default
* possible improvements to UI for very large graphs
* still need to check that the edge exists in the layer, even if we have the edge ref already
* no optimisation in with_debug as they make debugging more annoying
* filtering by node is really bad for window so change this back
* fix materialize double-adding temporal edges
* for a persistent graph the update history and properties for exploded edges are not the same
* need to look at explode() for history on persistent graphs
* attempt at faster node_valid
* include updates from static graph in node_valid check for layers
* cleanup
* fix search feature
* make component test easier to debug on failure
* add our own union find implementation based on the old connected components algorithm (maybe can be optimised but at least it seems correct)
* clean up dependencies
* storage dependency is definitely used
* avoid compiling the vectors feature in benchmarks unless it is actually needed
* implement has_layer_inner directly
* optimise last for filtered additions
* add fast path for getting edge ref out again
* attempt to optimise SVM
* use optimised active check
* some inlines
* minimise the size of the MemEdgeRef while still including src/dst information
* add src/dst to MemEdgeEntry as well
* remove sorted_vector_map dependency and clean up
* no real reason to capture src/dst on the MemEdgeRef/MemEdgeEntry as these should be cheap to look up
* fix subgraph filtering
* chore: apply tidy-public auto-fixes
* more optimisations for windowing
* cleanup
* remove dbg
* when working with disk storage, in-memory references don't always exist
* minor cleanup
* bring num_nodes up to speed
* more fixes for layered graphs
* replace some kmerge with fast_merge
* more optimisations for windowing
* add check for filtering that excludes layer
* make list properties always return numpy arrays
---------
Co-authored-by: Ben Steer <b.a.steer@qmul.ac.uk>
Co-authored-by: Fabian Murariu <murariu.fabian@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
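One item above mentions adding an in-house union find implementation for connected components. The PR text does not show that code, so what follows is a generic textbook sketch (union by size with path halving) rather than the Raphtory implementation; `UnionFind` and `connected_components` are hypothetical names chosen for illustration.

```rust
/// Disjoint-set forest over the node indices 0..n.
struct UnionFind {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), size: vec![1; n] }
    }

    /// Find the representative of `x`, halving the path as we walk up.
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the sets containing `a` and `b`; returns false if already merged.
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        // Attach the smaller tree under the larger one to keep trees shallow.
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb);
        }
        self.parent[rb] = ra;
        self.size[ra] += self.size[rb];
        true
    }
}

/// Label every node with the representative of its connected component.
fn connected_components(n: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    let mut uf = UnionFind::new(n);
    for &(a, b) in edges {
        uf.union(a, b);
    }
    (0..n).map(|v| uf.find(v)).collect()
}

fn main() {
    let labels = connected_components(5, &[(0, 1), (1, 2), (3, 4)]);
    println!("component labels: {labels:?}");
}
```

Two nodes are in the same component exactly when their labels are equal; the label values themselves carry no meaning beyond that.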

* Add read only version of graph to allow python access
  Add explicit flush for graph
  Add fix for metadata in namespace
* tidy
* tidy
* Read only graph
* Test metadata
* chore: apply tidy-public auto-fixes
* Patch the cache
* read only index
* Adding tests for metadata segments
* added new tests
* chore: apply tidy-public auto-fixes
* Fixes for check metadata
* Function names
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* move deletion flag to edge id
* clean up more python tests that were node-order dependent
* clean up a lot of warnings
* add num_updates computation
* clean up repr implementations and add repr for OptionalEventTime
* try and fix the slow materialise on small graphs
* surface the error in materialize
* we need to track graph property updates as well
* more fixes for slow materialise
* add num_updates for graph props
* fix the doc tests
* update more repr tests
---------
Co-authored-by: Fabian Murariu <murariu.fabian@gmail.com>

* impl prop redaction
* add graph prop redaction, ref
* bool optimization, ref
* add test
* review suggestions
* fmt
* chore: apply tidy-public auto-fixes
* Make sure the or implementation for the graphql filters uses the underlying or in raphtory
* move permissions test to pometry-storage
* ref
* Fix temporal property redaction in materialization and expose exclude_*_properties/metadata on GraphViewOps
* ref
* expose get_graph_with_permissions on data
* PropertyRedaction ref
* fix schema redaction, simplify redaction API, and filter node rows() at source
* Push node temporal prop visibility filtering down to storage level in temp_prop_rows
* get prop_ids once
* restore #[graphql(desc)] annotations lost during db_v4 merge
  All graphql(desc) annotations from commit a17131e (Ben Steer) were dropped when resolving the merge conflict in raphtory-graphql/src/model/mod.rs during the db_v4 merge (9c715d0). This restores them exactly.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* move prop_ids collect outside loops in db_tests
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fmt
* logging independently
* ref
* intro PermissionError
* enforce graph access through typed permission methods and replace string errors with PermissionError
* get graphs with read permissions for copy and create subgraph
* add comment
* fmt
* merge from db_v4
* make get_graph private, fix permissions leak
* fix test
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lucas Jeub <lucas.jeub@pometry.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ownload (#2605)
* pass arguments through to pytest from tox and disable capture for debugging
* make fetch_file atomic
* make the atomic swap work in windows where it can fail if the file already exists and is in use
* chore: apply tidy-public auto-fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
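The notes above mention making fetch_file atomic and handling the case where the final swap fails because the target already exists and is in use. The PR text does not show how this was done (and the real helper may well live in the Python test tooling), so the following is only a sketch of the usual write-to-temp-then-rename pattern. `write_atomically` is a hypothetical name, and treating an already-present target as acceptable when the rename fails is an assumption made for illustration.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `target` so that readers never observe a partial file.
/// Hypothetical helper: the data goes to a sibling temp file first, then the
/// temp file is renamed over the target in a single step.
fn write_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name = target
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "target has no file name"))?
        .to_os_string();
    // Per-process suffix so concurrent writers do not share a temp file.
    tmp_name.push(format!(".{}.tmp", std::process::id()));
    let tmp = target.with_file_name(tmp_name);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents)?;
        file.sync_all()?; // flush to disk before the file becomes visible
    }

    match fs::rename(&tmp, target) {
        Ok(()) => Ok(()),
        // Assumption: if the swap fails but the target exists (for example it
        // is open elsewhere on Windows), keep the existing file.
        Err(_) if target.exists() => {
            fs::remove_file(&tmp)?;
            Ok(())
        }
        Err(err) => {
            let _ = fs::remove_file(&tmp);
            Err(err)
        }
    }
}

fn main() -> io::Result<()> {
    let target = std::env::temp_dir().join(format!("fetch_sketch_{}.bin", std::process::id()));
    write_atomically(&target, b"payload")?;
    println!("wrote {} bytes", fs::read(&target)?.len());
    fs::remove_file(&target)
}
```

Keeping the temp file in the same directory as the target matters, because a rename is generally only a single-step operation within one filesystem.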

* Add namespace creation/deletion to graphql
  Add TestSetup struct, setup_with_graphs, run_mutation, and assert_is_namespace_dir helpers to mod graphql_test in raphtory-graphql/src/lib.rs for use by namespace tests in later tasks.
  Previously called validate_path_for_insert which created a graph-folder skeleton + dirty marker on disk and leaked them, so the new namespace appeared as a MetaGraph. Now uses validate_path_for_namespace_create plus fs::create_dir_all.
  test: createNamespace creates nested directories
  test: createNamespace rejects path of existing graph
  test: createNamespace rejects path of existing namespace
  test: createNamespace rejects invalid paths
  test: tighten createNamespace existing-namespace error check
  test: add FakePolicy and setup_with_policy helpers
  test: createNamespace denied without parent write
  test: tighten FakePolicy docs and silence dead-code warning
  test: deleteNamespace removes empty namespace
  test: deleteNamespace removes namespace with children
  test: deleteNamespace rejects empty path
  test: deleteNamespace rejects non-existent path
  test: deleteNamespace denied when descendant graph unwritable
  test: deleteNamespace invalidates cached graphs
  test: clarify deleteNamespace denied-test comments
  feat(graphql): deleteNamespace infrastructure
  - auth.rs: add is_exclusive_write so deleteNamespace acquires the exclusive write lock alongside updateGraph
  - namespace.rs: expose current_dir() and relative_path() accessors used by Mut::delete_namespace and the data layer
* Mark paths dirty before cache eviction and in create_namespace
* chore: apply tidy-public auto-fixes
* Fix race condition in create_namespace
* Add tests asserting failure due to lack of permissions
* chore: apply tidy-public auto-fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Shivam <4599890+shivamka1@users.noreply.github.com>

* explicit permissions
* fix silent failure in on_graph_created
* skip auto-grant on graph create for admin users
* fmt
* on_graph_created: pass ctx instead of role, delegate identity extraction to policy
* grant_namespace_recursive: accept iterator directly, remove intermediate vec allocations
* fix CI
* simplify on_graph_created error propagation and fix FakePolicy for Option<NamespacePermission>
* remove role extraction from server layer; role logic belongs in auth policy
* expose Namespace/MetaGraph/NamespacedItem as pub; replace enumerate_namespace_descendants with get_namespace
* fix comment

* remove self dependency in raphtory
* trying to fix the features issues with test-utils
* move most tests outside of raphtory
* is search broken?
* move search tests
* fmt
* fix testutils in graphql
  # Conflicts:
  # raphtory-graphql/src/lib.rs
* fix graphql testutils
* rename raphtory-test-utils to raphtory-tests
* remove some useless cfg
* chore: apply tidy-public auto-fixes
* Remove duplicate members from workspace
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…2617)
* remove proto completely as we are no longer planning to support it
* use build-fast for python tests
* cargo.lock
* remove prost from workspace
* make sure tests compile
* clean up the rest of the merge issues

* cache wip
* attempt to make this work with dashmap but it is getting complicated
* implement dirty graph handling using pinning only
* fix python
* remove test of tti-based eviction as it is no longer a thing
* chore: apply tidy-public auto-fixes
* make the server startup work with port=0 and add fallback when the server is started without giving a specific port
* some cleanup
* add function to look up port on server
* make the cli port behaviour consistent
* enable panic on drop errors for tests
* make sure graph is dropped before replacing it
* update tests more tests so they work with arbitrary ports
* make the python tests work even if there is a raphtory server running
* Need to actually use the newly materialized graph and not the old graph when inserting with disk storage enabled
* moving a graph to the same name should be a no-op, not delete the graph
* remove dbg
* make sure the timeout test isn't flaky if the server happens to start quickly
* chore: apply tidy-public auto-fixes
* delete empty tests file
* make sure we don't return the unfiltered graph when only filtered access is available
* get list of nodes and edges by querying the graph on the server to avoid any potential ordering issues in the test
* explicitly control the drop order in test to make sure graph and data are dropped before the directory is deleted
* fix drop order problem in tests that would cause the directory to be deleted before the graph is dropped and enable panic on drop for graphql tests by default
* port=0 doesn't work for embedding server
* refactor replacement and invalidation logic to make it easier to understand and make sure vectorisation is working correctly
* cleanup
* wait for unique reference when dropping graph
* make sure the TempDir is dropped last
* add explicit scopes for temporary directories in tests to avoid errors due to directory being cleaned up too early
* enable panic-on-drop in tests
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
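Several items above concern starting the server with port=0 and adding a function to look up the port afterwards. The PR does not show the server code, so this is only a minimal std-only sketch of the general mechanism: binding to port 0 lets the operating system pick a free port, and `TcpListener::local_addr` reports which one was chosen. `RunningServer` and `start_server` are hypothetical names, not the Raphtory API.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Hypothetical handle to a server running on a background thread.
struct RunningServer {
    addr: SocketAddr,
    handle: thread::JoinHandle<()>,
}

impl RunningServer {
    /// The port actually bound, which differs from the request when port=0.
    fn port(&self) -> u16 {
        self.addr.port()
    }
}

/// Bind on localhost; passing 0 asks the OS for any free port.
fn start_server(port: u16) -> std::io::Result<RunningServer> {
    let listener = TcpListener::bind(("127.0.0.1", port))?;
    let addr = listener.local_addr()?; // resolved address, including the real port
    let handle = thread::spawn(move || {
        // Serve a single connection, then shut down.
        if let Ok((mut stream, _)) = listener.accept() {
            let _ = stream.write_all(b"ok");
        }
    });
    Ok(RunningServer { addr, handle })
}

fn main() -> std::io::Result<()> {
    let server = start_server(0)?;
    println!("listening on port {}", server.port());
    let mut reply = String::new();
    TcpStream::connect(server.addr)?.read_to_string(&mut reply)?;
    server.handle.join().expect("server thread panicked");
    println!("reply: {reply}");
    Ok(())
}
```

Tests written this way do not depend on a fixed port being free, which is presumably why the notes above mention making tests work with arbitrary ports and alongside an already running server.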

* Add row_size to WriteLockedPropMapper
* Add some docs
* Add some docs
* Setup print debugging
* Some cleanup
* Add test for estimated size on unified types
* Update row_size after unifying types
* Add tests for WriteLockedPropMapper
* Run fmt
---------
Co-authored-by: Lucas Jeub <lucas.jeub@pometry.com>

* Added update of .meta file on flush. It should update node and edge counts in this file
* Updated refresh of .meta file to not recompute anything other than node and edge counts. .meta file also refreshed when graph goes out of scope
* Move .meta file update logic from Raphtory crate into pometry-storage crate (db4-disk-storage). Move tests over as well.
* Move GRAPH_META_PATH constant from Raphtory to storage (both db4-storage and db4-disk-storage).

fabubaker and others added 30 commits October 22, 2025 15:28

Add materialize_to_graph_folder to PyGraphView

3c75641

Add comments

4f7cacb

Run fmt

017526e

Use crate instead of super

d371b35

Implement materialize_to_graph_folder for PyGraphView (#2355)

27cc1ad

changes to tests fixtures and update to rand 0.9.2

f1a0159

add quasi-loopy write_session

a44398b

changes to test_utils for vacuum support

1d02d7a

Fix errors due to bumping rand version & changing `build_graph_stra…

c4e840e

…t` (#2357)

some apis for vacuum

432acb5

make edge deletion loading sequential

9aa6046

Recreate metadata correctly from v4 graphs (#2358)

052bf88

add vacuum error

c13f506

Merge branch 'docbrownv4_merge' of github.com:Pometry/Raphtory into v…

e9d5dfe

…acuum

create empty segments for new layers so they aren't lost on write

5ab0549

rename ParquetProp to SerdeProp and move to raphtory-api

40eb155

rename _node_ methods as they are also used for edges

8ec18f4

mark edge segment dirty without triggering a write for metadata-only …

bcae349

…updates

ensure we make an empty segment when there is metadata that needs to …

aaff34a

…be preserved

update rust-version

35b1d49

support for Prop::Map

8259baf

support for Prop::Map refactor

a183369

initialise layers in materialize_at as doing it in resolve deadlocks …

3a208c0

…in the parquet loaders

much more useful location in panic message for graph/search assert fu…

02327f9

…nctions

add dirty flag support for nodes

183c67d

start triaging tests that are known to fail for now

e359088

don't overwrite an existing file

5d9619b

is_decodable needs to check for zip

7b10b98

various changes for ArrowRow and PropRef

247283c

make search feature not default for graphql/python

c93af86

arienandalibi and others added 30 commits May 5, 2026 14:15

Fix flaky tests (#2595)

5d986c0

* add edge id to test query to make sure the sorting works (test should not depend on the order of edges)
* add sorting for neighbour ids

chore: apply tidy-public auto-fixes

7f5c6bb

Merge conflict

2347220

fixed export test

aeb003b

chore: apply tidy-public auto-fixes

d100a7a

Test node types are served correctly by the server (#2599)

e169a5c

* Test node types are served correctly by the server
* Run fmt
* chore: apply tidy-public auto-fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

chore: apply tidy-public auto-fixes

408400f

chore: apply tidy-public auto-fixes

ea34482

Update SECURITY.md

3bb20c3

Update UI to 3fe855fe0 (v0.3.0) (#2603)

c4282fb

Update UI to v0.3.0
Co-authored-by: Fabian Murariu <2404621+fabianmurariu@users.noreply.github.com>

updates to deal with dependabot (#2619)

f3c716d

* updates to deal with dependabot
* rustls
* New tantivy api
---------
Co-authored-by: miratepuffin <b.a.steer@qmul.ac.uk>

fix toml

d265fb7

fix merge issue in tox.ini

2a97523

fix tox.ini

b5c6940

chore: apply tidy-public auto-fixes

df2b363

ljeub-pometry wants to merge 445 commits into master from db_v4


11 participants
