DB v4 progress #2385
Draft
ljeub-pometry wants to merge 445 commits into
…in the parquet loaders
* Updating Parquet* structs to support manually passed export vids, eids, and layer_ids
* Allowed IDs to be passed to parquet serialization. Will allow us to pre-compute new IDs and turn them into RecordBatches
* Changed Parquet encoding to take GraphView instead of GraphStorage. Lock the graph to get parallel iterators over edges. We filter to respect GraphView filtering behaviour.
* Fixed node and edge parallel iterator creation
* Making the parquet encoders generic over the writer (now sink). We still use ArrowWriter<File> for now, but we will add support for loading into a graph
* Changed Parquet writer from ArrowWriter to generic sink for nodes, edges, and graph.
* Fixed possible ParquetDelEdge layer_id and layer_name issues by calling explode_layers() on each EdgeView.
* Fixed path error
* Made all the encode_* functions generic over the sink. A sink factory function can now be passed to these functions to determine how the sinks will be created. This will allow us to pass a sink which is a crossbeam_channel to send RecordBatches elsewhere.
* Adding Receiver side on materialize
* Hid new materialize behind IO feature and added a test to test the new materialize function
* Adding logic to ingest data using load_*_from_df functions
* Fixed deadlock. It had to do with LayerMappers being shared between edge_meta and node_meta.
* Removed unused variable bindings
* Fixed deadlock caused by DictMapper deep_clone not creating a new lock and reusing the old one.
* Working on making materialize stream RecordBatches properly instead of encoding everything and then ingesting everything (which would keep everything in memory at once).
* Changed std::thread::scope for a rayon::scope
* Added a test that times the old and new materialize functions
* Debugging materialize_using_recordbatches to see why it freezes/hangs when run on a big graph.
* Changed to make encoding using its own thread pool and ingestion use another thread pool.
* Switched materialize test to use graph paths and have disk backed storage so that it doesn't run out of memory
* Improved ingestion time on the "load_*_from_df" path by avoiding rescanning each segment for each row. Now using this path in the new materialize_using_recordbatches function.
* Switched assert_graph_equals to be parallel instead of multi-threaded as much as possible
* Rustfmt
* Use graph_equals instead of our custom GraphSummary. Update tests to separate out running materialize and parquet decoding. Test using SF10 for now.
* Set up environment variables to configure database properly before materialize test.
* Added Jemalloc
* Removed some unnecessary #[cfg(feature = "io")] gates. Use constants for parquet encoded column names.
* Added a test to time loading SF10 dumped parquet files using the df_loader functions
* Brought zips back in df_loaders/edges.rs for passing data such as vids, eids, flags, etc... to helper functions
* Removed flushing of graph before ingesting RecordBatch in df_loaders. General cleanup
* Removed unused imports, changed jemalloc to only be used on MacOS, and changed std::thread::scope for rayon::scope.
* Moving df_loaders out of io feature
* Move LOAD_POOL out of "io" feature
* Move ENCODE_POOL out of "io" feature
* Removing some #[cfg(feature = "io")] gates related to materialize_using_recordbatches
* Moved folder from serialise::parquet out of serialise folder (so out of "io" feature). Added serialise::parquet.rs file for everything that couldn't be moved out because it depended on dependencies from io feature.
* Fixed feature gating behind io and progress
* Moved SNB SF1, SF3, SF10 tests to their own separate file
* Added test for a filtered graph
* Renamed parquet folder to parquet_encoder
* Fixed encoders to pass relevant information in NodesT, EdgesC, and EdgesT. This includes Node GIDs and node types. Propagated changes to materialize_using_recordbatches. Filtered test passes.
* Lower channel size
* Fixes after merge
* Fixed test
* Fixed io feature gating
* Added layer creation before creating the temporal graph to ensure empty layers are created.
* Updated edges iteration in parquet encoders so that EIDs get resolved compacted for each layer. This saves a lot of disk space when saved to a directory.
* Clean up after filtered sf1 test
* No need to set the env vars for raphtory settings, they are imported and copied from the graph on disk.
* Added layer names to the parquet files to avoid filename collision when creating the arrow writer for parquet encoding.
* Cleaned up test_materialize.rs imports
* Switched old materialize for the new one to run tests
* Fix bug in resolve_node_and_meta_for_node_col where nodes were not being resolved, only looked up, which was causing failures related to metadata not being added for nodes that haven't already been resolved.
* Materialize edge deletions before edge c props (edge metadata) to fix materialization bug regarding persistent graphs
* Attempting to fix temporal properties not being serialized properly on persistent graphs
* Got rid of layer_n in parquet filenames. They were causing problems with ordering of parquet files when loading data. Instead, we now have atomically incrementing counters for file ids.
* Preserve property mappers in materialize
* Fix bugs in materialize. Switch rayon::scope for std::thread::scope to avoid a deadlock when the scope's num_threads is 1. Removed resizing of segments to the max eid to avoid empty segments when a graph is filtered. This was leading to empty graph errors.
* Remove sf3 paths in test_materialize_sf10.rs
* Remove channel for producer in materialize
* Added flag to resolve nodes when materializing in load_node_props_from_df, and internalise otherwise
* First try at is_materializing flag in load_node_props_from_df
* Fixed test_materialize_sf10.rs feature gating on imports
* Added t_len for NodeStorageInner
* Clean up imports a lil
* Fix normalise_temporal_map not properly defining a stable deterministic ordering for events at the same timestamp for Prop::List (Vec and Array should be the same) and Prop::Map (ordering of elements should be stable, previously depended on HashMap iteration order which is undefined).
* Added edge.properties().temporal().iter_ids() and used it in the serialization of ParquetTEdge. Cleaned up materialize tests so that they don't try to call an "old" materialize anymore
* Clean up test file
* Get rid of old materialize
* Revert edge endpoint VID parquet column names to "rap_src_id" and "rap_dst_id". GIDS are now "rap_src_id" and "rap_dst_id". This is inconsistent with other column's naming scheme, but it is backwards compatible with already encoded parquet files.
* Changing parquet column names so they're consistent
* Update parquet files
---------
Co-authored-by: Lucas Jeub <lucas.jeub@pometry.com>
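The sink-factory idea described in the commit above (encoders generic over a sink, with a channel-backed sink so batches can be ingested while encoding is still running) can be sketched roughly as follows. This is only an illustration under stated assumptions: `Batch`, `BatchSink`, `ChannelSink`, `encode_rows` and `stream_sum` are hypothetical names and not the PR's API, a plain struct stands in for an Arrow `RecordBatch`, and a bounded `std::sync::mpsc::sync_channel` stands in for the `crossbeam_channel` mentioned in the commit.

```rust
use std::sync::mpsc::{self, SyncSender};
use std::thread;

/// Hypothetical stand-in for an Arrow `RecordBatch`: one chunk of encoded rows.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub file_id: usize,
    pub rows: Vec<u64>,
}

/// Anything that can accept encoded batches (a file writer, a channel, ...).
pub trait BatchSink {
    fn write(&mut self, batch: Batch) -> Result<(), String>;
}

/// A sink that forwards batches over a bounded channel instead of writing a file.
pub struct ChannelSink(pub SyncSender<Batch>);

impl BatchSink for ChannelSink {
    fn write(&mut self, batch: Batch) -> Result<(), String> {
        self.0.send(batch).map_err(|e| e.to_string())
    }
}

/// Encode `rows` in chunks, asking `make_sink` for a fresh sink per chunk.
/// `chunk_size` must be greater than zero (`chunks` panics on 0).
pub fn encode_rows<S, F>(rows: &[u64], chunk_size: usize, mut make_sink: F) -> Result<(), String>
where
    S: BatchSink,
    F: FnMut(usize) -> S,
{
    for (file_id, chunk) in rows.chunks(chunk_size).enumerate() {
        let mut sink = make_sink(file_id);
        sink.write(Batch { file_id, rows: chunk.to_vec() })?;
    }
    Ok(())
}

/// Encode on a producer thread and consume on the caller's thread.
/// The channel holds at most two batches, so the producer blocks rather than
/// buffering the whole data set in memory.
pub fn stream_sum(rows: Vec<u64>, chunk_size: usize) -> Result<u64, String> {
    let (tx, rx) = mpsc::sync_channel::<Batch>(2);
    let producer = thread::spawn(move || {
        encode_rows(&rows, chunk_size, |_file_id| ChannelSink(tx.clone()))
    });
    // The loop ends once the producer finishes and every sender is dropped.
    let total: u64 = rx.iter().map(|b| b.rows.iter().sum::<u64>()).sum();
    producer
        .join()
        .map_err(|_| "encoder thread panicked".to_string())??;
    Ok(total)
}
```

For example, `stream_sum((1..=10).collect(), 3)` encodes four batches and sums them on the receiving side. Swapping `ChannelSink` for a file-backed sink would only change the factory closure, which is the point of making the encoders generic.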
* add edge id to test query to make sure the sorting works (test should not depend on the order of edges) * add sorting for neighbour ids
* make sure we cancel all tasks when the running server is dropped
* update optd
* add domain for NodeOp
* avoid unnecessarily re-filtering the domain when it is correct
* changes to better support Bn edge sized graphs fixing last compile error track count temporal edges
* remove accidental pyo3 import
* small import updates
* should call list_filtered in nodes
* const_value_in_domain should be the same as const_value by default
* possible improvements to UI for very large graphs
* still need to check that the edge exists in the layer, even if we have the edge ref already
* no optimisation in with_debug as they make debugging more annoying
* filtering by node is really bad for window so change this back
* fix materialize double-adding temporal edges
* for a persistent graph the update history and properties for exploded edges are not the same
* need to look at explode() for history on persistent graphs
* attempt at faster node_valid
* include updates from static graph in node_valid check for layers
* cleanup
* fix search feature
* make component test easier to debug on failure
* add our own union find implementation based on the old connected components algorithm (maybe can be optimised but at least it seems correct)
* clean up dependencies
* storage dependency is definitely used
* avoid compiling the vectors feature in benchmarks unless it is actually needed
* implement has_layer_inner directly
* optimise last for filtered additions
* add fast path for getting edge ref out again
* attempt to optimise SVM
* use optimised active check
* some inlines
* minimise the size of the MemEdgeRef while still including src/dst information
* add src/dst to MemEdgeEntry as well
* remove sorted_vector_map dependency and clean up
* no real reason to capture src/dst on the MemEdgeRef/MemEdgeEntry as these should be cheap to look up
* fix subgraph filtering
* chore: apply tidy-public auto-fixes
* more optimisations for windowing
* cleanup
* remove dbg
* when working with disk storage, in-memory references don't always exist
* minor cleanup
* bring num_nodes up to speed
* more fixes for layered graphs
* replace some kmerge with fast_merge
* more optimisations for windowing
* add check for filtering that excludes layer
* make list properties always return numpy arrays
---------
Co-authored-by: Ben Steer <b.a.steer@qmul.ac.uk>
Co-authored-by: Fabian Murariu <murariu.fabian@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
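The commit above mentions a custom union-find used for connected components. The PR's implementation is not shown here, so the following is only a textbook sketch of the technique (path halving plus union by size); `UnionFind` and `count_components` are illustrative names, and node ids are assumed to be dense indices in `0..n`.

```rust
/// Disjoint-set forest with path halving and union by size.
pub struct UnionFind {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl UnionFind {
    /// Start with `n` singleton sets, one per node index.
    pub fn new(n: usize) -> Self {
        Self {
            parent: (0..n).collect(),
            size: vec![1; n],
        }
    }

    /// Return the representative of the set containing `x`.
    pub fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            // Path halving: point x at its grandparent, then step up.
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the sets of `a` and `b`; returns false if they were already merged.
    pub fn union(&mut self, a: usize, b: usize) -> bool {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        // Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb);
        }
        self.parent[rb] = ra;
        let merged = self.size[ra] + self.size[rb];
        self.size[ra] = merged;
        true
    }
}

/// Count connected components of an undirected graph given as an edge list.
pub fn count_components(n: usize, edges: &[(usize, usize)]) -> usize {
    let mut uf = UnionFind::new(n);
    let mut components = n;
    for &(u, v) in edges {
        if uf.union(u, v) {
            components -= 1;
        }
    }
    components
}
```

Each successful `union` removes exactly one component, so the count never needs a second pass over the nodes.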
* Test node types are served correctly by the server
* Run fmt
* chore: apply tidy-public auto-fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Add read only version of graph to allow python access
  Add explicit flush for graph
  Add fix for metadata in namespace
* tidy
* tidy
* Read only graph
* Test metadata
* chore: apply tidy-public auto-fixes
* Patch the cache
* read only index
* Adding tests for metadata segments
* added new tests
* chore: apply tidy-public auto-fixes
* Fixes for check metadata
* Function names
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* move deletion flag to edge id
* clean up more python tests that were node-order dependent
* clean up a lot of warnings
* add num_updates computation
* clean up repr implementations and add repr for OptionalEventTime
* try and fix the slow materialise on small graphs
* surface the error in materialize
* we need to track graph property updates as well
* more fixes for slow materialise
* add num_updates for graph props
* fix the doc tests
* update more repr tests
---------
Co-authored-by: Fabian Murariu <murariu.fabian@gmail.com>
* impl prop redaction
* add graph prop redaction, ref
* bool optimization, ref
* add test
* review suggestions
* fmt
* chore: apply tidy-public auto-fixes
* Make sure the or implementation for the graphql filters uses the underlying or in raphtory
* move permissions test to pometry-storage
* ref
* Fix temporal property redaction in materialization and expose exclude_*_properties/metadata on GraphViewOps
* ref
* expose get_graph_with_permissions on data
* PropertyRedaction ref
* fix schema redaction, simplify redaction API, and filter node rows() at source
* Push node temporal prop visibility filtering down to storage level in temp_prop_rows
* get prop_ids once
* restore #[graphql(desc)] annotations lost during db_v4 merge
  All graphql(desc) annotations from commit a17131e (Ben Steer) were dropped when resolving the merge conflict in raphtory-graphql/src/model/mod.rs during the db_v4 merge (9c715d0). This restores them exactly.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* move prop_ids collect outside loops in db_tests
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fmt
* logging independently
* ref
* intro PermissionError
* enforce graph access through typed permission methods and replace string errors with PermissionError
* get graphs with read permissions for copy and create subgraph
* add comment
* fmt
* merge from db_v4
* make get_graph private, fix permissions leak
* fix test
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lucas Jeub <lucas.jeub@pometry.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ownload (#2605)
* pass arguments through to pytest from tox and disable capture for debugging
* make fetch_file atomic
* make the atomic swap work in windows where it can fail if the file already exists and is in use
* chore: apply tidy-public auto-fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
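The "atomic fetch" pattern named in the commit above usually means writing to a temporary file next to the destination and then renaming it into place, so readers never observe a half-written file. The PR's `fetch_file` is not shown here (and may well live in the Python test tooling), so the sketch below is only a Rust illustration of the same technique. `write_atomic` is a hypothetical name, and treating an already existing destination as success when the rename fails is an assumed policy, not something the source confirms.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `dest` via a temporary sibling file plus rename.
pub fn write_atomic(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Keep the temp file in the same directory so the rename stays on one filesystem.
    let mut tmp_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "dest has no file name"))?
        .to_os_string();
    tmp_name.push(format!(".{}.part", std::process::id()));
    let tmp = dest.with_file_name(tmp_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush data to disk before the swap
    drop(file); // close the handle before renaming

    match fs::rename(&tmp, dest) {
        Ok(()) => Ok(()),
        Err(err) => {
            // The rename may fail, for example on Windows while `dest` is open
            // elsewhere. Assumed policy: if a destination file is already
            // present, keep it and discard our copy; otherwise report the error.
            let _ = fs::remove_file(&tmp);
            if dest.exists() {
                Ok(())
            } else {
                Err(err)
            }
        }
    }
}
```

Whether keeping the existing file is acceptable depends on whether concurrent downloaders always fetch identical content; if not, a retry loop around the rename would be the safer choice.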
* Add namespace creation/deletion to graphql
  Add TestSetup struct, setup_with_graphs, run_mutation, and assert_is_namespace_dir helpers to mod graphql_test in raphtory-graphql/src/lib.rs for use by namespace tests in later tasks.
  Previously called validate_path_for_insert which created a graph-folder skeleton + dirty marker on disk and leaked them, so the new namespace appeared as a MetaGraph. Now uses validate_path_for_namespace_create plus fs::create_dir_all.
  test: createNamespace creates nested directories
  test: createNamespace rejects path of existing graph
  test: createNamespace rejects path of existing namespace
  test: createNamespace rejects invalid paths
  test: tighten createNamespace existing-namespace error check
  test: add FakePolicy and setup_with_policy helpers
  test: createNamespace denied without parent write
  test: tighten FakePolicy docs and silence dead-code warning
  test: deleteNamespace removes empty namespace
  test: deleteNamespace removes namespace with children
  test: deleteNamespace rejects empty path
  test: deleteNamespace rejects non-existent path
  test: deleteNamespace denied when descendant graph unwritable
  test: deleteNamespace invalidates cached graphs
  test: clarify deleteNamespace denied-test comments
  feat(graphql): deleteNamespace infrastructure
  - auth.rs: add is_exclusive_write so deleteNamespace acquires the exclusive write lock alongside updateGraph
  - namespace.rs: expose current_dir() and relative_path() accessors used by Mut::delete_namespace and the data layer
* Mark paths dirty before cache eviction and in create_namespace
* chore: apply tidy-public auto-fixes
* Fix race condition in create_namespace
* Add tests asserting failure due to lack of permissions
* chore: apply tidy-public auto-fixes
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Shivam <4599890+shivamka1@users.noreply.github.com>
* explicit permissions
* fix silent failure in on_graph_created
* skip auto-grant on graph create for admin users
* fmt
* on_graph_created: pass ctx instead of role, delegate identity extraction to policy
* grant_namespace_recursive: accept iterator directly, remove intermediate vec allocations
* fix CI
* simplify on_graph_created error propagation and fix FakePolicy for Option<NamespacePermission>
* remove role extraction from server layer; role logic belongs in auth policy
* expose Namespace/MetaGraph/NamespacedItem as pub; replace enumerate_namespace_descendants with get_namespace
* fix comment
Update UI to v0.3.0
Co-authored-by: Fabian Murariu <2404621+fabianmurariu@users.noreply.github.com>
* remove self dependency in raphtory
* trying to fix the features issues with test-utils
* move most tests outside of raphtory
* is search broken?
* move search tests
* fmt
* fix testutils in graphql
  # Conflicts:
  # raphtory-graphql/src/lib.rs
* fix graphql testutils
* rename raphtory-test-utils to raphtory-tests
* remove some useless cfg
* chore: apply tidy-public auto-fixes
* Remove duplicate members from workspace
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…2617)
* remove proto completely as we are no longer planning to support it
* use build-fast for python tests
* cargo.lock
* remove prost from workspace
* make sure tests compile
* clean up the rest of the merge issues
* cache wip
* attempt to make this work with dashmap but it is getting complicated
* implement dirty graph handling using pinning only
* fix python
* remove test of tti-based eviction as it is no longer a thing
* chore: apply tidy-public auto-fixes
* make the server startup work with port=0 and add fallback when the server is started without giving a specific port
* some cleanup
* add function to look up port on server
* make the cli port behaviour consistent
* enable panic on drop errors for tests
* make sure graph is dropped before replacing it
* update more tests so they work with arbitrary ports
* make the python tests work even if there is a raphtory server running
* Need to actually use the newly materialized graph and not the old graph when inserting with disk storage enabled
* moving a graph to the same name should be a no-op, not delete the graph
* remove dbg
* make sure the timeout test isn't flaky if the server happens to start quickly
* chore: apply tidy-public auto-fixes
* delete empty tests file
* make sure we don't return the unfiltered graph when only filtered access is available
* get list of nodes and edges by querying the graph on the server to avoid any potential ordering issues in the test
* explicitly control the drop order in test to make sure graph and data are dropped before the directory is deleted
* fix drop order problem in tests that would cause the directory to be deleted before the graph is dropped and enable panic on drop for graphql tests by default
* port=0 doesn't work for embedding server
* refactor replacement and invalidation logic to make it easier to understand and make sure vectorisation is working correctly
* cleanup
* wait for unique reference when dropping graph
* make sure the TempDir is dropped last
* add explicit scopes for temporary directories in tests to avoid errors due to directory being cleaned up too early
* enable panic-on-drop in tests
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
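The port=0 behaviour mentioned in the commit above relies on a standard operating-system feature: binding to port 0 asks the OS for any free port, and the chosen port can then be read back from the listener. The sketch below shows that mechanism with the standard library only. `bind_server` is a hypothetical helper, not the Raphtory server API, and falling back to port 0 when no port is given is an assumption about what the commit's "fallback" means.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

/// Bind to `port` on localhost, or let the OS choose a free port when none is given.
/// Returns the listener together with the address it actually bound to.
pub fn bind_server(port: Option<u16>) -> io::Result<(TcpListener, SocketAddr)> {
    // Port 0 means "any free port"; the OS assigns one at bind time.
    let listener = TcpListener::bind(("127.0.0.1", port.unwrap_or(0)))?;
    // Look up the concrete port so clients and tests can connect to it.
    let addr = listener.local_addr()?;
    Ok((listener, addr))
}
```

Tests that start a server this way can read `addr.port()` instead of hard-coding a port, which avoids clashes with a server that is already running on the machine.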
* updates to deal with dependabot
* rustls
* New tantivy api
---------
Co-authored-by: miratepuffin <b.a.steer@qmul.ac.uk>
* Add row_size to WriteLockedPropMapper
* Add some docs
* Add some docs
* Setup print debugging
* Some cleanup
* Add test for estimated size on unified types
* Update row_size after unifying types
* Add tests for WriteLockedPropMapper
* Run fmt
---------
Co-authored-by: Lucas Jeub <lucas.jeub@pometry.com>
* Added update of .meta file on flush. It should update node and edge counts in this file
* Updated refresh of .meta file to not recompute anything other than node and edge counts. .meta file also refreshed when graph goes out of scope
* Move .meta file update logic from Raphtory crate into pometry-storage crate (db4-disk-storage). Move tests over as well.
* Move GRAPH_META_PATH constant from Raphtory to storage (both db4-storage and db4-disk-storage).
What changes were proposed in this pull request?
Progress towards the new version of the underlying storage
Why are the changes needed?
Does this PR introduce any user-facing change? If yes is this documented?
How was this patch tested?
Are there any further changes required?