Modify data transfer between nodes into: store Parquet on-disk --> DuckDB to query data --> send Arrow byte-stream to frontend #118
Open
shaddad3 wants to merge 25 commits into
draft image of current data transfer in Curio & proposed changes with DuckDB + Arrow
…rch for data transfer speed
Describe your changes
This PR changes the data transfer between nodes in Curio to use Parquet for on-disk data storage and DuckDB as the query engine, and changes the frontend so that it receives an Arrow byte-stream instead of JSON.
This contribution is the product of my course project with Professor Miranda for his CS 524 class.
At a high level, my changes are as follows:
Please see docs/Project_Documentation_And_Journey.md for the changes I made to each piece of Curio. It can also be downloaded here if that is easier: Project_Documentation_and_Journey.md
Issue resolved by this PR (if any)
Type of change (Check all that apply)
Parts of Curio impacted by this PR:
--- See docs/Project_Documentation_And_Journey.md for the changes I made to each piece of Curio.
Testing
I ran `./scripts/test.sh` and it passed all tests.
--- I passed all the unit tests in this file, and did manual testing with the JSON workflows in the docs/examples/ folder.
Screenshots (if relevant)
Checklist (Check all that apply)
`tests/` folder into Curio, ran all the nodes one by one, and checked that they run without errors and give the expected results
--- I have comments in parts of my code, but they are not great all the time. I have added a comprehensive markdown document of all the changes I made, which can be found in docs/Project_Documentation_And_Journey.md.
--- I did not make changes to the documentation but added a comprehensive markdown document of all the changes I made, which can again be found in docs/Project_Documentation_And_Journey.md.
--- I don't believe my version introduces any new warnings, though I can't guarantee this, as I wasn't aware of any existing warnings in Curio.
--- I have added a `benchmark_scripts/` folder which aims to give an idea of the speedup, file size reduction, and memory improvement my changes make. These can be run with the command `python benchmark_xxx.py`. There are 6 different `benchmark_xxx.py` files in this folder.
--- I ran `./scripts/test.sh --unit-only` and these all pass.