Modify data transfer between nodes into: store Parquet on-disk --> DuckDB to query data --> send Arrow byte-stream to frontend #118

Open
shaddad3 wants to merge 25 commits into urban-toolkit:main from shaddad3:main
@shaddad3 shaddad3 commented May 8, 2026

Describe your changes

I aimed to modify the data transfer between nodes in Curio to leverage parquet for data storage and DuckDB as a query engine, and to modify the frontend so it receives an Arrow byte-stream instead of JSON.

This contribution is the product of my course project with Professor Miranda for his CS 524 class.

At a high level, my changes are as follows:

  1. Store .parquet files on-disk instead of .data files (still in the folder ./.curio/data). This skips the serialization and compression steps done before and instead stores the data on disk in columnar format.
  2. Spin up a temporary, in-memory DuckDB instance to query the .parquet files directly and retrieve the data as an Arrow Table.
  3. Stream the Arrow Table to the frontend instead of JSON.

Please see docs/Project_Documentation_And_Journey.md for the changes I made to each piece of Curio. It can also be downloaded here if that is easier: Project_Documentation_and_Journey.md

Issue resolved by this PR (if any)

  • Issue Number:
  • Link:

Type of change (Check all that apply)

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update
  • Other:

Parts of Curio impacted by this PR:

  • Frontend
  • Backend
  • Sandbox

--- See docs/Project_Documentation_And_Journey.md for the changes I made to each piece of Curio.

Testing

  • Executed ./scripts/test.sh and it passed all tests.

--- All the unit tests run by this script passed, and I did manual testing with the JSON workflows in the docs/examples/ folder.

Screenshots (if relevant)

Checklist (Check all that apply)

  • I have manually loaded each .json test from the tests/ folder into Curio, ran all the nodes one by one, and checked that they run without errors and give the expected results
  • I have commented my code, particularly in hard-to-understand areas
    --- I have comments in parts of my code, but they are not always thorough. I have added a comprehensive markdown document of all the changes I made, which can be found in docs/Project_Documentation_And_Journey.md.
  • I have made corresponding changes to the documentation
    --- I did not make changes to the documentation but added a comprehensive markdown document of all the changes I made, which can again be found in docs/Project_Documentation_And_Journey.md.
  • My changes generate no new warnings
    --- I don't believe my version produces any new warnings, though I can't guarantee this, as I am not aware of which warnings Curio already produces.
  • I have added tests that prove my fix is effective or that my feature works
    --- I have added a benchmark_scripts/ folder, which aims to give an idea of the speedup, file size reduction, and memory improvement from my changes. Each script can be run with the command python benchmark_xxx.py. There are 6 different benchmark_xxx.py files in this folder.
  • New and existing unit tests pass locally with my changes
    --- I ran ./scripts/test.sh --unit-only and these all pass.
  • Any dependent changes have been merged and published in downstream modules

shaddad3 added 25 commits March 10, 2026 08:13
draft image of current data transfer in Curio & proposed changes with DuckDB + Arrow