Making Hudi Github Project More Agentic #18324

soumilshah1995 · 2026-03-15T20:22:56Z

Hi everyone — I had a small proposal that could help make our open-source GitHub repos much more AI-agent friendly and easier to maintain.

Idea: introduce a few lightweight repo conventions so AI agents understand the project better and can contribute safely.

A. AGENTS.md
A file that explains the repository structure, coding conventions, and important modules so AI agents understand how the project is organized before making changes.

B. ARCHITECTURE.md
A high-level overview of the system architecture, key components, and data flow. This helps agents reason about design decisions instead of only reading code.

C. Documentation Skill (AI-based)
Introduce an automated documentation workflow where an AI agent periodically scans the repo and generates or updates documentation for APIs, modules, and examples. This could help keep open-source documentation consistently up to date.

Goal:
Make the repo easier for both humans and AI agents to understand, improve onboarding, and enable more reliable automated contributions.

Happy to experiment with a small prototype if folks think this could be useful.

codope · 2026-03-16T05:56:43Z

Collaborator

Yes, I think this would be very useful for the community. How do we plan to keep the AGENTS.md/ARCHITECTURE.md files updated as major changes land? Also, if you could share some of your own experience using agents with the Hudi repo, it would be very helpful for the community.

2 replies

geserdugarov Mar 24, 2026

I suppose that to maintain AGENTS.md/ARCHITECTURE.md we have to accept that they will only be eventually consistent. For instance, on a weekly basis someone could spend their tokens and use the latest model with the largest context window to go through all merged changes and update those files.

From my experience, the goal in managing AI-facing content should be to keep it tight. It's really easy to let it grow exponentially. The more description you have, the higher the probability of conflicts within it. Removing content is hard these days.

cshuo Mar 25, 2026
Collaborator

That's a great point. For keeping AGENTS.md / ARCHITECTURE.md up to date, maybe we can use a lightweight process:

Add a PR checklist item: “Does this change require updates to AGENTS/ARCHITECTURE docs?”
Optionally add a non-blocking CI reminder when large structural changes are detected (new modules/package moves/public API changes).
Do a quick doc consistency pass before each release.
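
As a rough illustration of the non-blocking CI reminder, here is a minimal Python sketch. The heuristics are assumptions made for this example (a changed `pom.xml` as a proxy for a new or moved module, and a file-count cutoff); nothing here comes from an existing Hudi CI job.

```python
from pathlib import PurePosixPath

# Hypothetical heuristics; tune these for the real repository layout.
DOC_FILES = {"AGENTS.md", "ARCHITECTURE.md"}
STRUCTURAL_NAMES = {"pom.xml"}   # build-file edits often mean a module was added or moved
MANY_FILES_THRESHOLD = 50        # assumed cutoff for a "large" change


def needs_doc_reminder(changed_files):
    """Return a reminder message, or None if no reminder is needed."""
    paths = [PurePosixPath(p) for p in changed_files]
    if any(p.name in DOC_FILES for p in paths):
        return None  # the PR already touches the agent docs
    structural = sorted(str(p) for p in paths if p.name in STRUCTURAL_NAMES)
    if structural:
        return "Build files changed (" + ", ".join(structural) + "); check AGENTS/ARCHITECTURE docs."
    if len(paths) >= MANY_FILES_THRESHOLD:
        return f"{len(paths)} files changed; check AGENTS/ARCHITECTURE docs."
    return None
```

A CI step could feed this the output of `git diff --name-only`, post the message as a PR comment, and always exit with status 0 so the check never blocks a merge.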

rahil-c · 2026-04-02T04:00:59Z

Collaborator

@soumilshah1995 One idea that @vinothchandar brought up in the Hudi community sync was revamping the current hudi-cli to embrace this new era of agents. Ideally, that means giving Claude, Codex, or another LLM context on the Hudi project, such as how to run certain commands, or even adding new skills for operating a Hudi table (for example, running table services or tuning certain configs). Happy to sync at some point and talk through some ideas with you as well!

2 replies

soumilshah1995 Apr 4, 2026
Author

That's a really nice idea

brucearctor Apr 5, 2026

sounds like agentskills.

An example:

https://temporal.io/blog/introducing-temporal-developer-skill

ex ->
https://github.com/temporalio/skill-temporal-developer/tree/main/references/java

vinothchandar · 2026-04-06T20:45:27Z

Collaborator

I think this is indeed a very important step that the Hudi community needs to take. I feel there is a fundamental gap here that an AI-native experience can bridge for the project. For many years now, we've had this divide where users want advanced, powerful features but don't want to deal with the complexity of learning and tuning configs. Without AI, this unfortunately was not possible. With any other database, you get a system that can do many things powerfully, but you also need to spend the time learning the system.

As the data lake became a lake house, you need data engineers to also level up and become kind of data lake administrators. Unfortunately, this never happened. Engineers are sold completely closed solutions even today for the sake of ease of use.

But I think AI presents us an opportunity to bridge this gap once and for all for open source. I would like to reimagine the interface to the project, for any user building a lakehouse, as a Claude Code or Codex-like terminal experience.

With lots of will and a little bit of skill, I think we can get there much faster than we expect. The vision is that you create, build, optimize, and deploy everything from an agentic terminal CLI.

For starters

Fully porting the hudi-cli to this model, with the ability to connect to local fs, cloud storage locations, and catalogs, will help the entire community do more without breaking a sweat.
Building some version-aware RAG over hudi-docs (foundation models don't get this) can help with config tuning, along with an optimize skill.
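
To make the "version-aware" part concrete, here is a minimal Python sketch of retrieval that filters by docs version before ranking. It is only an illustration: `DocChunk`, `retrieve`, and the version labels are hypothetical, and plain word overlap stands in for whatever embedding search a real RAG pipeline would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocChunk:
    version: str   # docs release the chunk was taken from (labels are made up)
    text: str


def retrieve(chunks, query, version, top_k=3):
    """Keep only chunks for the requested version, then rank by word overlap."""
    query_words = set(query.lower().split())
    candidates = [c for c in chunks if c.version == version]
    scored = [(len(query_words & set(c.text.lower().split())), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] > 0]   # drop chunks with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The key design choice is that the version filter runs before ranking, so a highly similar chunk from another release can never outrank the correct one.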

I built a toy MCP server with some 27 tools, over the weekend, with auto discovery of tables (local fs for now, we can bring catalogs and cloud storage in there pretty easily). I think we can even make @hudi/table/<table_name> work as MCP resources.

[13:25:58] hudi-agent % claude
 ▐▛███▜▌   Claude Code v2.1.92
...
...
❯ what hudi tables do I have?

⏺ hudi - hudi_list_tables (MCP)
  ⎿  {
       "result": "Discovered 11 table(s):\n  hudi_demo_table                 COPY_ON_WRITE         /private/tmp/hudi-
     agent-demo/demo_table\n  customers                       COPY_ON_WRITE         /private/tmp/hudi-datalake/custom
     … +9 lines (ctrl+o to expand)

⏺ You have 11 Hudi tables across two storage locations:

  ┌─────────────────┬──────┬──────────────────────────────────────────┐
  │      Table      │ Type │                   Path                   │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ hudi_demo_table │ COW  │ /private/tmp/hudi-agent-demo/demo_table  │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ customers       │ COW  │ /private/tmp/hudi-datalake/customers     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ shipments       │ COW  │ /private/tmp/hudi-datalake/shipments     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ payments        │ COW  │ /private/tmp/hudi-datalake/payments      │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ products        │ COW  │ /private/tmp/hudi-datalake/products      │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ user_sessions   │ MOR  │ /private/tmp/hudi-datalake/user_sessions │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ audit_log       │ COW  │ /private/tmp/hudi-datalake/audit_log     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ clickstream     │ MOR  │ /private/tmp/hudi-datalake/clickstream   │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ inventory       │ MOR  │ /private/tmp/hudi-datalake/inventory     │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ orders          │ COW  │ /private/tmp/hudi-datalake/orders        │
  ├─────────────────┼──────┼──────────────────────────────────────────┤
  │ reviews         │ MOR  │ /private/tmp/hudi-datalake/reviews       │
  └─────────────────┴──────┴──────────────────────────────────────────┘

  6 are Copy-on-Write and 4 are Merge-on-Read tables. Most live under /private/tmp/hudi-datalake/. Want to inspect any of
  them?

❯ Analyze layout of table clickstream

⏺ hudi - hudi_describe_table (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  {\n    \"Property\": \"basePath\",\n    \"Value\": \"/private/tmp/hudi-datalake/clickstream\"\
     n  },\n  {\n    \"Property\": \"metaPath\",\n    \"Value\": \"/private/tmp/hudi-datalake/clickstream/.hoodie\"\n
     … +25 lines (ctrl+o to expand)

⏺ hudi - hudi_fetch_schema (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "{\n  \"type\" : \"record\",\n  \"name\" : \"clickstream_record\",\n  \"namespace\" : \"hoodie.click
     stream\",\n  \"fields\" : [ {\n    \"name\" : \"_hoodie_commit_time\",\n    \"type\" : [ \"null\", \"string\" ],
     … +10 lines (ctrl+o to expand)

⏺ hudi - hudi_timeline_active (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  {\n    \"Instant\": \"20260404203259123\",\n    \"Action\": \"deltacommit\",\n    \"State\": \
     "COMPLETED\"\n  }\n]"
     }

⏺ hudi - hudi_metadata_partitions (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  \"dt=2024-01-01\",\n  \"dt=2024-01-02\",\n  \"dt=2024-01-03\"\n]"
     }

⏺ hudi - hudi_metadata_stats (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[]"
     }

⏺ hudi - hudi_commits_show (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[\n  {\n    \"CommitTime\": \"20260404203259123\",\n    \"Action\": \"deltacommit\",\n    \"TotalBy
     tesWritten\": 1307319,\n    \"TotalFilesInsert\": 3,\n    \"TotalFilesUpdated\": 0,\n    \"Partitions\": 3,\n
     … +2 lines (ctrl+o to expand)

⏺ hudi - hudi_metadata_files (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[]"
     }

⏺ hudi - hudi_compactions_show (MCP)(table_name: "clickstream")
  ⎿  {
       "result": "[]"
     }

⏺ Here's the full layout analysis for clickstream:

  ---
  Table Identity

  ┌─────────────────┬────────────────────────────────────────┐
  │    Property     │                 Value                  │
  ├─────────────────┼────────────────────────────────────────┤
  │ Type            │ Merge-on-Read (MOR)                    │
  ├─────────────────┼────────────────────────────────────────┤
  │ Base path       │ /private/tmp/hudi-datalake/clickstream │
  ├─────────────────┼────────────────────────────────────────┤
  │ Record key      │ id                                     │
  ├─────────────────┼────────────────────────────────────────┤
  │ Partition field │ dt (hive-style)                        │
  ├─────────────────┼────────────────────────────────────────┤
  │ Ordering field  │ ts                                     │
  ├─────────────────┼────────────────────────────────────────┤
  │ Merge mode      │ EVENT_TIME_ORDERING                    │
  ├─────────────────┼────────────────────────────────────────┤
  │ Key generator   │ SIMPLE                                 │
  ├─────────────────┼────────────────────────────────────────┤
  │ Table version   │ 9                                      │
  ├─────────────────┼────────────────────────────────────────┤
  │ Timeline layout │ v2                                     │
  └─────────────────┴────────────────────────────────────────┘

  Schema (6 user fields + 5 meta fields)

  ┌─────────┬─────────────────────────────────────┐
  │  Field  │                Type                 │
  ├─────────┼─────────────────────────────────────┤
  │ id      │ int (nullable)                      │
  ├─────────┼─────────────────────────────────────┤
  │ user_id │ int (nullable)                      │
  ├─────────┼─────────────────────────────────────┤
  │ page    │ string (nullable)                   │
  ├─────────┼─────────────────────────────────────┤
  │ action  │ string (nullable)                   │
  ├─────────┼─────────────────────────────────────┤
  │ ts      │ long (nullable) — ordering field    │
  ├─────────┼─────────────────────────────────────┤
  │ dt      │ string (nullable) — partition field │
  └─────────┴─────────────────────────────────────┘

  Plus the standard _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path,
  _hoodie_file_name meta fields.

  Partitions (3)

  - dt=2024-01-01
  - dt=2024-01-02
  - dt=2024-01-03

  Timeline & Commits

  Single completed deltacommit at 20260404203259123:
  - 8 records inserted across 3 partitions
  - 3 log files written (one per partition — typical for MOR inserts)
  - ~1.3 MB total bytes written
  - 0 updates, 0 write errors

  Metadata Table

  Enabled indexes: files, column_stats, partition_stats

  Compactions

  None pending. Since there's only one deltacommit with log files and no base (parquet) files yet, the table is entirely
  in delta log form. A compaction would convert the logs into columnar base files — something to consider if you plan to
  run read-optimized queries.

  ---
  Observations

  1. Very young table — single commit, 8 records. Essentially a fresh table with seed data.
  2. All data in log files — no base files exist yet. Snapshot queries will read logs directly; read-optimized queries
  would return nothing until a compaction runs.
  3. Even distribution — ~2-3 records per partition across the 3 date partitions.
  4. Good metadata setup — column_stats and partition_stats are enabled, which will help with data skipping as the table
  grows.

  Would you like me to dig deeper into any specific aspect, or look at the file-level details within a partition?

✻ Cooked for 33s
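
The thread does not show how the toy server discovers tables. One plausible approach on a local filesystem, sketched below in Python, is to walk a root directory and look for `.hoodie/hoodie.properties`, then read `hoodie.table.name` and `hoodie.table.type` from it. The function names are hypothetical and this is not the implementation behind the transcript above.

```python
import os


def read_properties(path):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def discover_tables(root):
    """Return (name, type, base_path) for each directory holding .hoodie/hoodie.properties."""
    found = []
    for dirpath, dirnames, _ in os.walk(root):
        props_file = os.path.join(dirpath, ".hoodie", "hoodie.properties")
        if os.path.isfile(props_file):
            props = read_properties(props_file)
            found.append((props.get("hoodie.table.name", "?"),
                          props.get("hoodie.table.type", "?"),
                          dirpath))
            dirnames.clear()  # do not descend into the table's own partitions
    return sorted(found)
```

Calling `discover_tables("/private/tmp/hudi-datalake")` on a layout like the one in the transcript would return one tuple per table; cloud storage and catalogs would need their own listing logic.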

2 replies

vinothchandar Apr 6, 2026
Collaborator

This can get way sharper if we lean in and write skills.

nsivabalan Apr 10, 2026
Collaborator

Right way to go Vinoth! We should totally build this. I can also volunteer to contribute or coordinate the effort with others.

vinothchandar · 2026-04-06T20:50:58Z

Collaborator

@soumilshah1995 +1 on the original AGENTS.md/ARCHITECTURE.md/CLAUDE.md idea. I think @yihua has a PR for a CLAUDE.md file; we should consider that as well. Happy to take contributions around this, get it landed, and keep chipping away.

3 replies

soumilshah1995 Apr 8, 2026
Author

Hi @vinothchandar — your ideas are fantastic. Having the Hudi CLI controlled by Claude sounds like a great starting point.
Also, could you share if there’s an existing CLAUDE.md in any active PR? If not, I’d be happy to help add one. We could initialize it using /init, let Claude scan the repository, and generate a baseline CLAUDE.md. From there, we can gradually evolve it by adding rules, coding standards, and best practices.

This could help make the Hudi repo more agent-friendly, and for anyone using modern AI tools, it would also encourage better documentation and consistency across contributions.

yihua Apr 10, 2026
Collaborator

@soumilshah1995 Thanks for the initiative! Here's the WIP PR on adding Claude Skills for Hudi: #18338

Do you want to create a PR adding CLAUDE.md, so we can iterate on it?

soumilshah1995 Apr 10, 2026
Author

Thanks a lot for sharing this

rahil-c · 2026-04-10T17:49:30Z

Collaborator

@soumilshah1995 Thanks, Soumil, for starting the discussion! If there is something you want to take up, feel free to file a GitHub issue and I can assign it to you based on the discussion we are having. Also happy to jam over Slack; feel free to reach out!

1 reply

soumilshah1995 Apr 13, 2026
Author

Hi there, I’m planning to build a “delta streamer” lab setup that lets anyone get started in just a few minutes locally.
Of course, I’m using Claude to help with this. The idea is to make the learning process fun and easy, instead of people getting stuck worrying about knobs and configuration.

I’ve already completed around 12 labs on Delta Streamer, and now I want to take it further by building an agent-driven learning experience.

https://github.com/soumilshah1995/apache-hudi-delta-streamer-labs

nsivabalan · 2026-04-16T02:59:05Z

Collaborator

Let me recap a structured approach for integrating AI agents into Hudi in a practical and incremental way.

Workstream 1: Agent-Friendly Repository

Introduce lightweight, maintainable documentation to help both humans and AI agents better understand the project:

Add a concise AGENTS.md describing key concepts, workflows, and entry points
Add an ARCHITECTURE.md capturing high-level system design and core components
Optionally introduce a PR checklist item to keep these documents updated as the code evolves

The goal is to improve discoverability without introducing heavy maintenance overhead.

Workstream 2: Agent-Friendly Runtime Interface

This aligns with the broader vision of a Claude/Codex-like experience for building and operating lakehouses. (Vinoth's proposal)

Phase 1: Read-Only Investigation Tools

Create a modern Hudi operator surface (via MCP and/or an enhanced CLI) focused on safe, read-only operations:

List tables
Describe table metadata
Fetch and inspect timeline
Summarize recent commits (inserts/updates)
Show table services state (compaction, clustering, cleaning)
Explain key configurations and their impact

This phase focuses on observability and understanding, enabling users (and agents) to reason about table state safely.
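
One way to enforce the read-only boundary of Phase 1 is at the tool-registration layer. The following Python sketch is an assumption about how that could look; `ToolRegistry` and the tool names are hypothetical and not part of hudi-cli or any MCP SDK.

```python
class ToolRegistry:
    """Holds named tools and refuses mutating ones while in read-only mode."""

    def __init__(self, read_only=True):
        self.read_only = read_only
        self._tools = {}

    def register(self, name, func, mutates=False):
        # Each tool declares up front whether it changes table state.
        self._tools[name] = (func, mutates)

    def call(self, name, *args, **kwargs):
        func, mutates = self._tools[name]
        if mutates and self.read_only:
            raise PermissionError(f"{name} mutates table state; registry is read-only")
        return func(*args, **kwargs)
```

With this shape, Phase 2 tools can be registered early with `mutates=True` and stay inert until the registry is explicitly created with `read_only=False`.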

Phase 2: Guided Actions (Controlled Mutations)

After validating read-only workflows and gathering community feedback, introduce controlled write operations:

Trigger compaction and clustering
Trigger cleaner and archival
Trigger rollbacks for pending/inflight ingestion commits
Suggest configuration changes based on observed patterns
Generate commands first (instead of executing directly), allowing users to review before applying

The emphasis here is on safe, guided execution, not full automation.
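
The "generate commands first" idea can be sketched as a plan-then-approve flow. This Python example is illustrative only: `PlannedAction`, `plan_compaction`, and `execute` are hypothetical names, and the command string is a placeholder, not verified hudi-cli syntax.

```python
from dataclasses import dataclass


@dataclass
class PlannedAction:
    table: str
    description: str
    command: str            # placeholder text, NOT verified hudi-cli syntax
    approved: bool = False


def plan_compaction(table):
    """Build a proposal only; nothing is executed here."""
    return PlannedAction(
        table=table,
        description=f"Compact pending log files of {table}",
        command=f"<compaction command for {table}>",
    )


def execute(action, runner):
    """Run the command through `runner` only after a human has approved it."""
    if not action.approved:
        raise RuntimeError("refusing to run an unreviewed action: " + action.description)
    return runner(action.command)
```

The agent would show `action.command` to the user, and only an explicit approval flips `approved` to `True` before `execute` is called.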

Workstream 3: Guided Labs and Onboarding

Provide structured, hands-on experiences to help users get started quickly:

(This is what Soumil proposed above and has already started working on.)
Create guided workflows for:
Table creation and ingestion
Incremental processing
Table services (compaction, clustering)
Optionally integrate agent-assisted walkthroughs

The goal is to reduce the learning curve and accelerate adoption.

Workstream 4: Troubleshooting and Diagnostics

Enable agents to assist with common operational issues:

Diagnose ingestion failures and stuck pipelines
Identify performance bottlenecks (e.g., small files, skew, slow queries)
Analyze timeline issues (pending/inflight commits)
Provide actionable recommendations for fixes

This workstream focuses on operational productivity and debugging efficiency.
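
As one small example of a timeline diagnostic, the Python sketch below flags instants that are not `COMPLETED` and are older than a threshold. It assumes timeline entries shaped like the `hudi_timeline_active` output earlier in this thread (`Instant`, `Action`, `State`), that the first 14 digits of an instant are `yyyyMMddHHmmss`, and that `now` is in the same timezone as the instants. The one-hour default is arbitrary.

```python
from datetime import datetime, timedelta


def find_stuck_instants(timeline, now, max_age=timedelta(hours=1)):
    """Return timeline entries that are not COMPLETED and older than max_age."""
    stuck = []
    for entry in timeline:
        if entry["State"] == "COMPLETED":
            continue
        # Ignore the trailing milliseconds; only second precision is needed here.
        started = datetime.strptime(entry["Instant"][:14], "%Y%m%d%H%M%S")
        if now - started > max_age:
            stuck.append(entry)
    return stuck
```

An agent could run this over the active timeline and then suggest, rather than perform, a follow-up such as inspecting or rolling back the flagged instants.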

0 replies

nsivabalan · 2026-04-23T03:28:42Z

Collaborator

hudi cli mcp support #18554

0 replies
