Making Hudi Github Project More Agentic #18324
Replies: 7 comments 10 replies
-
Yes, I think this would be very useful for the community. How do we plan to keep the AGENTS.md/ARCHITECTURE.md files updated as major changes land? Also, if you could share some of your own experience using agents with the Hudi repo, that would be very helpful for the community.
-
@soumilshah1995 One idea that @vinothchandar brought up in the Hudi community sync was revamping the current hudi-cli to embrace this new era of agents. Ideally giving
-
I think this is a very important step for the Hudi community to take. There is a fundamental gap here that an AI-native experience can bridge for the project. For many years we've had a divide: users want advanced, powerful features but don't want to deal with the complexity of learning and tuning configs. Without AI, unfortunately, satisfying both was not possible. With any other database, you get a system that can do many things powerfully, but you also need to spend the time learning it. As the data lake became a lakehouse, data engineers needed to level up and become something like data lake administrators. Unfortunately, this never happened. Even today, engineers are sold completely closed solutions for the sake of ease of use. But I think AI presents us an opportunity to bridge this gap once and for all for open source. I would like to reimagine the interface to the project, for any user building a lakehouse, as a Claude Code or Codex-like terminal experience. With lots of will and a little bit of skill, I think we can get there much faster than we think. The vision is that you create, build, optimize, and deploy everything from an agentic terminal CLI. For starters
Over the weekend I built a toy MCP server with some 27 tools and auto-discovery of tables (local fs for now; we can bring in catalogs and cloud storage pretty easily). I think we can even make
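As a rough illustration of what table auto-discovery on a local filesystem could look like, here is a minimal sketch. It is not the code from the toy MCP server mentioned above; the function names are hypothetical, and it assumes only that a Hudi table's base path contains a `.hoodie/hoodie.properties` file.

```python
import os
from pathlib import Path
from typing import Dict, Iterator


def discover_hudi_tables(root: str) -> Iterator[Path]:
    """Yield directories under `root` that look like Hudi table base paths.

    Assumption: a directory is a table if it holds `.hoodie/hoodie.properties`.
    """
    for dirpath, dirnames, _ in os.walk(root):
        if (Path(dirpath) / ".hoodie" / "hoodie.properties").is_file():
            yield Path(dirpath)
            # Skip the table's own partitions and metadata folders.
            dirnames.clear()


def read_table_properties(table_path: Path) -> Dict[str, str]:
    """Parse simple key=value lines from hoodie.properties (read-only)."""
    props: Dict[str, str] = {}
    prop_file = table_path / ".hoodie" / "hoodie.properties"
    for line in prop_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props
```

An agent-facing "list tables" tool could call `discover_hudi_tables` on a configured warehouse directory and report `hoodie.table.name` and `hoodie.table.type` for each hit. Catalogs and cloud storage would need a different lookup.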
-
@soumilshah1995 +1 on the original AGENTS.md/ARCHITECTURE.md/CLAUDE.md idea. I think @yihua has a PR for a CLAUDE.md file; we should consider that as well. Happy to take contributions around this: get it landed and keep chipping away.
-
@soumilshah1995 Thanks, Soumil, for starting the discussion! If there is something you want to take up, feel free to file a GitHub issue and I can assign it to you based on the discussion we are having. Also happy to jam over Slack if you want; feel free to reach out!
-
Let me recap a structured approach for integrating AI agents into Hudi in a practical and incremental way.

Workstream 1: Agent-Friendly Repository
Introduce lightweight, maintainable documentation to help both humans and AI agents better understand the project:
- Add a concise AGENTS.md describing key concepts, workflows, and entry points
The goal is to improve discoverability without introducing heavy maintenance overhead.

Workstream 2: Agent-Friendly Runtime Interface
This aligns with the broader vision of a Claude/Codex-like experience for building and operating lakehouses. (Vinoth's proposal)

Phase 1: Read-Only Investigation Tools
Create a modern Hudi operator surface (via MCP and/or an enhanced CLI) focused on safe, read-only operations:
- List tables
This phase focuses on observability and understanding, enabling users (and agents) to reason about table state safely.

Phase 2: Guided Actions (Controlled Mutations)
After validating read-only workflows and gathering community feedback, introduce controlled write operations:
- Trigger compaction and clustering
The emphasis here is on safe, guided execution, not full automation.

Workstream 3: Guided Labs and Onboarding
Provide structured, hands-on experiences to help users get started quickly:
// This is what Soumil proposed above or have already started efforts on
The goal is to reduce the learning curve and accelerate adoption.

Workstream 4: Troubleshooting and Diagnostics
Enable agents to assist with common operational issues:
- Diagnose ingestion failures and stuck pipelines
This workstream focuses on operational productivity and debugging efficiency.
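To make the Phase 1 / Phase 2 split in Workstream 2 concrete, here is a minimal sketch of a tool registry that serves read-only tools by default and gates mutating ones behind both a registry-level switch and a per-call confirmation. Everything in it (`ToolRegistry`, the tool names, the placeholder handlers) is hypothetical and only illustrates the gating idea. It does not call Hudi or any MCP library.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable[..., str]
    read_only: bool


class ToolRegistry:
    """Registry that refuses mutating tools unless explicitly enabled."""

    def __init__(self, allow_mutations: bool = False) -> None:
        self._tools: Dict[str, Tool] = {}
        self._allow_mutations = allow_mutations

    def register(self, name: str, handler: Callable[..., str], read_only: bool) -> None:
        self._tools[name] = Tool(name, handler, read_only)

    def call(self, name: str, *, confirm: bool = False, **kwargs) -> str:
        tool = self._tools[name]
        if not tool.read_only:
            # Phase 2 behaviour: mutations need an opt-in and a confirmation.
            if not self._allow_mutations:
                raise PermissionError(f"{name} is a mutating tool and mutations are disabled")
            if not confirm:
                raise PermissionError(f"{name} requires confirm=True")
        return tool.handler(**kwargs)


# Placeholder handlers; a real implementation would inspect or act on tables.
def list_tables(root: str) -> str:
    return f"tables under {root}"


def trigger_compaction(table: str) -> str:
    return f"compaction requested for {table}"


def build_registry(allow_mutations: bool = False) -> ToolRegistry:
    registry = ToolRegistry(allow_mutations=allow_mutations)
    registry.register("list_tables", list_tables, read_only=True)
    registry.register("trigger_compaction", trigger_compaction, read_only=False)
    return registry
```

With this shape, a Phase 1 deployment is simply `build_registry()`, and Phase 2 becomes an explicit opt-in (`build_registry(allow_mutations=True)`) where each write call still has to pass `confirm=True`.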
-
hudi cli mcp support #18554
-
Hi everyone — I had a small proposal that could help make our open-source GitHub repos much more AI-agent friendly and easier to maintain.
Idea: introduce a few lightweight repo conventions so AI agents understand the project better and can contribute safely.
A. AGENTS.md
A file that explains the repository structure, coding conventions, and important modules so AI agents understand how the project is organized before making changes.
B. ARCHITECTURE.md
A high-level overview of the system architecture, key components, and data flow. This helps agents reason about design decisions instead of only reading code.
C. Documentation Skill (AI-based)
Introduce an automated documentation workflow where an AI agent periodically scans the repo and generates or updates documentation for APIs, modules, and examples. This could help keep open-source documentation consistently up to date.
Goal:
Make the repo easier for both humans and AI agents to understand, improve onboarding, and enable more reliable automated contributions.
Happy to experiment with a small prototype if folks think this could be useful.
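One lightweight way to keep a file like AGENTS.md from drifting, sketched here under assumptions, is a check that flags top-level modules the file never mentions. The function name is hypothetical, and the sketch assumes a Maven-style layout where each top-level module directory contains a `pom.xml`. It could run in CI or as a first step before an agent regenerates documentation.

```python
from pathlib import Path
from typing import List


def undocumented_modules(repo_root: str, doc_name: str = "AGENTS.md") -> List[str]:
    """Return top-level module directories not mentioned in the doc file.

    Assumption: a module is any top-level directory containing pom.xml.
    Matching is a plain substring test, so a short module name also counts
    as mentioned when a longer name containing it appears in the file.
    """
    root = Path(repo_root)
    doc_path = root / doc_name
    doc_text = doc_path.read_text(encoding="utf-8") if doc_path.is_file() else ""
    modules = sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "pom.xml").is_file()
    )
    return [name for name in modules if name not in doc_text]
```

A non-empty result is a cheap signal that the doc needs attention; it says nothing about whether the existing descriptions are still accurate.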