docs: Add KEP-0812 Composable Kale Notebooks proposal #847
Adds the design proposal for composable notebooks: extending Kale to support composition of multiple notebooks into a single KFP pipeline via a new `notebook` cell type.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Eder Ignatowicz <ignatowicz@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files.
StefanoFioravanzo left a comment
I love this KEP, overall it is aligned with what we discussed already and provides a comprehensive overview of what we want to build.
I feel like we need to better explain just a couple of aspects, which I commented on above.
4. **Type inference uses name heuristics.** KFP artifact types are inferred from variable names using Kale's existing type map (`model`→Model, `dataset`→Dataset, `metrics`→Metrics, etc.). Future work may add explicit type annotations.
5. **`notebook` cells break the code merge chain.** In Kale, untagged cells merge into the previous step. A `notebook:` cell is a reference to another notebook, not executable code, so subsequent untagged cells must NOT be merged into it. The `notebook` type must be treated as a boundary (like `imports`, `skip`, `pipeline-parameters`) that stops the merge. Untagged cells after a `notebook:` cell should belong to the next explicit `step:` cell, not to the notebook reference.
Are we sure that `imports` and `pipeline-parameters` act as a boundary? `skip` definitely does, but I don't remember if the others do too. Asking just to make sure.
@StefanoFioravanzo, I have checked the parser, and the three don't behave the same. After a `skip` cell, a following untagged cell merges into the step before it (`skip` is transparent; it doesn't change the active step). After `imports` or `pipeline-parameters`, a following untagged cell goes into that block, not a step. So none of them sends a following cell to the next `step:`, which is what we want for `notebook:`. That makes `notebook:` its own case, not the same as any of those three. We would need to fix the wording to describe its own rule. @ederign, please have a look.
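To make the four behaviors in this thread concrete, here is a minimal sketch of the merge rule as described above. It is not Kale's parser: `Block` and `assign_cells` are hypothetical names, tags are simplified to plain strings, and holding untagged cells that appear before any block for the next step is an assumption made only to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    kind: str  # "imports", "pipeline-parameters", "step", or "notebook"
    name: str
    sources: list = field(default_factory=list)


def assign_cells(cells):
    """Group (tag, source) pairs into blocks; tag is None for an untagged cell."""
    blocks = []
    active = None  # block that receives the next untagged cell
    pending = []   # untagged sources held for the next explicit step
    for tag, source in cells:
        if tag is None:
            if active is None:
                # After a notebook reference (or, by assumption, at the very
                # start): hold the code for the next explicit step.
                pending.append(source)
            else:
                active.sources.append(source)
        elif tag == "skip":
            continue  # transparent: the cell is dropped, the active block is unchanged
        else:
            kind, _, name = tag.partition(":")
            block = Block(kind, name or kind)
            if kind == "notebook":
                active = None  # a reference holds no code, so nothing merges into it
            else:
                if kind == "step":
                    block.sources.extend(pending)  # held cells go to this step
                    pending = []
                block.sources.append(source)
                active = block
            blocks.append(block)
    return blocks


cells = [
    ("imports", "import pandas as pd"),
    (None, "import numpy as np"),   # joins the imports block
    ("step:preprocess", "a"),
    ("skip", "debug"),
    (None, "b"),                    # joins preprocess: skip is transparent
    ("notebook:train", ""),
    (None, "c"),                    # held for the next explicit step
    ("step:evaluate", "d"),
]
blocks = assign_cells(cells)
for block in blocks:
    print(block.kind, block.name, block.sources)
```

In this sketch, untagged cells that follow a `notebook:` cell with no later `step:` are left in `pending` and never emitted; whether that should be an error is one of the details the reworded rule would need to settle.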
Kale detects that `train.ipynb` uses `dataset` and `features` (defined by `step:preprocess`) and that it produces `model` and `test_data` (used by `step:evaluate`). The compiled pipeline has `preprocess_step` (a component), `train_pipeline` (a sub-pipeline with its own internal steps), and `evaluate_step` (a component). Data flows automatically between all three: variables cross the step/notebook boundary the same way they cross step/step boundaries.
The `train` notebook cannot be standalone, since it needs to depend on the `dataset` and `features` variables to fit this pattern. I wonder: how will someone work on and test this notebook?
This is partially mentioned below, but I feel like we haven't really made explicit what the desired user experience is in this case. It might be acceptable for a v1 to indeed require a sub-notebook to not define some of its variables. For a v2, I feel like we need to overcome this limitation.
Thinking out loud: if I am developing `train` and I know someone will attach to it, I will probably define stub variables with dummy values so I can test the notebook. These stub variables can indeed become part of the signature of the notebook. Below, in Decision 2, we mention the possibility of using `notebook-outputs` as a mechanism to define outputs. Doing something similar for inputs might be interesting, or extending the `pipeline-parameters` mechanism.
Anyway, I think this part is still up for debate. I think we should add some specific considerations about this in the KEP.
@StefanoFioravanzo yeah, this is a real gap. Right now a sub-notebook has to leave some of its variables undefined for us to pick them up as inputs, which is exactly why it can't run standalone. And if we add stubs to test it, those variables become defined, so we stop picking them up, the real upstream values never get passed in, and it silently runs on the dummy values. So we can't get both standalone-testable and composable unless the stubs are a declared signature, like you said.
`pipeline-parameters` fits scalar inputs, but `dataset`/`features` are artifacts, so those would need their own declaration: a `notebook-inputs` alongside the `notebook-outputs` idea. Probably both, by type.
So v1 keeps the current behavior, in which a sub-notebook leaves some variables undefined, and v2 adds explicit input/output declarations for the reusable case. We could add a short section on this. @ederign, please have a look.
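The stub problem described in this thread can be reproduced with a few lines of standard-library `ast`. This is only a sketch of the idea, assuming inputs are inferred as names read at the top level before they are assigned; `free_variables` is a hypothetical helper, it ignores nested scopes, and it stands in for the AST/PyFlakes analysis rather than reproducing it.

```python
import ast
import builtins


def free_variables(source: str) -> set:
    """Names read at module level before any assignment in the same source."""
    defined = set(dir(builtins))
    free = set()
    for stmt in ast.parse(source).body:
        loads, stores = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Load):
                    loads.add(node.id)
                else:
                    stores.add(node.id)
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                for alias in node.names:
                    # "import a.b" binds "a"; "import x as y" binds "y"
                    stores.add((alias.asname or alias.name).split(".")[0])
            elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                stores.add(node.name)
        # Reads are checked before this statement's own writes take effect.
        free |= loads - defined
        defined |= stores
    return free


# Illustrative sub-notebook source; the contents are made up for this example.
TRAIN = (
    "from sklearn.linear_model import LinearRegression\n"
    "model = LinearRegression().fit(features, dataset)\n"
)
# Dummy values a developer might add to run the notebook on its own.
STUBS = "dataset = [1.0, 2.0]\nfeatures = [[0.0], [1.0]]\n"

inputs_without_stubs = free_variables(TRAIN)
inputs_with_stubs = free_variables(STUBS + TRAIN)
print(sorted(inputs_without_stubs))
print(sorted(inputs_with_stubs))
```

Without the stubs, `dataset` and `features` are detected as inputs; with them, nothing is detected and the notebook would run on the dummy values, which is the case a declared input signature is meant to prevent.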
Summary
Adds the design proposal for KEP-0812: Composable Kale Notebooks.
This KEP is the result of collaborative work between @ederign, @StefanoFioravanzo, and @Ya-shh as part of the GSoC 2026 program. Yash delivered a solid POC that validated the core technical approach — compiling each notebook as a KFP sub-pipeline (GraphComponent) with automatic boundary variable detection via AST/PyFlakes analysis. This KEP builds on those findings to define the full design.
Key design decisions
`notebook` cell type in Kale's tag language. Users add `notebook:train` cells alongside `step:` cells in the same notebook. The `.ipynb` file is the composition format; no external config files.
What's in the proposal
Ref: #812