DM-54879: Support reprocessing when upstream outputs are selectively retained #561
hsinfang wants to merge 6 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   88.79%   88.81%   +0.02%
==========================================
  Files         160      160
  Lines       22120    22326     +206
  Branches     2625     2656      +31
==========================================
+ Hits        19641    19829     +188
- Misses       1843     1853      +10
- Partials      636      644       +8
  return zstandard.ZstdCompressionDict(b"")
  self.comms.log.info("Training compression dictionary.")
- training_inputs: list[bytes] = []
+ training_inputs: list[bytes | bytearray | memoryview[int]] = []
I'm curious where this is coming from; AFAIK we don't use bytearray or memoryview[int] for any of these.
I took it from mypy's suggestion directly, but this has now been fixed by ac0bcb1.
@hsinfang are you imminently merging this or is it going to be a few days? I am trying to sync with the v30 release so wondered what your plan was.
@timj this won't be imminent, and can wait longer too if that makes other things easier.
The skip_existing_in behavior of QuantumGraphBuilder was previously only covered through test_separable_pipeline_executor.py, where SeparablePipelineExecutor drives AllDimensionsQuantumGraphBuilder. No tests exercised the builder directly at the unit level.
Extract the read-only metadata check and the skeleton mutation in _skip_quantum_if_metadata_exists into two helpers, _compute_skip_decision and _apply_skip_decision. No behavior change.
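The split described above, a read-only decision followed by a separate mutation, can be illustrated with a toy sketch. This is not the code in this PR: ToySkeleton, SkipDecision, and the function signatures below are hypothetical stand-ins, and only the general shape (compute first, apply second, with the original entry point composing the two) is taken from the commit message.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkipDecision:
    """Outcome of the read-only check (hypothetical type, not from the PR)."""

    quantum: str
    skip: bool


@dataclass
class ToySkeleton:
    """Stand-in for the graph skeleton: it only tracks which quanta remain."""

    quanta: set[str] = field(default_factory=set)


def compute_skip_decision(quantum: str, existing_metadata: set[str]) -> SkipDecision:
    # Read-only step: inspects state and mutates nothing, so it is easy to unit test.
    return SkipDecision(quantum=quantum, skip=quantum in existing_metadata)


def apply_skip_decision(skeleton: ToySkeleton, decision: SkipDecision) -> bool:
    # Mutation step: acts only on a decision that was computed earlier.
    if decision.skip:
        skeleton.quanta.discard(decision.quantum)
    return decision.skip


def skip_quantum_if_metadata_exists(
    skeleton: ToySkeleton, quantum: str, existing_metadata: set[str]
) -> bool:
    # The original entry point becomes a composition of the two helpers,
    # which is why the refactor changes no behavior.
    return apply_skip_decision(skeleton, compute_skip_decision(quantum, existing_metadata))
```

Keeping the decision in an immutable value means a caller can inspect or override it before anything in the skeleton changes.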
…quanta
Daytime AP runs against data produced by Prompt Processing, which does not retain all intermediate outputs. With --skip-existing-in, tasks whose metadata exists are skipped even when their outputs are absent. A downstream task that still needs to run may then be missing some of its inputs and is dropped with "no work found". The retained_dataset_types option lists the dataset types expected to exist in skip_existing_in. Dataset types that are not retained trigger backward unskipping of the ancestor quanta needed to regenerate them.
SeparablePipelineExecutor is not used by pipetask, but we might as well extend the same option to it so that it gets tested there.
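The backward unskipping described in this commit message can be sketched on a toy graph. Everything below is an assumption made for illustration: ToyQuantum, unskip_ancestors, and the single-letter names are hypothetical, and the sketch works per dataset type rather than per dataset, which is coarser than a real quantum graph. It shows the idea, not the builder's implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyQuantum:
    """Hypothetical stand-in for a quantum, keyed by dataset type names."""

    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def unskip_ancestors(
    quanta: list[ToyQuantum], skipped: set[str], retained: set[str]
) -> set[str]:
    """Return the names of skipped quanta that must run after all."""
    producer = {out: q for q in quanta for out in q.outputs}
    # Start from every quantum that is already going to run.
    to_visit = deque(q for q in quanta if q.name not in skipped)
    unskipped: set[str] = set()
    while to_visit:
        quantum = to_visit.popleft()
        for dataset_type in quantum.inputs:
            upstream = producer.get(dataset_type)
            if upstream is None or dataset_type in retained:
                # Either an overall input or a type expected to exist already.
                continue
            if upstream.name in skipped and upstream.name not in unskipped:
                # The input was not retained, so its producer must be rerun,
                # and that producer's own inputs are checked in turn.
                unskipped.add(upstream.name)
                to_visit.append(upstream)
    return unskipped


chain = [
    ToyQuantum("a", inputs=("raw",), outputs=("x",)),
    ToyQuantum("b", inputs=("x",), outputs=("y",)),
    ToyQuantum("c", inputs=("y",), outputs=("z",)),
]
```

With a and b skipped and c still to run, retaining "y" means nothing is unskipped, retaining only "x" unskips b, and retaining neither unskips both a and b.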
Checklist
package-docs build
doc/changes