Add Task.block_filter for eager block pruning before workers spawn by mzouink · Pull Request #69 · funkelab/daisy

mzouink · 2026-06-08T00:44:47Z

A new optional block_filter parameter on daisy.Task lets callers drop blocks from the dependency graph at construction time, before any worker process is spawned and before the scheduler begins handing blocks out.

This is distinct from check_function, which runs lazily per block as a worker tries to acquire one and marks the block as already-completed. block_filter runs once, in the master, and the filtered blocks never count toward total_block_count and are never offered to a worker.
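The distinction can be sketched in plain Python. This does not use daisy: `Block`, `eager_filter`, and `lazy_acquire` are hypothetical stand-ins that only mimic the semantics described above. The filter shrinks the block set once, up front. The check skips blocks one at a time during acquisition, and those blocks still counted toward the total.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for daisy's Block; only carries an id."""
    block_id: int


def eager_filter(blocks: Iterable[Block],
                 block_filter: Callable[[Block], bool]) -> List[Block]:
    """Run once, up front: blocks failing the filter are dropped entirely."""
    return [b for b in blocks if block_filter(b)]


def lazy_acquire(blocks: Iterable[Block],
                 check_function: Callable[[Block], bool]) -> Iterator[Block]:
    """Run per block at acquire time: blocks whose check passes are treated
    as already completed and skipped, but they were part of the total."""
    for b in blocks:
        if not check_function(b):
            yield b


candidates = [Block(i) for i in range(10)]


def has_work(block: Block) -> bool:
    # Assumed filter for illustration: keep even ids only.
    return block.block_id % 2 == 0


def already_done(block: Block) -> bool:
    # Assumed check for illustration: ids below 4 are already completed.
    return block.block_id < 4


kept = eager_filter(candidates, has_work)
total_block_count = len(kept)  # counts only the surviving blocks
to_process = list(lazy_acquire(kept, already_done))
```

In this toy run the filter leaves five blocks in the total, and the check then skips two of those at acquire time.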

Motivation: large blockwise inference jobs over sparse volumes (e.g. restricted to a coarse inference mask) often have tens of millions of candidate blocks but only a small fraction of real work. Today num_workers workers are bsub-launched up-front regardless, then sit idle while the master walks the block grid; with block_filter the graph collapses to just the live blocks before the worker pool is brought up.
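As an illustration of the coarse-mask use case, a filter predicate might look roughly like the following. This is a sketch under assumptions: `Roi` and `Block` are hypothetical stand-ins (only an `offset`, a `shape`, and a `write_roi` are modeled), the mask is represented as a set of occupied coarse cells, and `make_mask_filter` is an invented helper name.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple

Coord = Tuple[int, int, int]


@dataclass(frozen=True)
class Roi:
    """Hypothetical stand-in for a region of interest: offset plus shape."""
    offset: Coord
    shape: Coord


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a block; only write_roi is used here."""
    write_roi: Roi


def make_mask_filter(occupied: Set[Coord], cell: int) -> Callable[[Block], bool]:
    """Build a predicate that keeps a block if its write ROI overlaps any
    occupied cell of a coarse mask with isotropic cell size `cell`."""
    def block_filter(block: Block) -> bool:
        begin = block.write_roi.offset
        end = tuple(o + s for o, s in zip(begin, block.write_roi.shape))
        # Coarse cells touched by the half-open interval [begin, end) per axis.
        ranges = [range(b // cell, (e - 1) // cell + 1)
                  for b, e in zip(begin, end)]
        return any((z, y, x) in occupied
                   for z in ranges[0] for y in ranges[1] for x in ranges[2])
    return block_filter


occupied_cells = {(0, 0, 0), (1, 2, 0)}  # assumed coarse mask, cell size 64
keep = make_mask_filter(occupied_cells, cell=64)

inside = Block(Roi((32, 32, 32), (16, 16, 16)))      # within cell (0, 0, 0)
outside = Block(Roi((128, 0, 0), (16, 16, 16)))      # within cell (2, 0, 0)
straddling = Block(Roi((56, 120, 0), (16, 16, 16)))  # touches cell (1, 2, 0)
```

A set lookup per touched cell keeps the predicate cheap, which matters when it is called once per candidate block in the master.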

Wiring:

- `Task.__init__` accepts `block_filter: Optional[Callable[[Block], bool]]`
- `DependencyGraph.__add_task_dependency_graph` forwards it to the inner `BlockwiseDependencyGraph`
- When set, `BlockwiseDependencyGraph` materializes the surviving blocks per level once in `__init__`. `num_blocks`, `num_roots`, and `level_blocks` then read from the cached filtered set. The original lazy enumeration path is preserved as `_unfiltered_level_blocks` and used when no filter is supplied, so existing callers see no behavior change.
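The caching behavior described in the wiring can be sketched with a toy class. This is not the daisy implementation: `FilteredLevels` is a hypothetical stand-in, blocks are plain integers, and levels are given as precomputed lists. Only the names `_unfiltered_level_blocks`, `level_blocks`, `num_blocks`, and `num_roots` are borrowed from the description, and all are written as plain methods here for simplicity.

```python
from typing import Callable, Iterator, List, Optional


class FilteredLevels:
    """Toy stand-in: caches the surviving blocks per level when a filter
    is supplied, and otherwise keeps the lazy enumeration path."""

    def __init__(self, levels: List[List[int]],
                 block_filter: Optional[Callable[[int], bool]] = None):
        self._levels = levels
        self._filtered_blocks_by_level: Optional[List[List[int]]] = None
        if block_filter is not None:
            # Materialize the survivors once, level by level.
            self._filtered_blocks_by_level = [
                [b for b in self._unfiltered_level_blocks(level)
                 if block_filter(b)]
                for level in range(len(levels))
            ]

    def _unfiltered_level_blocks(self, level: int) -> Iterator[int]:
        """Original lazy path: yield blocks one by one."""
        yield from self._levels[level]

    def level_blocks(self, level: int) -> Iterator[int]:
        if self._filtered_blocks_by_level is not None:
            return iter(self._filtered_blocks_by_level[level])
        return self._unfiltered_level_blocks(level)

    def num_blocks(self) -> int:
        if self._filtered_blocks_by_level is not None:
            return sum(len(blocks) for blocks in self._filtered_blocks_by_level)
        return sum(len(blocks) for blocks in self._levels)

    def num_roots(self) -> int:
        if self._filtered_blocks_by_level is not None:
            return len(self._filtered_blocks_by_level[0])
        return len(self._levels[0])


levels = [[0, 1, 2, 3], [4, 5, 6, 7]]
filtered = FilteredLevels(levels, block_filter=lambda b: b % 2 == 0)
unfiltered = FilteredLevels(levels)
empty = FilteredLevels(levels, block_filter=lambda b: False)
```

The `empty` instance corresponds to the degenerate case where the filter drops every block: counts are zero and enumeration yields nothing.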

Tests in tests/test_scheduler.py cover the typical case (filter half the blocks, scheduler only ever returns the kept ones) and the zero-blocks-after-filter degenerate case.

When `block_filter` is set, `_apply_block_filter` can take many seconds to minutes on large block grids (e.g. ~14M candidate blocks for a sparse-mask volumetric inference). Emit an INFO log at start, drive a tqdm progress bar across all levels, and log the surviving block count at the end so callers can tell whether the master is making progress or stuck. Per-level totals are computed analytically up-front so tqdm reports a real total without exhausting the underlying generator.
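A rough sketch of that progress reporting follows, for a single level on a toy 2D grid. The helper names (`grid_block_count`, `grid_blocks`, `apply_filter_with_progress`) are hypothetical and the log wording only approximates the real messages. The point it illustrates is that the total comes from grid arithmetic, so the progress bar gets a real total while the block generator is consumed only once.

```python
import logging
import math
from typing import Callable, Iterator, List, Tuple

try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a no-op wrapper if tqdm is unavailable.
    def tqdm(iterable, **kwargs):
        return iterable

logger = logging.getLogger(__name__)

Offset = Tuple[int, int]


def grid_block_count(extent: Tuple[int, ...], stride: Tuple[int, ...]) -> int:
    """Grid positions per axis, multiplied together; no block is enumerated."""
    return math.prod(math.ceil(e / s) for e, s in zip(extent, stride))


def grid_blocks(extent: Offset, stride: Offset) -> Iterator[Offset]:
    """Lazily yield block offsets on a regular 2D grid (toy enumeration)."""
    for z in range(0, extent[0], stride[0]):
        for y in range(0, extent[1], stride[1]):
            yield (z, y)


def apply_filter_with_progress(
    extent: Offset,
    stride: Offset,
    block_filter: Callable[[Offset], bool],
    task_id: str = "task",
) -> Tuple[List[Offset], int]:
    total = grid_block_count(extent, stride)
    logger.info("Task %s: starting block_filter on %d candidate blocks...",
                task_id, total)
    progress = tqdm(grid_blocks(extent, stride), total=total,
                    desc=f"block_filter({task_id})", unit="block")
    kept = [b for b in progress if block_filter(b)]
    logger.info("Task %s: block_filter kept %d / %d blocks (%.2f%%)",
                task_id, len(kept), total, 100.0 * len(kept) / max(total, 1))
    return kept, total


kept, total = apply_filter_with_progress((100, 50), (32, 32),
                                         lambda b: b[0] == 0)
```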

mzouink · 2026-06-08T01:33:30Z

The main problem I am solving here is worker idle time when there are many blocks to skip.
The block filter runs before the workers are spawned.
Log output with the new logic:

INFO:cellmap_flow.blockwise.blockwise_processor:Processing entire dataset: [0:1638400, 0:917504, 0:307620] (1638400, 917504, 307620)
INFO:daisy.dependency_graph:Task predict_fishcellmap_flow_20260607_162955: starting block_filter on 14280198 candidate blocks across 1 levels...
block_filter(predict_fishcellmap_flow_20260607_162955): 100%|██████████| 14280198/14280198 [15:24<00:00, 15445.65block/s]
INFO:daisy.dependency_graph:Task predict_fishcellmap_flow_20260607_162955: block_filter kept 1835712 / 14280198 blocks (12.85%) — workers will only see these.
INFO:daisy.worker_pool:Worker worker (logdir=daisy_logs:hostname=10.36.106.18:port=34517:task_id=predict_fishcellmap_flow_20260607_162955:worker_id=0) (pid 2478510) exited normally
INFO:daisy.worker_pool:Worker worker (logdir=daisy_logs:hostname=10.36.106.18:port=34517:task_id=predict_fishcellmap_flow_20260607_162955:worker_id=1) (pid 2478512) exited normally
INFO:daisy.worker_pool:Worker worker (logdir=daisy_logs:hostname=10.36.106.18:port=34517:task_id=predict_fishcellmap_flow_20260607_162955:worker_id=2) (pid 2478514) exited normally
predict_fishcellmap_flow_20260607_162955 ▶:   0%|          | 239/1835712 [04:16<493:19:29,  1.03blocks/s, ⧗=0, ▶=3, ✔=239, ✗=0, ∅=0]
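
The figures in this log can be cross-checked with a few lines of arithmetic. The numbers are copied from the log lines above; the variable names are only for illustration.

```python
candidates = 14_280_198
kept = 1_835_712
filter_seconds = 15 * 60 + 24  # 15:24 elapsed in the tqdm line

kept_percent = 100.0 * kept / candidates
filter_rate = candidates / filter_seconds  # blocks per second in the master

print(f"kept {kept_percent:.2f}% of candidates")
print(f"filter throughput: about {filter_rate:.0f} blocks/s")
```

Both values agree with the log to within rounding. Note that the filter pass itself took about 15 minutes in the master before any worker was started.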

Copilot

Pull request overview

This PR introduces an optional Task.block_filter callback intended to eagerly prune blocks in the master process at dependency-graph construction time, so filtered blocks are excluded from scheduling and from total_block_count. This targets workloads with extremely sparse “real work” over large candidate block grids.

Changes:

- Add `block_filter` parameter to `daisy.Task` and forward it into dependency-graph construction.
- Implement eager filtering/caching of per-level blocks in `BlockwiseDependencyGraph`, updating `num_blocks`, `num_roots`, and `level_blocks` to reflect the filtered set.
- Add scheduler tests verifying that filtered blocks are never scheduled and that the “all blocks dropped” case yields no work.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/test_scheduler.py` | Adds tests for block pruning behavior and the zero-block degenerate case. |
| `daisy/task.py` | Exposes the new `block_filter` parameter on the public `Task` API and documents it. |
| `daisy/dependency_graph.py` | Forwards `block_filter` into `BlockwiseDependencyGraph` and materializes filtered blocks eagerly for counts/enumeration. |


        write_roi,
        process_function,
        check_function=None,
+        block_filter=None,
        init_callback_fn=None,


+        """Materialize every level's blocks once and keep only those passing
+        ``block_filter``. After this, ``num_blocks``, ``num_roots``, and
+        ``level_blocks`` all reflect the surviving set.
+
+        Walks all blocks across all levels with a tqdm progress bar so callers


    def num_roots(self):
+        if self._filtered_blocks_by_level is not None:
+            return len(self._filtered_blocks_by_level[0])
        return self._num_level_blocks(0)


rhoadesScholar · 2026-06-10T00:15:59Z

Would be good to have:

- A `read_write_conflict=True` test to cover level > 0 filtering and `num_roots()`.
- A test combining `block_filter` + `check_function` on the same task.

mzouink added 2 commits June 7, 2026 20:38

rhoadesScholar requested a review from Copilot June 9, 2026 23:45

Copilot started reviewing on behalf of rhoadesScholar June 9, 2026 23:45

Copilot AI reviewed Jun 9, 2026

mzouink wants to merge 2 commits into funkelab:master from mzouink:add-block-filter
