The question
The perception harness currently does one thing: (image) → stage. Perception in the broader microscopy system covers much more:
- Stage classification (current): "what developmental stage is this?"
- Embryo detection: "what's in this field of view? where are the embryos?"
- Focus assessment: "is this volume in focus? what's the best focal plane?"
- Quality control: "is this volume usable or was there a motion artifact?"
- Calibration perception: "is the embryo well-covered by the scan range? are the two-point galvo-piezo calibration measurements being made at the right spots? is the resulting configuration producing good volumes?"
These are all perception tasks. They share harness infrastructure (VLM calls, image handling, prompt caching) but differ in prompts, output schemas, and available tools.
Generalizability
The harness adapts per stage by selecting different configurations of (prompt, representation, examples, model, tools). It could equally adapt per task type. Stage classification is one task configuration. Embryo detection is another. Calibration assessment is another. The harness is the same.
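To make the idea of a configuration concrete, here is a minimal sketch that assumes configurations are looked up by a (task, stage) key with a task-wide fallback. TaskConfig, CONFIGS, select_config, the stage label "late", and every prompt, representation, model and tool string are hypothetical placeholders, not names from the existing harness.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TaskConfig:
    """One harness configuration; fields mirror the tuple named in the text."""
    prompt: str
    representation: str
    examples: Tuple[str, ...] = ()
    model: str = "placeholder-model"  # hypothetical, not a real model name
    tools: Tuple[str, ...] = ()


# Keyed by (task, stage); stage=None marks the task-wide default.
# All values are illustrative placeholders.
CONFIGS: Dict[Tuple[str, Optional[str]], TaskConfig] = {
    ("classify", None): TaskConfig(prompt="classify-default", representation="projection"),
    ("classify", "late"): TaskConfig(
        prompt="classify-late", representation="slices", examples=("late-example",)
    ),
    ("detect", None): TaskConfig(
        prompt="detect-default", representation="field", tools=("crop",)
    ),
}


def select_config(task: str, stage: Optional[str] = None) -> TaskConfig:
    """Prefer a stage-specific configuration; fall back to the task default."""
    specific = CONFIGS.get((task, stage))
    if specific is not None:
        return specific
    return CONFIGS[(task, None)]
```

Under this assumption, adding a task means adding entries to the table, and the per-stage adaptation the harness already does is simply the case where the task is held fixed.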
This raises the routing question: how does the harness know which task to run? Options:
- Task parameter on the call: perceiver(task="detect", image=...)
- Task-specific perceiver instances: stage_perceiver, detection_perceiver
- The orchestrator constructs the appropriate perceive function for each task
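The first two options need not be mutually exclusive. The sketch below assumes a small handler registry behind a task parameter, with the task-specific perceivers derived from it by partial application. The registry, the register decorator and the stub handlers are hypothetical; only perceiver, stage_perceiver and detection_perceiver are names taken from the options above, and the handler bodies stand in for real VLM calls.

```python
from functools import partial
from typing import Any, Callable, Dict

TaskHandler = Callable[[Any], Dict[str, Any]]
_HANDLERS: Dict[str, TaskHandler] = {}


def register(task: str) -> Callable[[TaskHandler], TaskHandler]:
    """Decorator that records a handler under a task name."""
    def decorator(fn: TaskHandler) -> TaskHandler:
        _HANDLERS[task] = fn
        return fn
    return decorator


@register("classify")
def _classify(image: Any) -> Dict[str, Any]:
    # Stub: a real handler would call the VLM with a classification prompt.
    return {"task": "classify", "stage": "unknown"}


@register("detect")
def _detect(image: Any) -> Dict[str, Any]:
    # Stub: a real handler would return embryo positions.
    return {"task": "detect", "embryos": []}


def perceiver(task: str, image: Any) -> Dict[str, Any]:
    """Option 1: route on an explicit task parameter."""
    try:
        handler = _HANDLERS[task]
    except KeyError:
        raise ValueError(f"unknown perception task: {task!r}") from None
    return handler(image)


# Option 2: task-specific perceivers are the same entry point with the task bound.
stage_perceiver = partial(perceiver, "classify")
detection_perceiver = partial(perceiver, "detect")
```

The third option then amounts to the orchestrator performing the partial application itself rather than importing prebuilt instances.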
Perception-orchestrator communication
Currently, communication is a plain Python function call with a return value: perceiver(embryo_id, timepoint, image, timestamp) → PerceptionOutput.
A richer interaction pattern emerges for multi-step tasks like calibration:
orchestrator: perceiver.detect(field_image) → embryo positions
orchestrator: move_stage(x1, y1)
orchestrator: perceiver.assess_focus(volume) → focus quality
orchestrator: perceiver.classify(image) → developmental stage
The intelligence is in the orchestrator's plan, not in the communication channel. The perceiver is a tool the orchestrator uses — sometimes for classification, sometimes for detection. The "dance" between them is the orchestrator making multiple calls with different tasks, not a conversation between two agents.
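The call sequence above can be sketched as an ordinary function that holds the plan and uses the perceiver as a tool. This is an illustration under stated assumptions: StubPerceiver returns canned values in place of VLM calls, and survey_embryos, acquire_volume and min_focus are hypothetical names introduced here. The volume acquisition between moving the stage and assessing focus is also assumed, since the trace does not spell it out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class StubPerceiver:
    """Stand-in perceiver that records which tasks were invoked."""
    calls: List[str] = field(default_factory=list)

    def detect(self, field_image: Any) -> List[Tuple[float, float]]:
        self.calls.append("detect")
        return [(10.0, 20.0), (30.0, 40.0)]  # canned embryo positions

    def assess_focus(self, volume: Any) -> float:
        self.calls.append("assess_focus")
        return 0.9  # canned focus score

    def classify(self, image: Any) -> str:
        self.calls.append("classify")
        return "unknown"


def survey_embryos(
    perceiver: StubPerceiver,
    field_image: Any,
    move_stage: Callable[[float, float], None],
    acquire_volume: Callable[[], Any],
    min_focus: float = 0.5,
) -> List[Dict[str, Any]]:
    """The plan lives here; each perceiver call is one request-response step."""
    results = []
    for x, y in perceiver.detect(field_image):
        move_stage(x, y)
        volume = acquire_volume()
        focus = perceiver.assess_focus(volume)
        # Classify only when the volume is sharp enough to be worth a call.
        stage = perceiver.classify(volume) if focus >= min_focus else None
        results.append({"position": (x, y), "focus": focus, "stage": stage})
    return results
```

Nothing in the loop requires the perceiver to know that it is part of a calibration run, which is the sense in which the intelligence sits in the plan.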
Is function-call API the right level?
Probably yes. Perception is fundamentally request-response. The alternative — two LLM agents conversing in natural language — adds latency and cost for marginal benefit. Structured function calls are more reliable and faster.
The richness comes from:
- The perceiver accumulating context over time (session)
- The orchestrator choosing which perception task to invoke
- The result being rich enough to inform the orchestrator's next decision
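One way these three sources of richness could coexist with a plain function-call API is sketched below. It assumes the session is a list of prior results that the wrapper passes to the underlying task runner on every call. PerceptionResult, SessionPerceiver, fake_run_task and all field names are hypothetical and are not the existing PerceptionOutput type.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class PerceptionResult:
    """A result carrying more than a bare label (illustrative fields)."""
    task: str
    value: Any
    confidence: float
    notes: str = ""


RunTask = Callable[[str, Any, Tuple[PerceptionResult, ...]], PerceptionResult]


class SessionPerceiver:
    """Wraps a task runner and accumulates its results as session context."""

    def __init__(self, run_task: RunTask) -> None:
        self._run_task = run_task
        self.history: List[PerceptionResult] = []

    def __call__(self, task: str, image: Any) -> PerceptionResult:
        # The runner sees an immutable snapshot of everything perceived so far.
        result = self._run_task(task, image, tuple(self.history))
        self.history.append(result)
        return result


def fake_run_task(
    task: str, image: Any, history: Tuple[PerceptionResult, ...]
) -> PerceptionResult:
    # Stub standing in for a VLM call that would use the history as context.
    return PerceptionResult(
        task=task, value=None, confidence=0.5, notes=f"{len(history)} prior results"
    )
```

The orchestrator still chooses the task on each call; the session only changes what the perceiver knows when it answers.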
What this means for the harness
The harness should be able to handle different perception tasks without being rewritten for each one. The experiment framework already supports this — different experiments can implement different task types. The routing/configuration layer is what's missing.
Open questions
- Should calibration perception (currently in gently's calibration_tools.py — embryo coverage assessment, galvo-piezo two-point calibration, volume quality checks) move into gently-perception?
- How does task routing interact with the experiment framework? Are detection experiments a dimension, or a separate concern?
- Does the perceive function signature need to change for non-classification tasks (different output schema)?