1 change: 1 addition & 0 deletions README.md
@@ -178,6 +178,7 @@ That's the loop: **talk → plan → inspect.** With hardware connected (drop `-
| [What Gently Can Do](docs/guides/capabilities.md) | Everyone | Perception, detection, plan mode, memory, mesh, safety |
| [Build a Plugin](docs/guides/build-a-plugin.md) | Developers | Create organism and hardware plugins for other modalities |
| [Hardware Setup](docs/guides/hardware-setup.md) | Labs | Connect a diSPIM, start the device layer, first acquisition |
| [Datastore Audit](docs/datastore-audit.md) | Developers | Decide whether Gently3 can evolve or needs a Gently4 store API |

## Architecture

96 changes: 96 additions & 0 deletions docs/datastore-audit.md
@@ -0,0 +1,96 @@
# Datastore Audit

The FileStore safety work is a prerequisite for a larger question: whether the
current Gently3 datastore is sound enough to evolve, or whether a Gently4 store
API is needed. Do the audit before choosing a migration.

## Audit Questions

For every data product Gently creates or consumes, answer:

- Is this durable source data, derived/recomputable data, runtime state, or UI
cache?
- Where is it stored on disk?
- Is the path schema documented and safe against untrusted identifiers?
- Is the file format stable, versioned, and readable without importing runtime
hardware dependencies?
- Can a biologist browse it by session, sample, timepoint, modality, and
provenance?
- Can downstream analysis find the raw data and the metadata needed to interpret
it?
- Is there data Gently uses but does not persist?
- Is there data Gently stores but never reads, displays, exports, or validates?

## Audit Command

Run a first-pass inventory against a Gently3/FileStore root:

```shell
python -m gently.core.datastore_audit D:/Gently3
```

Use JSON output for scripts:

```shell
python -m gently.core.datastore_audit D:/Gently3 --json --output audit.json
```

The command counts session metadata, timelines/events, interaction logs,
snapshots, volumes, sidecars, sample records, projections, perception traces,
debug exports, profile spans, campaign plans, incoming files, and logs. It also
flags obvious browseability/provenance gaps, including missing `session.yaml`,
unreadable YAML, volume TIFFs without `.meta.yaml` sidecars, snapshot TIFFs
without sidecars, and sample directories without `embryo.yaml`.
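
For scripted use, the JSON report can be consumed directly. The sketch below assumes the report mirrors the `DatastoreAudit` dataclass in `gently/core/datastore_audit.py`: a top-level `sessions` list whose entries carry `session_dir`, `session_id`, and their own `gaps`. The helper name `gaps_by_session` and the sample data are illustrative and not part of Gently. Per the module's `main()`, the command also returns exit status 1 whenever any gap is reported, so a CI job can gate on the status alone.

```python
import json
from pathlib import PurePath
from typing import Dict, List


def gaps_by_session(report: Dict) -> Dict[str, List[str]]:
    """Map each session label to its gaps, skipping sessions with none."""
    grouped: Dict[str, List[str]] = {}
    for session in report.get("sessions", []):
        # Fall back to the directory name when no session_id was recorded.
        label = session.get("session_id") or PurePath(session["session_dir"]).name
        if session.get("gaps"):
            grouped[label] = list(session["gaps"])
    return grouped


# Hypothetical report fragment standing in for the contents of audit.json.
sample = json.loads("""
{"sessions": [
  {"session_dir": "/data/Gently3/sessions/2024_a", "session_id": "2024_a", "gaps": []},
  {"session_dir": "/data/Gently3/sessions/2024_b", "session_id": null,
   "gaps": ["missing or unreadable session.yaml"]}
]}
""")

for label, gaps in gaps_by_session(sample).items():
    print(f"{label}: {len(gaps)} gap(s)")
```

Reading the per-session `gaps` lists avoids parsing the prefixed strings in the top-level `gaps` list.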

## Inventory Template

| Data product | Current path/table | Class | Producer | Consumer | Browse need | Gap |
| --- | --- | --- | --- | --- | --- | --- |
| session metadata | `sessions/*/session.yaml` | durable | launcher/session manager | UI, resume, audit | list sessions | check schema version |
| timeline/events | `timeline.jsonl`, `events.jsonl` | durable | event capture | replay, debug export | filter by time/type | standardize event names |
| interaction log | `interaction_log.jsonl` | durable | agent runtime | debug export | inspect chat/tool flow | include profile links |
| embryo/sample state | `embryos/*/embryo.yaml` | durable | marking/calibration/acquisition | tools, UI, resume | browse sample state | generalize beyond embryos |
| volumes/snapshots | `embryos/*/volumes/*.tif`, `snapshots/*.tif` | durable source | acquisition | perception, analysis | preview, export | verify metadata sidecars |
| projections | `embryos/*/projections/*.jpg` | derived | store/perception | UI | preview | mark recomputable |
| perception traces | `embryos/*/traces/*.json`, `embryos/*/predictions.jsonl` | durable derived | perception | UI/debug/eval | inspect reasoning | link to source volume |
| plans/campaigns | `agent/campaigns/*` | durable | plan mode | UI, execution | browse by campaign | align with session data |
| debug bundles | `debug_exports/*` | derived | debug exporter | coding agent | download/share | retention policy |
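
The "verify metadata sidecars" gap in the volumes/snapshots row can be checked mechanically. A minimal sketch, assuming the convention used by the audit module in this PR, where the sidecar for `name.tif` is `name.meta.yaml` in the same directory. `sidecar_for` and `missing_sidecars` are illustrative helpers, not existing Gently functions.

```python
from pathlib import Path
from typing import List


def sidecar_for(image: Path) -> Path:
    """Return the expected sidecar path: name.tif -> name.meta.yaml."""
    return image.with_suffix(".meta.yaml")


def missing_sidecars(directory: Path) -> List[Path]:
    """List TIFFs in `directory` that have no metadata sidecar next to them."""
    return sorted(
        tif for tif in Path(directory).glob("*.tif")
        if not sidecar_for(tif).exists()
    )
```

Only the final suffix is replaced, so a file named `stack.ome.tif` maps to `stack.ome.meta.yaml`.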

## Biologist-Facing Browser

A useful data browser should organize by:

- session and experimental intent
- sample or embryo
- timepoint
- modality: overview, lightsheet volume, projection, perception, plan, event
- provenance: acquisition settings, calibration, exposure, software version,
operator action, and agent decision

The browser should distinguish raw source data from derived previews and should
always expose the raw file path/export path for analysis outside Gently.
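
One way to back the session and sample axes is to derive them from the on-disk layout instead of a separate database. The sketch below assumes the layout the audit module globs for: `sessions/<session>/embryos/<sample>/volumes/*.tif` for raw volumes and `.../projections/*` for derived previews. `SampleEntry` and `index_samples` are hypothetical names. Timepoint and provenance cannot be recovered from paths alone and would have to come from the sidecars, so they are left out here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class SampleEntry:
    """Raw and derived files for one sample in one session."""

    raw_volumes: List[Path] = field(default_factory=list)
    derived_previews: List[Path] = field(default_factory=list)


def index_samples(root: Path) -> Dict[Tuple[str, str], SampleEntry]:
    """Group files by (session directory name, sample directory name)."""
    index: Dict[Tuple[str, str], SampleEntry] = {}
    for sample_dir in sorted(Path(root).glob("sessions/*/embryos/*")):
        if not sample_dir.is_dir():
            continue
        # sample_dir is <root>/sessions/<session>/embryos/<sample>.
        key = (sample_dir.parents[1].name, sample_dir.name)
        index[key] = SampleEntry(
            raw_volumes=sorted(sample_dir.glob("volumes/*.tif")),
            derived_previews=sorted(
                p for p in sample_dir.glob("projections/*") if p.is_file()
            ),
        )
    return index
```

Keeping raw volumes and derived previews in separate fields lets a browser label them differently while still exposing the raw file path for export.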

## Gently4 Decision Criteria

Stay on Gently3 and migrate incrementally if:

- path schemas can be versioned in place
- all durable data can be discovered from `sessions/`
- missing metadata can be added as sidecars without breaking existing sessions
- the UI can browse the store without special-case crawlers

Define a Gently4 API if:

- durable data are split across incompatible roots
- old sessions cannot be safely migrated or indexed
- common queries require scanning many large files
- sample abstractions cannot generalize without changing the store contract
- provenance links between raw data, perception, plans, and operator actions are
not representable in the current layout

## Safety Tie-In

The path and YAML hardening in this PR should remain part of any future
datastore design. A biologist-facing browser or migration API cannot be trusted
unless user-controlled identifiers stay inside the store root and legacy files
fail closed when they contain unsafe constructors.
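
The containment rule can be sketched as follows. This is not the FileStore implementation, which is not shown here; `resolve_inside_root` is a hypothetical helper illustrating one common approach: resolve the candidate path and reject it unless it still lies under the resolved store root.

```python
from pathlib import Path


def resolve_inside_root(root: Path, *parts: str) -> Path:
    """Join `parts` under `root`; raise ValueError if the result escapes it."""
    resolved_root = Path(root).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = resolved_root.joinpath(*parts).resolve()
    try:
        # relative_to raises ValueError when candidate is not under the root.
        candidate.relative_to(resolved_root)
    except ValueError:
        raise ValueError(f"path escapes store root: {parts!r}") from None
    return candidate
```

Raising `ValueError` lines up with how the `gently/app/agent.py` change in this PR treats an invalid embryo path: it catches `ValueError` and returns the empty result.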
12 changes: 7 additions & 5 deletions gently/app/agent.py
@@ -1303,13 +1303,15 @@ def _compute_imported_dose(self, source_session_id: str, embryo_id: str) -> Dict
         if not self.store:
             return result

-        # FileStore exposes _session_dir(session_id) → resolved Path.
-        session_dir_fn = getattr(self.store, '_session_dir', None)
-        sd = session_dir_fn(source_session_id) if callable(session_dir_fn) else None
-        if sd is None:
+        # FileStore exposes _embryo_dir(session_id, embryo_id) with validation.
+        embryo_dir_fn = getattr(self.store, '_embryo_dir', None)
+        if not callable(embryo_dir_fn):
             return result

-        vols_dir = Path(sd) / 'embryos' / embryo_id / 'volumes'
+        try:
+            vols_dir = Path(embryo_dir_fn(source_session_id, embryo_id)) / 'volumes'
+        except (FileNotFoundError, ValueError):
+            return result
         if not vols_dir.is_dir():
             return result

184 changes: 184 additions & 0 deletions gently/core/datastore_audit.py
@@ -0,0 +1,184 @@
"""Audit a Gently3 FileStore root for data inventory and obvious gaps."""

from __future__ import annotations

import argparse
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, List, Optional

import yaml


@dataclass
class SessionAudit:
    """Audit summary for one session directory."""

    session_dir: str
    session_id: Optional[str]
    artifact_counts: Dict[str, int] = field(default_factory=dict)
    gaps: List[str] = field(default_factory=list)


@dataclass
class DatastoreAudit:
    """Top-level datastore audit report."""

    root: str
    session_count: int
    artifact_counts: Dict[str, int]
    gaps: List[str]
    sessions: List[SessionAudit]

    def to_dict(self) -> Dict:
        return asdict(self)


_COUNT_PATTERNS = {
    "session_metadata": ["session.yaml"],
    "timeline_logs": ["timeline.jsonl", "events.jsonl"],
    "interaction_logs": ["interaction_log.jsonl"],
    "snapshots": ["snapshots/*.tif"],
    "snapshot_metadata": ["snapshots/*.meta.yaml"],
    "sample_records": ["embryos/*/embryo.yaml"],
    "volumes": ["embryos/*/volumes/*.tif"],
    "volume_metadata": ["embryos/*/volumes/*.meta.yaml"],
    "projections": ["embryos/*/projections/*"],
    "perception_predictions": ["embryos/*/predictions.jsonl"],
    "perception_traces": ["embryos/*/traces/*.json"],
    "debug_exports": ["debug_exports/**/debug_context.md"],
    "profile_spans": ["profile.jsonl", "profile_spans.jsonl"],
}


def audit_datastore(root: Path) -> DatastoreAudit:
    """Scan a FileStore root and return a structured audit report."""
    root = Path(root)
    sessions_root = root / "sessions"
    sessions: List[SessionAudit] = []
    gaps: List[str] = []
    totals: Dict[str, int] = {key: 0 for key in _COUNT_PATTERNS}
    totals.update({"campaign_plans": 0, "plan_history": 0, "incoming_files": 0, "logs": 0})

    if not sessions_root.exists():
        gaps.append(f"missing sessions directory: {sessions_root}")
    else:
        for session_dir in sorted(p for p in sessions_root.iterdir() if p.is_dir()):
            session = _audit_session(session_dir)
            sessions.append(session)
            for key, count in session.artifact_counts.items():
                totals[key] = totals.get(key, 0) + count
            gaps.extend(f"{Path(session.session_dir).name}: {gap}" for gap in session.gaps)

    totals["campaign_plans"] = _count(root / "agent" / "campaigns", "**/plan/current.yaml")
    totals["plan_history"] = _count(root / "agent" / "campaigns", "**/plan/history/*.yaml")
    totals["incoming_files"] = _count(root / "incoming", "*")
    totals["logs"] = _count(root / "logs", "*")

    return DatastoreAudit(
        root=str(root),
        session_count=len(sessions),
        artifact_counts=totals,
        gaps=gaps,
        sessions=sessions,
    )


def format_audit_markdown(report: DatastoreAudit) -> str:
    """Render an audit report as concise Markdown."""
    lines = [
        f"# Datastore Audit: `{report.root}`",
        "",
        f"Sessions: {report.session_count}",
        "",
        "## Artifact Counts",
        "",
    ]
    for key, count in sorted(report.artifact_counts.items()):
        lines.append(f"- {key}: {count}")

    lines.extend(["", "## Gaps", ""])
    if report.gaps:
        lines.extend(f"- {gap}" for gap in report.gaps)
    else:
        lines.append("- none")

    lines.extend(["", "## Sessions", ""])
    for session in report.sessions:
        label = session.session_id or Path(session.session_dir).name
        lines.append(f"- `{label}`: {sum(session.artifact_counts.values())} counted artifacts")
    return "\n".join(lines) + "\n"


def _audit_session(session_dir: Path) -> SessionAudit:
    counts = {
        key: sum(_count(session_dir, pattern) for pattern in patterns)
        for key, patterns in _COUNT_PATTERNS.items()
    }
    gaps: List[str] = []
    session_data = _read_yaml(session_dir / "session.yaml", gaps)
    # Keep session_id as None when the key is absent, rather than the string "None".
    raw_session_id = session_data.get("session_id")
    session_id = str(raw_session_id) if raw_session_id is not None else None

    if not session_data:
        gaps.append("missing or unreadable session.yaml")

    for volume in session_dir.glob("embryos/*/volumes/*.tif"):
        meta = volume.with_suffix(".meta.yaml")
        if not meta.exists():
            gaps.append(f"volume missing metadata sidecar: {volume.relative_to(session_dir)}")

    for snapshot in session_dir.glob("snapshots/*.tif"):
        meta = snapshot.with_suffix(".meta.yaml")
        if not meta.exists():
            gaps.append(f"snapshot missing metadata sidecar: {snapshot.relative_to(session_dir)}")

    for embryo in session_dir.glob("embryos/*"):
        if embryo.is_dir() and not (embryo / "embryo.yaml").exists():
            gaps.append(f"sample missing embryo.yaml: {embryo.relative_to(session_dir)}")

    return SessionAudit(
        session_dir=str(session_dir),
        session_id=session_id,
        artifact_counts=counts,
        gaps=gaps,
    )


def _count(root: Path, pattern: str) -> int:
    if not root.exists():
        return 0
    return sum(1 for p in root.glob(pattern) if p.is_file())


def _read_yaml(path: Path, gaps: List[str]) -> Dict:
    if not path.exists():
        return {}
    try:
        data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
    except Exception as exc:
        gaps.append(f"unreadable YAML {path.name}: {exc}")
        return {}
    return data if isinstance(data, dict) else {}


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Audit a Gently3 datastore root")
    parser.add_argument("root", type=Path, help="Gently3/FileStore root")
    parser.add_argument("--json", action="store_true", help="Write JSON instead of Markdown")
    parser.add_argument("--output", type=Path, help="Optional report output path")
    args = parser.parse_args(argv)

    report = audit_datastore(args.root)
    text = (
        json.dumps(report.to_dict(), indent=2)
        if args.json
        else format_audit_markdown(report)
    )
    if args.output:
        args.output.parent.mkdir(parents=True, exist_ok=True)
        args.output.write_text(text, encoding="utf-8")
    else:
        print(text, end="" if text.endswith("\n") else "\n")
    return 1 if report.gaps else 0


if __name__ == "__main__":
    raise SystemExit(main())