feat: Weights & Biases sink integration#59
Merged
bordeauxred merged 5 commits intoMay 3, 2026
Merged
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a Weights & Biases integration for ClawLoop, featuring the WandbSink class for logging metrics such as reward curves, playbook growth, and layer state hashes. The changes include a demonstration script, a comprehensive test suite, and the addition of wandb as an optional dependency. Review feedback identified an incorrect version constraint for the wandb library in pyproject.toml and recommended synchronizing the internal step counter within log_iteration to prevent state inconsistencies when mixing logging APIs.
…aganthos#51) * refactor: extract _PurpleAgentBase for CAR + entropic adapters — fixes aganthos#40 Pulls the shared A2A scaffolding (tool schema conversion, assistant-msg normalization, session state, tool-call id reconciliation, harness update) into clawloop/environments/_purple_base.py. CAR and entropic adapters now override only the two bench-specific seams: _build_initial_messages and _format_a2a_response. No behavior change. 600 lines deleted, 379 added across the three files. * refactor: hoist stdlib imports in _purple_base, clarify reconcile scope Addresses Gemini review on aganthos#51: - Move socket, time, httpx to module-level imports (PEP 8). - Expand docstring + comment on _reconcile_tool_call_id to explain why it intentionally stops at the most-recent assistant message. No behavior change.
…nthos#58) Split the 486-line learning_loop() god-function into three focused helper classes plus five module-private glue functions. New modules: - clawloop/core/runner.py (EpisodeCollectorRunner — task sampling + 3-way adapter dispatch) - clawloop/core/archive_recorder.py (ArchiveRecorder — owns run-level counters + all archive writes) - clawloop/core/transaction.py (LayerTransaction — two-phase fb→optim→rollback protocol with cross-layer rollback invariant) learning_loop() body: 486 → 100 lines. loop.py total: 742 → 337 lines. Also: - Add Harness.pending_paradigm_insights() so LayerTransaction can query paradigm-tagged insights without touching _pending directly. - Move iter_cost accumulation outside the archive try-block in ArchiveRecorder so total_cost_tokens is tracked even when log_iteration fails. No public API change on learning_loop. Full suite unchanged (961 passed / 42 skipped) plus 28 new unit tests covering the extracted helpers (runner 97%, archive_recorder 92%, transaction 91% coverage).
Contributor
|
Thanks a lot @dantp-ai |
bordeauxred
pushed a commit
that referenced
this pull request
May 22, 2026
* feat: Weights & Biases sink integration * refactor: extract common base for purple-agent adapters (#40) (#51) * refactor: extract _PurpleAgentBase for CAR + entropic adapters — fixes #40 Pulls the shared A2A scaffolding (tool schema conversion, assistant-msg normalization, session state, tool-call id reconciliation, harness update) into clawloop/environments/_purple_base.py. CAR and entropic adapters now override only the two bench-specific seams: _build_initial_messages and _format_a2a_response. No behavior change. 600 lines deleted, 379 added across the three files. * refactor: hoist stdlib imports in _purple_base, clarify reconcile scope Addresses Gemini review on #51: - Move socket, time, httpx to module-level imports (PEP 8). - Expand docstring + comment on _reconcile_tool_call_id to explain why it intentionally stops at the most-recent assistant message. No behavior change. * refactor: extract helpers from learning_loop — fixes #39 (#58) Split the 486-line learning_loop() god-function into three focused helper classes plus five module-private glue functions. New modules: - clawloop/core/runner.py (EpisodeCollectorRunner — task sampling + 3-way adapter dispatch) - clawloop/core/archive_recorder.py (ArchiveRecorder — owns run-level counters + all archive writes) - clawloop/core/transaction.py (LayerTransaction — two-phase fb→optim→rollback protocol with cross-layer rollback invariant) learning_loop() body: 486 → 100 lines. loop.py total: 742 → 337 lines. Also: - Add Harness.pending_paradigm_insights() so LayerTransaction can query paradigm-tagged insights without touching _pending directly. - Move iter_cost accumulation outside the archive try-block in ArchiveRecorder so total_cost_tokens is tracked even when log_iteration fails. No public API change on learning_loop. Full suite unchanged (961 passed / 42 skipped) plus 28 new unit tests covering the extracted helpers (runner 97%, archive_recorder 92%, transaction 91% coverage). * fix: sync log_iteration with self._step * Ignore wandb dir --------- Co-authored-by: kiranannadatha8 <87536091+kiranannadatha8@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
What changed and why.
Test plan
pytest tests/ -xpasses