Idempotency and multi-agent collaboration #55

@etaroza

Idempotent Operations & Multi-Agent Coordination

Problem Statement

The dialectical framework performs multi-step graph operations that can fail mid-execution. Currently, each save() and connect() call auto-commits independently to the graph database. If an operation fails partway through:

  1. Orphaned nodes: Partial graph state with no way to detect or clean up
  2. Duplicate work: Re-running creates duplicates instead of resuming
  3. No authorship: Can't distinguish "my incomplete work" from "someone else's work"
  4. Wasted LLM calls: Expensive AI reasoning is lost on failure

Example: Action-Reflection Failure

think_action_reflection.py performs 22+ sequential save/connect operations:

1. problem_rationale.save()          ✓ committed
2. ac_re_wu.save()                   ✓ committed
3. ac_re_wu.rationales.connect()     ✓ committed
4. [loop] component.save()           ✓ committed (×6)
5. [loop] manager.connect()          ✓ committed (×6)
6. transformation.save()             ✓ committed
7. transformation.ac_re.connect()    ✓ committed
8. wu.transformation.connect()       ✓ committed
9. rationale1.save()                 ✓ committed
10. transition1.save()               ✗ FAILURE HERE
    ─────────────────────────────────
11. transition1.source.connect()     ✗ never executed
12. transition1.target.connect()     ✗ never executed
... (more operations)

Result: The graph contains a partial transformation with no transitions. Re-running would create a duplicate transformation, duplicate rationales, and duplicate components.


Design Goals

  1. Resumability: Detect incomplete operations and complete them
  2. Idempotency: Re-running the same operation produces the same result (no duplicates)
  3. Multi-agent support: Each agent tracks its own work independently
  4. Minimal overhead: Don't complicate simple operations
  5. Graph-native: Use the graph DB itself as the state tracker (no external dependencies)

Proposed Solution: Operation Tracking Pattern

Core Concept

Before doing work, create an Operation node that tracks the unit of work. Link all created artifacts to it. Mark complete when done. On failure/restart, query pending operations and resume.

┌─────────────────────────────────────────────────────┐
│ 1. Create Operation(status="pending")               │
│ 2. Do LLM work (expensive, cache results on op)     │
│ 3. Create nodes, link each to Operation             │
│ 4. Create relationships                             │  ← Failure here?
│ 5. Mark Operation(status="complete")                │    Query op.artifacts,
└─────────────────────────────────────────────────────┘    complete what's missing
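
The lifecycle in the diagram can be sketched without a database. This is a minimal in-memory illustration, not the framework's implementation: InMemoryOperation, run_with_tracking, and the steps list are hypothetical names, and a dict keyed by step name stands in for the CREATED_ARTIFACT links.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InMemoryOperation:
    """Hypothetical stand-in for the Operation node (no graph DB involved)."""
    status: str = "pending"
    artifacts: dict[str, str] = field(default_factory=dict)  # step name -> created value


def run_with_tracking(
    op: InMemoryOperation,
    steps: list[tuple[str, Callable[[], str]]],
) -> None:
    """Run each step once; steps already recorded on the operation are skipped."""
    for name, create in steps:
        if name in op.artifacts:
            continue  # created on a previous attempt, do not duplicate
        op.artifacts[name] = create()  # "create node, link it to the operation"
    op.status = "complete"


calls: list[str] = []          # records every node that actually gets created
fail_once = {"armed": True}    # makes transition1 fail on the first attempt only


def make(name: str) -> Callable[[], str]:
    def create() -> str:
        if name == "transition1" and fail_once["armed"]:
            fail_once["armed"] = False
            raise RuntimeError("simulated failure")
        calls.append(name)
        return f"node:{name}"
    return create


steps = [(n, make(n)) for n in ("transformation", "rationale1", "transition1")]
op = InMemoryOperation()

try:
    run_with_tracking(op, steps)   # first attempt fails at transition1
except RuntimeError:
    pass

run_with_tracking(op, steps)       # second attempt creates only transition1
```

After the second call, each step has run exactly once and the operation is complete; the first two artifacts are reused rather than recreated.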

Operation Node

class Operation(BaseNode):
    """
    Tracks a unit of work for resumability and authorship.

    The Operation node lives in the same graph database as domain nodes,
    providing atomic state tracking without external dependencies.
    """

    # What kind of work
    operation_type: str              # "action_reflection", "synthesis", "polarity", etc.

    # Lifecycle
    status: str = "pending"          # pending → complete | failed
    started_at: datetime             # When operation began
    completed_at: Optional[datetime] # When operation finished (complete or failed)

    # Authorship
    created_by: str                  # Agent/app identifier (e.g., "agent-123", "app-main")

    # Target
    target_uid: str                  # The node being operated on (e.g., WisdomUnit uid)
    target_type: str                 # The type of target node (e.g., "WisdomUnit")

    # Cached LLM results (to avoid re-running expensive calls on resume)
    cached_results: Optional[str]    # JSON-serialized LLM response data

    # All nodes created during this operation
    artifacts: ClassVar[RelationshipManager[BaseNode]] = RelationshipTo(
        "BaseNode",
        "CREATED_ARTIFACT"
    )
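
Because status is a plain string, nothing stops an invalid value or an invalid move between states. The sketch below shows one way to make the lifecycle explicit. It assumes that pending is the only non-terminal state and that complete and failed are both terminal; the issue does not state this, and OperationStatus, ALLOWED_TRANSITIONS, and transition are hypothetical names.

```python
from enum import Enum


class OperationStatus(str, Enum):
    PENDING = "pending"
    COMPLETE = "complete"
    FAILED = "failed"


# Assumption: pending is the only non-terminal state.
ALLOWED_TRANSITIONS = {
    OperationStatus.PENDING: {OperationStatus.COMPLETE, OperationStatus.FAILED},
    OperationStatus.COMPLETE: set(),
    OperationStatus.FAILED: set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status value, or raise ValueError if the move is not allowed."""
    src, dst = OperationStatus(current), OperationStatus(new)
    if dst not in ALLOWED_TRANSITIONS[src]:
        raise ValueError(f"cannot move operation from {src.value} to {dst.value}")
    return dst.value
```

Under this assumption, an operation marked failed is not a candidate for resuming when the lookup matches only status 'pending'. Resuming after a caught exception would then need either a retry path out of failed or a cleanup of the failed operation before re-running.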

Idempotency Key

Operations are uniquely identified by the combination:

idempotency_key = hash(created_by + operation_type + target_uid)

This ensures:

  • Same agent doing same operation on same target → finds existing operation
  • Different agent doing same operation → creates separate operation
  • Same agent doing different operation → creates separate operation
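
A minimal sketch of how such a key could be computed, assuming a SHA-256 digest and a separator character that does not occur in the three fields (both are choices made here, not part of the proposal). Python's built-in hash() is randomized per process for strings, so it would not give a key that is stable across runs, and concatenating without a separator would make ("ab", "c") and ("a", "bc") collide.

```python
import hashlib

# Assumption: the ASCII unit separator never appears in agent IDs, types, or UIDs.
_SEPARATOR = "\x1f"


def idempotency_key(created_by: str, operation_type: str, target_uid: str) -> str:
    """Build a key that is stable across processes from the three identifying fields."""
    payload = _SEPARATOR.join((created_by, operation_type, target_uid))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = idempotency_key("agent-a", "action_reflection", "wu-1")
key_a_again = idempotency_key("agent-a", "action_reflection", "wu-1")
key_b = idempotency_key("agent-b", "action_reflection", "wu-1")
```

Here key_a and key_a_again are identical, while key_b differs because the agent differs.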

Implementation Plan

Phase 1: Operation Node Foundation

1.1 Create Operation node class

# src/dialectical_framework/graph/nodes/operation.py

from __future__ import annotations

from datetime import datetime
from typing import ClassVar, Optional

from dialectical_framework.graph.nodes.base_node import BaseNode
from dialectical_framework.graph.relationship_manager import (
    RelationshipManager,
    RelationshipTo,
)


class Operation(BaseNode):
    """Tracks a unit of work for resumability and authorship."""

    operation_type: str
    status: str = "pending"  # pending, complete, failed
    started_at: datetime
    completed_at: Optional[datetime] = None
    created_by: str
    target_uid: str
    target_type: str
    cached_results: Optional[str] = None  # JSON serialized
    error_message: Optional[str] = None   # If status=failed

    artifacts: ClassVar[RelationshipManager[BaseNode]] = RelationshipTo(
        "BaseNode",
        "CREATED_ARTIFACT",
        cardinality=(0, None),
    )

1.2 Create OperationManager service

# src/dialectical_framework/graph/operation_manager.py

from __future__ import annotations

import json
from datetime import datetime
from typing import Any, Optional, TYPE_CHECKING, Union

from dependency_injector.wiring import Provide, inject
from gqlalchemy import Memgraph, Neo4j

from dialectical_framework.enums.di import DI
from dialectical_framework.graph.nodes.operation import Operation

if TYPE_CHECKING:
    from dialectical_framework.graph.nodes.base_node import BaseNode


class OperationManager:
    """Manages operation lifecycle for idempotent, resumable work."""

    @inject
    def find_pending(
        self,
        agent_id: str,
        operation_type: str,
        target_uid: str,
        graph_db: Union[Memgraph, Neo4j] = Provide[DI.graph_db],
    ) -> Optional[Operation]:
        """Find an incomplete operation for this agent/type/target."""
        query = """
        MATCH (op:Operation {
            created_by: $agent_id,
            operation_type: $op_type,
            target_uid: $target_uid,
            status: 'pending'
        })
        RETURN op
        LIMIT 1
        """
        results = list(graph_db.execute_and_fetch(query, {
            "agent_id": agent_id,
            "op_type": operation_type,
            "target_uid": target_uid,
        }))

        if results:
            return results[0]["op"]
        return None

    def start(
        self,
        agent_id: str,
        operation_type: str,
        target_uid: str,
        target_type: str,
    ) -> Operation:
        """Start a new operation (or return existing pending one)."""
        # Check for existing pending operation
        existing = self.find_pending(agent_id, operation_type, target_uid)
        if existing:
            return existing

        # Create new operation
        operation = Operation(
            operation_type=operation_type,
            status="pending",
            started_at=datetime.utcnow(),
            created_by=agent_id,
            target_uid=target_uid,
            target_type=target_type,
        )
        operation.save()
        return operation

    def cache_results(self, operation: Operation, results: dict[str, Any]) -> None:
        """Cache LLM results on operation for resume capability."""
        operation.cached_results = json.dumps(results)
        operation.save()

    def get_cached_results(self, operation: Operation) -> Optional[dict[str, Any]]:
        """Retrieve cached LLM results."""
        if operation.cached_results:
            return json.loads(operation.cached_results)
        return None

    def track_artifact(self, operation: Operation, node: BaseNode) -> None:
        """Link a created node to this operation."""
        operation.artifacts.connect(node)

    def complete(self, operation: Operation) -> None:
        """Mark operation as successfully completed."""
        operation.status = "complete"
        operation.completed_at = datetime.utcnow()
        operation.save()

    def fail(self, operation: Operation, error: str) -> None:
        """Mark operation as failed with error message."""
        operation.status = "failed"
        operation.completed_at = datetime.utcnow()
        operation.error_message = error
        operation.save()

    @inject
    def get_artifacts_by_type(
        self,
        operation: Operation,
        node_type: str,
        graph_db: Union[Memgraph, Neo4j] = Provide[DI.graph_db],
    ) -> list[BaseNode]:
        """Get all artifacts of a specific type from this operation."""
        query = """
        MATCH (op:Operation {uid: $op_uid})-[:CREATED_ARTIFACT]->(n)
        WHERE $node_type IN labels(n)
        RETURN n
        """
        results = list(graph_db.execute_and_fetch(query, {
            "op_uid": operation.uid,
            "node_type": node_type,
        }))
        return [r["n"] for r in results]

    @inject
    def cleanup_failed(
        self,
        agent_id: str,
        graph_db: Union[Memgraph, Neo4j] = Provide[DI.graph_db],
    ) -> int:
        """Clean up failed operations and their orphaned artifacts."""
        # Delete failed operations for this agent together with their artifacts.
        # count(DISTINCT op) avoids counting an operation once per artifact row.
        query = """
        MATCH (op:Operation {created_by: $agent_id, status: 'failed'})
        OPTIONAL MATCH (op)-[:CREATED_ARTIFACT]->(artifact)
        DETACH DELETE op, artifact
        RETURN count(DISTINCT op) AS deleted_count
        """
        results = list(graph_db.execute_and_fetch(query, {"agent_id": agent_id}))
        return results[0]["deleted_count"] if results else 0

Phase 2: Integration with Existing Operations

2.1 Update ThinkActionReflection

# Simplified example of updated think() method

async def think(
    self,
    focus: WheelSegment,
    agent_id: str = "default"
) -> list[Transition]:
    wu = self._wheel.wisdom_unit_at(focus)
    op_manager = OperationManager()

    # Start or resume operation
    operation = op_manager.start(
        agent_id=agent_id,
        operation_type="action_reflection",
        target_uid=wu.uid,
        target_type="WisdomUnit",
    )

    try:
        # Check for cached LLM results (resume case)
        cached = op_manager.get_cached_results(operation)

        if cached:
            dc_deck_dto = DialecticalComponentsDeckDto(**cached["dc_deck"])
            reciprocal_sol_dto = ReciprocalSolutionDto(**cached["reciprocal"])
        else:
            # Do LLM work (expensive)
            dc_deck_dto, reciprocal_sol_dto = await asyncio.gather(
                self.action_reflection(focus=wu),
                self.reciprocal_solution(focus=wu)
            )
            # Cache results immediately
            op_manager.cache_results(operation, {
                "dc_deck": dc_deck_dto.dict(),
                "reciprocal": reciprocal_sol_dto.dict(),
            })

        # Check what artifacts already exist (resume case)
        existing_transformations = op_manager.get_artifacts_by_type(
            operation, "Transformation"
        )
        existing_transitions = op_manager.get_artifacts_by_type(
            operation, "Transition"
        )

        # Create only what's missing...
        if not existing_transformations:
            transformation = Transformation()
            transformation.save()
            op_manager.track_artifact(operation, transformation)
        else:
            transformation = existing_transformations[0]

        # ... continue with remaining work, tracking each artifact ...

        # Mark complete when all done
        op_manager.complete(operation)

        return transitions

    except Exception as e:
        op_manager.fail(operation, str(e))
        raise
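
The "create only what's missing" step repeats for every artifact type, so it could be factored into a small helper. The sketch below is illustrative only: get_or_create is a hypothetical function, and a plain list stands in for the results of get_artifacts_by_type and for track_artifact so that the example runs without a graph database.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def get_or_create(
    existing: list[T],
    create: Callable[[], T],
    track: Callable[[T], None],
) -> T:
    """Return the first existing artifact, or create and track a new one."""
    if existing:
        return existing[0]
    node = create()
    track(node)  # link to the operation right after creation
    return node


# Stand-in for the artifacts already linked to the operation.
tracked: list[str] = []

first = get_or_create(list(tracked), lambda: "Transformation#1", tracked.append)
second = get_or_create(list(tracked), lambda: "Transformation#2", tracked.append)
```

The second call returns the artifact created by the first call instead of creating another one. Note that a window remains between creating a node and tracking it: if the process dies between those two steps, the node exists but is not linked to the operation, so a resume would not find it.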

2.2 Update other operations similarly

Apply same pattern to:

  • think_polarity.py - Polarity analysis
  • think_causality.py - Causality analysis
  • think_synthesis.py - Synthesis generation
  • think_constructive_convergence_auditor.py - Auditing

Phase 3: Multi-Agent Query Support

3.1 Add query methods for agent coordination

# Additional methods on OperationManager

@inject
def get_my_pending_operations(
    self,
    agent_id: str,
    graph_db: Union[Memgraph, Neo4j] = Provide[DI.graph_db],
) -> list[Operation]:
    """Get all pending operations for this agent."""
    query = """
    MATCH (op:Operation {created_by: $agent_id, status: 'pending'})
    RETURN op
    ORDER BY op.started_at
    """
    results = list(graph_db.execute_and_fetch(query, {"agent_id": agent_id}))
    return [r["op"] for r in results]

@inject
def is_target_being_worked_on(
    self,
    target_uid: str,
    operation_type: str,
    exclude_agent: Optional[str] = None,
    graph_db: Union[Memgraph, Neo4j] = Provide[DI.graph_db],
) -> bool:
    """Check if another agent is working on this target."""
    query = """
    MATCH (op:Operation {
        target_uid: $target_uid,
        operation_type: $op_type,
        status: 'pending'
    })
    WHERE op.created_by <> $exclude_agent OR $exclude_agent IS NULL
    RETURN count(op) > 0 as in_progress
    """
    results = list(graph_db.execute_and_fetch(query, {
        "target_uid": target_uid,
        "op_type": operation_type,
        "exclude_agent": exclude_agent,
    }))
    return results[0]["in_progress"] if results else False
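
The filter logic of the conflict check, including the optional exclude_agent, can be illustrated in plain Python. This is an in-memory sketch with made-up sample data, not a replacement for the Cypher query; the list of dicts is a hypothetical stand-in for Operation nodes.

```python
from typing import Optional

# Hypothetical sample data standing in for Operation nodes.
OPS = [
    {"created_by": "agent-a", "operation_type": "action_reflection",
     "target_uid": "wu-1", "status": "pending"},
    {"created_by": "agent-b", "operation_type": "action_reflection",
     "target_uid": "wu-2", "status": "complete"},
]


def is_target_being_worked_on(
    target_uid: str,
    operation_type: str,
    exclude_agent: Optional[str] = None,
) -> bool:
    """True if a pending operation of this type exists on the target,
    ignoring operations created by exclude_agent when it is given."""
    return any(
        op["target_uid"] == target_uid
        and op["operation_type"] == operation_type
        and op["status"] == "pending"
        and (exclude_agent is None or op["created_by"] != exclude_agent)
        for op in OPS
    )


anyone_on_wu1 = is_target_being_worked_on("wu-1", "action_reflection")
others_on_wu1 = is_target_being_worked_on("wu-1", "action_reflection", exclude_agent="agent-a")
```

With this data, anyone_on_wu1 is True and others_on_wu1 is False, because the only pending operation on wu-1 belongs to agent-a. The check is advisory: two agents can both see no pending operation and start at the same time.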

Usage Examples

Basic Usage (Single Agent)

# Agent does work, fails, resumes automatically

# First run - fails after creating transformation
await consultant.think(focus=segment, agent_id="agent-1")
# Creates: Operation(pending), Transformation, but crashes before Transitions

# Second run - resumes from cached LLM results
await consultant.think(focus=segment, agent_id="agent-1")
# Finds pending Operation, uses cached results, creates missing Transitions
# Marks Operation complete

Multi-Agent Coordination

# Two agents working on different WUs
agent_a_task = consultant_a.think(focus=segment_1, agent_id="agent-a")
agent_b_task = consultant_b.think(focus=segment_2, agent_id="agent-b")
await asyncio.gather(agent_a_task, agent_b_task)

# Each agent tracks its own operations independently
# No collision even if working on same wheel

Checking for Conflicts

# Before starting expensive work, check if someone else is already on it
op_manager = OperationManager()

if op_manager.is_target_being_worked_on(wu.uid, "action_reflection", exclude_agent="me"):
    # Another agent is already processing this WU
    # Could wait, skip, or coordinate
    pass

Cleanup Failed Operations

# Periodic cleanup of failed operations
op_manager = OperationManager()
deleted = op_manager.cleanup_failed(agent_id="agent-1")
print(f"Cleaned up {deleted} failed operations")

Migration Strategy

Phase 1: Non-Breaking Addition

  1. Add Operation node and OperationManager (no changes to existing code)
  2. Add tests for operation lifecycle

Phase 2: Opt-In Integration

  1. Add agent_id parameter to consultant methods (default="default")
  2. Integrate OperationManager into ThinkActionReflection behind feature flag
  3. Test with single operation type

Phase 3: Full Rollout

  1. Remove feature flag, enable for all operations
  2. Integrate into remaining consultant classes
  3. Add multi-agent coordination queries
  4. Document agent ID conventions

Graph Schema Changes

New Node Type

(:Operation {
    uid: String,           -- Unique identifier (UUID)
    operation_type: String, -- "action_reflection", "synthesis", etc.
    status: String,        -- "pending", "complete", "failed"
    started_at: DateTime,
    completed_at: DateTime?,
    created_by: String,    -- Agent identifier
    target_uid: String,    -- Target node UID
    target_type: String,   -- Target node type
    cached_results: String?, -- JSON serialized LLM results
    error_message: String?  -- Error if failed
})

New Relationship Type

(:Operation)-[:CREATED_ARTIFACT]->(:BaseNode)

Index Recommendations

CREATE INDEX ON :Operation(created_by, status);
CREATE INDEX ON :Operation(target_uid, operation_type, status);

Considerations

What This Solves

  • Partial failures: Resume from where you left off
  • Duplicate prevention: Idempotent by checking existing operation
  • LLM cost savings: Cached results survive failures
  • Multi-agent: Each agent's work is isolated and trackable
  • Debugging: Clear audit trail of what each agent did

What This Doesn't Solve

  • Concurrent writes to same node: Still need application-level coordination
  • Cross-agent dependencies: If agent B depends on agent A's output, need separate coordination
  • Long-running operation timeouts: No built-in timeout/stale detection (could add)

Future Enhancements

  1. Timeout detection: Mark operations stale after N minutes
  2. Operation dependencies: Model "this op depends on that op completing"
  3. Distributed locking: Prevent concurrent work on same target
  4. Operation history: Keep completed operations for audit trail (vs deleting)
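
Timeout detection could start from the started_at field that every operation already carries. A minimal sketch, assuming a 30-minute threshold (the issue only says "N minutes") and hypothetical names is_stale and DEFAULT_MAX_AGE:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the proposal leaves N unspecified.
DEFAULT_MAX_AGE = timedelta(minutes=30)


def is_stale(
    started_at: datetime,
    now: datetime,
    max_age: timedelta = DEFAULT_MAX_AGE,
) -> bool:
    """True when a pending operation started more than max_age before now."""
    return now - started_at > max_age


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
pending_started = {
    "op-1": datetime(2025, 1, 1, 11, 50, tzinfo=timezone.utc),  # 10 minutes old
    "op-2": datetime(2025, 1, 1, 10, 0, tzinfo=timezone.utc),   # 2 hours old
}
stale = [uid for uid, started in pending_started.items() if is_stale(started, now)]
```

Only op-2 ends up in stale. Both timestamps must be of the same kind: subtracting a naive datetime (such as one from datetime.utcnow()) from a timezone-aware one raises TypeError.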

References

  • Saga Pattern: Distributed transaction management in microservices
  • Idempotency Keys: Stripe, AWS - ensuring exactly-once processing
  • Event Sourcing: Store events, derive state, replay on failure
  • Kubernetes Controller Pattern: Desired state vs actual state reconciliation
