coverit-docgen is a background document-generation and semantic-labeling service.
Its primary implemented workflow incrementally labels UI states and transitions stored as a session graph in Neo4j.
It generates human-readable page names, descriptions, element names, and action descriptions from recorded URLs, HTML snapshots, geometry, and Playwright locators.
Python 3.10+; production image uses Python 3.11.
ARQ async worker and cron scheduling over Redis.
Neo4j async driver for session graphs and labeling status.
PostgreSQL with SQLAlchemy async and asyncpg; ORM models exist for labeled artifacts.
Pydantic and pydantic-settings for data models and environment configuration.
Beautiful Soup for HTML parsing.
Playwright Chromium for resolving transition locators.
Docker and Docker Compose.
Standard-library unittest.
Install the generated Python contracts package with uv under its distribution name, coverit-contracts.
Import protobuf modules from the generated contracts namespace, for example from contracts.crawler.v1 import crawler_pb2.
src/worker.py: ARQ entry point, lifecycle hooks, task registration, cron configuration, and early logging setup.
src/tasks/poller.py: atomically claims eligible Neo4j records and enqueues one graph-labeling job per session.
src/tasks/labeling.py: single-state, single-transition, and session-graph tasks with per-item failure isolation.
src/repositories/labeling_repo.py: Neo4j persistence boundary.
src/models/queries.py: centralized Cypher statements.
src/services/labeling/: page analysis, element naming, action descriptions, and Playwright-based transition labeling.
src/core/: settings, logging, Neo4j, Redis, and PostgreSQL lifecycle management.
src/services/bdd, src/services/guides, and src/services/video: placeholders only; no business logic is implemented yet.
Neo4j State node:
session_id: owning recorded session.
url, html: page snapshot inputs.
name, description: generated labels.
labeling_status: PENDING, QUEUED, COMPLETED, or absent.
labeling_claim_id: temporary poll-specific ownership token.
Neo4j TRANSITION relationship connects two State nodes:
locator_value: Playwright locator for the interacted element.
name, action: generated semantic labels.
labeling_status, labeling_claim_id: same status and claim fields as on State nodes.
Pydantic models:
CrawlerState, CrawlerTransition, CrawlerGraph.
LabeledState, LabeledTransition, LabeledGraph.
CrawlerGraph.skip_states identifies origin states loaded only as transition context.
PostgreSQL ORM models:
labeled_states(state_id, name, description).
labeled_elements(element_id, html_snippet, name, action).
The active labeling workflow currently reads and writes labels directly in Neo4j.
Features Already Implemented
Incremental graph polling:
Claims only records whose labeling_status is absent or PENDING, and sets them to QUEUED.
Uses a unique UUID claim token generated once per poll query.
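The claim step above can be sketched as follows. This is an illustration only, not the project's actual query: the Cypher text, the constant CLAIM_STATES_SKETCH, and the helper new_claim_params are hypothetical. Only the status values, the labeling_claim_id property, and the $claim_id parameter come from this README.

```python
import uuid

# Hypothetical Cypher written for illustration. The real statements live in
# src/models/queries.py and may differ in shape, limits, and return values.
CLAIM_STATES_SKETCH = """
MATCH (s:State)
WHERE s.labeling_status IS NULL OR s.labeling_status = 'PENDING'
SET s.labeling_status = 'QUEUED',
    s.labeling_claim_id = $claim_id
RETURN elementId(s) AS state_id, s.session_id AS session_id
"""


def new_claim_params() -> dict:
    """Build query parameters for one poll run with a fresh UUID claim token."""
    return {"claim_id": str(uuid.uuid4())}
```

Because the token is generated once per poll run and passed as a parameter, two concurrent polls mark their records with different labeling_claim_id values.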
Session-isolated processing:
State and transition graph fetches are scoped by session_id.
Transitions require both endpoint states to belong to the session.
Fault-tolerant ARQ dispatch:
Claims occur before enqueueing.
Enqueue failure returns exactly the claimed IDs to PENDING.
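A minimal sketch of the claim-then-enqueue ordering with rollback, using an in-memory stand-in instead of Neo4j and ARQ. InMemoryRepo, failing_enqueue, and poll_once are hypothetical names chosen for this example and are not taken from the code base.

```python
import asyncio
from typing import Dict, List, Optional


class InMemoryRepo:
    """Hypothetical stand-in for the Neo4j persistence boundary."""

    def __init__(self, statuses: Dict[str, Optional[str]]) -> None:
        self.statuses = statuses

    async def claim(self) -> List[str]:
        # Claim only absent (None) or PENDING records and mark them QUEUED.
        claimed = [k for k, v in self.statuses.items() if v in (None, "PENDING")]
        for key in claimed:
            self.statuses[key] = "QUEUED"
        return claimed

    async def release(self, ids: List[str]) -> None:
        # Return exactly the given IDs to PENDING, and only if still QUEUED.
        for key in ids:
            if self.statuses.get(key) == "QUEUED":
                self.statuses[key] = "PENDING"


async def failing_enqueue(ids: List[str]) -> None:
    """Simulates a queue outage."""
    raise ConnectionError("queue unavailable")


async def poll_once(repo: InMemoryRepo, enqueue) -> bool:
    """Claim first, then enqueue; undo the claim if enqueueing fails."""
    claimed = await repo.claim()
    if not claimed:
        return False
    try:
        await enqueue(claimed)
    except Exception:
        await repo.release(claimed)
        return False
    return True
```

Running asyncio.run(poll_once(repo, failing_enqueue)) against a repo holding one absent, one PENDING, and one COMPLETED record leaves the first two PENDING and the third untouched.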
Per-item graph labeling:
Successful records are immediately saved as COMPLETED.
A failed record alone returns to PENDING; processing continues.
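Per-item isolation amounts to wrapping each record in its own try/except. The sketch below assumes a hypothetical label_all helper and a fake labeler; in the real workflow each status would be written to Neo4j as soon as it is known, rather than collected in a dictionary.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def label_all(
    items: Dict[str, str],
    label_one: Callable[[str], Awaitable[str]],
) -> Dict[str, str]:
    """Label each item independently and return the final status per item ID."""
    statuses: Dict[str, str] = {}
    for item_id, html in items.items():
        try:
            await label_one(html)
        except Exception:
            # Only the failing item goes back to PENDING; the loop continues.
            statuses[item_id] = "PENDING"
            continue
        statuses[item_id] = "COMPLETED"
    return statuses


async def fake_label(html: str) -> str:
    """Hypothetical labeler that rejects empty HTML."""
    if not html:
        raise ValueError("missing HTML")
    return "label"
```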
Single-item rollback:
States and transitions are identified by Neo4j elementId; no redundant session lookup is performed.
Page analysis:
Combines semantic URL paths, selected query parameters, fragments, title, h1, Open Graph tags, metadata, active navigation, and domain fallback.
Filters numeric IDs, UUIDs, tokens, filenames, tracking parameters, pagination, and sorting.
Produces deterministic names and descriptions capped at 160 characters.
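To illustrate the URL filtering and length cap described above, here is a simplified sketch. It covers only numeric IDs and UUIDs in the path. The function names, the regular expression, and the title-casing rule are assumptions made for this example and do not reflect the actual implementation.

```python
import re
from urllib.parse import urlsplit

# Matches canonical 8-4-4-4-12 hexadecimal UUIDs.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
MAX_DESCRIPTION_LENGTH = 160  # cap stated in this README


def semantic_segments(url: str) -> list:
    """Return human-readable path segments, skipping numeric IDs and UUIDs."""
    segments = []
    for raw in urlsplit(url).path.split("/"):
        if not raw or raw.isdigit() or UUID_RE.match(raw):
            continue
        segments.append(raw.replace("-", " ").replace("_", " ").title())
    return segments


def cap_description(text: str) -> str:
    """Truncate a description to at most MAX_DESCRIPTION_LENGTH characters."""
    if len(text) <= MAX_DESCRIPTION_LENGTH:
        return text
    return text[:MAX_DESCRIPTION_LENGTH].rstrip()
```

For example, semantic_segments("https://example.com/users/123/account-settings") yields ["Users", "Account Settings"]. The same input always produces the same output, which is what keeps the labels deterministic.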
Element contextual naming:
Uses nearby meaningful elements when within a normalized 0.40 distance threshold.
Uses one of nine absolute screen regions for distant or absent neighbors.
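The fallback logic above can be sketched as below. The README does not say how distances are normalized or how the nine regions are laid out, so this example assumes normalization by the viewport diagonal, a 3x3 grid, and an inclusive threshold comparison. All function names are hypothetical.

```python
import math

CONTEXT_DISTANCE_THRESHOLD = 0.40  # default stated in this README


def normalized_distance(a, b, width, height):
    """Distance between two points divided by the viewport diagonal (assumed)."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) / math.hypot(width, height)


def screen_region(x, y, width, height):
    """Map a point to one of nine regions of an assumed 3x3 grid."""
    rows = ("top", "middle", "bottom")
    cols = ("left", "center", "right")
    row = rows[min(int(3 * y / height), 2)]
    col = cols[min(int(3 * x / width), 2)]
    return f"{row} {col}"


def context_for(element, neighbor, width, height):
    """Prefer a nearby neighbor; otherwise fall back to the screen region."""
    if neighbor is not None:
        distance = normalized_distance(element, neighbor, width, height)
        if distance <= CONTEXT_DISTANCE_THRESHOLD:
            return "neighbor"
    return screen_region(element[0], element[1], width, height)
```

With a 1200x900 viewport, an element at (100, 100) with a neighbor at (150, 120) uses the neighbor, while the same element with a neighbor at (1100, 800), or with no neighbor at all, falls back to "top left".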
Transition labeling:
Uses Playwright Chromium to resolve and mark the locator in page HTML.
Generates an element name, cleaned HTML snippet, and action description.
Logging:
Console and rotating /app/logs/worker.log handlers.
Application debug logging remains available.
Neo4j debug/info output is suppressed; warnings and errors remain.
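A logging setup along these lines could look like the sketch below. The function name configure_logging, the format string, and the rotation limits are assumptions; the README only states that there is a console handler, a rotating worker.log file, and warning-level filtering for Neo4j. The "neo4j" logger name is the one used by the official Neo4j Python driver.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def configure_logging(log_dir: str, level: int = logging.DEBUG) -> logging.Logger:
    """Attach console and rotating-file handlers and quiet the Neo4j driver."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)

    # Rotation limits are illustrative values, not the project's settings.
    file_handler = RotatingFileHandler(
        Path(log_dir) / "worker.log", maxBytes=5_000_000, backupCount=3
    )
    file_handler.setFormatter(formatter)

    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(console)
    root.addHandler(file_handler)

    # Suppress Neo4j debug/info output; warnings and errors still pass.
    logging.getLogger("neo4j").setLevel(logging.WARNING)
    return root
```

In the container this would be called with /app/logs before any module that creates loggers is imported.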
Container support:
Non-root production worker.
Chromium and system dependencies installed.
Persistent Compose volume for logs.
Automated tests cover query invariants, rollback behavior, async transitions, page analysis, contextual naming, logging, and enqueue failures.
Important Design Decisions
Neo4j is the source of truth for graph topology and labeling lifecycle.
Status lifecycle is NULL/PENDING -> QUEUED -> COMPLETED, with failures returning only the affected item to PENDING.
Claiming and status mutation happen in one Cypher query before ARQ dispatch.
A dynamic labeling_claim_id distinguishes records claimed by concurrent poll runs.
Neo4j elementId is the authoritative identifier for individual state and transition operations.
Graph-session boundaries remain mandatory for graph fetches, claims, and transition endpoint validation.
Labeling is deterministic and local; it does not call an external AI service.
Logging must be initialized before importing modules that create loggers.
Labeling operations and database access are asynchronous.
Playwright-dependent transition labeling must be awaited.
Missing transition HTML, locator metadata, locator matches, names, or actions are failures and must not be saved as completed.
Completed records must never be reclaimed or relabeled.
One failing item must not roll back successful or unrelated items.
ARQ enqueue failure must not leave records permanently QUEUED.
Neo4j indexes are recommended for State(session_id), State(labeling_status), composite state session/status lookup, and transition status.
max_sessions_per_poll and context_distance_threshold are settings; the current defaults are 5 and 0.40.
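The status lifecycle described in this section can be summarized as a small transition table. This is a reading aid under the assumption that the lifecycle is exactly NULL/PENDING -> QUEUED -> COMPLETED plus the QUEUED -> PENDING failure path. ALLOWED_TRANSITIONS and can_transition are hypothetical names; in the project these rules are enforced in Cypher rather than in Python.

```python
from typing import Optional

# None stands for an absent labeling_status property.
ALLOWED_TRANSITIONS = {
    None: {"QUEUED"},
    "PENDING": {"QUEUED"},
    "QUEUED": {"COMPLETED", "PENDING"},
    "COMPLETED": set(),  # completed records are never reclaimed or relabeled
}


def can_transition(current: Optional[str], new: str) -> bool:
    """Return True if moving from `current` to `new` follows the lifecycle."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```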
Coding Conventions In This Project
Use async functions for database, ARQ, and Playwright workflows.
Keep Cypher in src/models/queries.py.
Keep Neo4j access behind LabelingRepository.
Keep orchestration in src/tasks and semantic logic in src/services.
Use Pydantic models at service boundaries.
Use logging.getLogger(...); do not call basicConfig.
Use parameterized logging rather than interpolated strings where practical.
Raise explicit errors for invalid labeling inputs so callers can perform status rollback.
Tests use unittest, IsolatedAsyncioTestCase, and unittest.mock.
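A short example of these conventions together: an async function that raises an explicit error on invalid input, uses parameterized logging, and is tested with IsolatedAsyncioTestCase and AsyncMock. The label_state function, the save_label method, and the test names are hypothetical and exist only for this sketch.

```python
import logging
import unittest
from unittest.mock import AsyncMock

logger = logging.getLogger(__name__)


async def label_state(repo, state_id: str, html: str) -> None:
    """Hypothetical task body: validate input, log, then persist a label."""
    if not html:
        # Explicit error so the caller can roll the status back to PENDING.
        raise ValueError("state has no HTML snapshot")
    logger.debug("Labeling state %s", state_id)  # parameterized, not an f-string
    await repo.save_label(state_id, "Example Page")


class LabelStateTests(unittest.IsolatedAsyncioTestCase):
    async def test_missing_html_is_rejected(self):
        repo = AsyncMock()
        with self.assertRaises(ValueError):
            await label_state(repo, "s1", "")
        repo.save_label.assert_not_awaited()

    async def test_label_is_saved(self):
        repo = AsyncMock()
        await label_state(repo, "s1", "<html></html>")
        repo.save_label.assert_awaited_once_with("s1", "Example Page")
```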
Things Future Features Must Be Compatible With
Preserve get_page_info(url, soup) -> {"name": ..., "description": ...}.
Preserve uppercase Neo4j status values and their lifecycle.
Preserve dynamic UUID claim ownership; never replace $claim_id with a fixed value.
Preserve the guards that allow completion and rollback only for records currently in QUEUED.
Preserve per-item failure isolation.
Preserve Neo4j elementId identifiers for individual state and transition operations.
Preserve session scoping for graph-level operations and transition endpoint validation.
Preserve ARQ task names registered in WorkerSettings.
Preserve early logging initialization, Neo4j warning-level filtering, rotating file logging, and /app/logs persistence.
Production images must include Playwright Chromium and run as the non-root docgen user.