CoveritLabs/coverit-docgen

Project Overview

coverit-docgen is a background document-generation and semantic-labeling service.
Its primary implemented workflow incrementally labels UI states and transitions stored as a session graph in Neo4j.
It generates human-readable page names, descriptions, element names, and action descriptions from recorded URLs, HTML snapshots, geometry, and Playwright locators.

Tech Stack

Python 3.10+; production image uses Python 3.11.
ARQ async worker and cron scheduling over Redis.
Neo4j async driver for session graphs and labeling status.
PostgreSQL with SQLAlchemy async and asyncpg; ORM models exist for labeled artifacts.
Pydantic and pydantic-settings for data models and environment configuration.
Beautiful Soup for HTML parsing.
Playwright Chromium for resolving transition locators.
Docker and Docker Compose.
Standard-library unittest.

Contract Types

Install the generated Python package with uv using the coverit-contracts distribution name.
Import protobuf modules from the generated contracts namespace, for example from contracts.crawler.v1 import crawler_pb2.

Current Architecture

src/worker.py: ARQ entry point, lifecycle hooks, task registration, cron configuration, and early logging setup.
src/tasks/poller.py: atomically claims eligible Neo4j records and enqueues one graph-labeling job per session.
src/tasks/labeling.py: single-state, single-transition, and session-graph tasks with per-item failure isolation.
src/repositories/labeling_repo.py: Neo4j persistence boundary.
src/models/queries.py: centralized Cypher statements.
src/services/labeling/: page analysis, element naming, action descriptions, and Playwright-based transition labeling.
src/core/: settings, logging, Neo4j, Redis, and PostgreSQL lifecycle management.
src/services/bdd, src/services/guides, and src/services/video: placeholders only; no business logic is implemented yet.

Existing Data Models

Neo4j State node:
- session_id: owning recorded session.
- url, html: page snapshot inputs.
- name, description: generated labels.
- labeling_status: PENDING, QUEUED, COMPLETED, or absent.
- labeling_claim_id: temporary poll-specific ownership token.
Neo4j TRANSITION relationship connects two State nodes:
- locator_value: Playwright locator for the interacted element.
- name, action: generated semantic labels.
- labeling_status, labeling_claim_id: same status and claim fields as State nodes.
Pydantic models:
- CrawlerState, CrawlerTransition, CrawlerGraph.
- LabeledState, LabeledTransition, LabeledGraph.
- CrawlerGraph.skip_states identifies origin states loaded only as transition context.
PostgreSQL ORM models:
- labeled_states(state_id, name, description).
- labeled_elements(element_id, html_snippet, name, action).
The active labeling workflow currently reads and writes labels directly in Neo4j.

Features Already Implemented

Incremental graph polling:
- Claims only absent/PENDING records and changes them to QUEUED.
- Uses a unique UUID claim token generated once per poll query.
Session-isolated processing:
- State and transition graph fetches are scoped by session_id.
- Transitions require both endpoint states to belong to the session.
Fault-tolerant ARQ dispatch:
- Claims occur before enqueueing.
- Enqueue failure returns exactly the claimed IDs to PENDING.
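
The claim-before-enqueue ordering described above can be sketched with an in-memory stand-in for Neo4j. This is a minimal illustration, not the project's code: `InMemoryRepo`, `claim_pending`, `release_claim`, `poll_once`, and the `enqueue` callable are hypothetical names chosen for the example.

```python
import asyncio
import uuid


class InMemoryRepo:
    """Hypothetical stand-in for the Neo4j-backed repository."""

    def __init__(self, statuses):
        self.statuses = dict(statuses)  # record id -> status (None means absent)
        self.claims = {}                # record id -> claim token

    async def claim_pending(self, claim_id):
        # Claim only absent/PENDING records and mark them QUEUED.
        claimed = [rid for rid, st in self.statuses.items() if st in (None, "PENDING")]
        for rid in claimed:
            self.statuses[rid] = "QUEUED"
            self.claims[rid] = claim_id
        return claimed

    async def release_claim(self, ids, claim_id):
        # Return only records still QUEUED under this exact claim to PENDING.
        for rid in ids:
            if self.statuses[rid] == "QUEUED" and self.claims.get(rid) == claim_id:
                self.statuses[rid] = "PENDING"
                del self.claims[rid]


async def poll_once(repo, enqueue):
    claim_id = str(uuid.uuid4())  # fresh token for every poll run
    claimed = await repo.claim_pending(claim_id)
    if not claimed:
        return []
    try:
        await enqueue(claimed)    # stands in for the ARQ enqueue call
    except Exception:
        await repo.release_claim(claimed, claim_id)
        return []
    return claimed
```

Because the release step checks both the QUEUED status and the claim token, records queued by a different poll run are left alone when this run's enqueue fails.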
Per-item graph labeling:
- Successful records are immediately saved as COMPLETED.
- A failed record alone returns to PENDING; processing continues.
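
Per-item isolation amounts to wrapping each item in its own try/except rather than wrapping the whole loop. The sketch below assumes a plain dict of statuses and a hypothetical `label_one` coroutine; the real tasks persist through the repository instead.

```python
import asyncio


async def label_graph(item_ids, label_one, statuses):
    """Label each QUEUED item independently and return the IDs that failed."""
    failed = []
    for item_id in item_ids:
        if statuses.get(item_id) != "QUEUED":
            continue  # queued-only guard: skip COMPLETED or unclaimed items
        try:
            await label_one(item_id)
        except Exception:
            statuses[item_id] = "PENDING"    # only the failing item is rolled back
            failed.append(item_id)
        else:
            statuses[item_id] = "COMPLETED"  # saved immediately on success
    return failed
```

A failure on the second item leaves the first item COMPLETED and still lets the third item run.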
Single-item rollback:
- States and transitions are identified by Neo4j elementId; no redundant session lookup is performed.
Page analysis:
- Combines semantic URL paths, selected query parameters, fragments, title, h1, Open Graph tags, metadata, active navigation, and domain fallback.
- Filters numeric IDs, UUIDs, tokens, filenames, tracking parameters, pagination, and sorting.
- Produces deterministic names and descriptions capped at 160 characters.
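
As a rough illustration of the URL side of page analysis, the sketch below drops numeric IDs, UUIDs, and filenames from the path and caps the description at 160 characters. The function names, the regular expressions, and the description wording are assumptions made for this example; the project's analyzer also uses query parameters, fragments, and HTML signals that are not modeled here.

```python
import re
from urllib.parse import urlparse

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
FILENAME_RE = re.compile(r"\.[a-z0-9]{1,5}$", re.I)
MAX_DESCRIPTION = 160


def semantic_segments(url):
    """Return path segments that look like human-meaningful words."""
    segments = []
    for raw in urlparse(url).path.split("/"):
        if not raw or raw.isdigit() or UUID_RE.match(raw) or FILENAME_RE.search(raw):
            continue  # skip empty parts, numeric IDs, UUIDs, and filenames
        segments.append(raw.replace("-", " ").replace("_", " ").title())
    return segments


def describe(url):
    """Build a deterministic name and a capped description from the URL alone."""
    parsed = urlparse(url)
    name = " ".join(semantic_segments(url)) or parsed.hostname or "Page"
    description = f"{name} page at {parsed.path or '/'}"
    return {"name": name, "description": description[:MAX_DESCRIPTION]}
```

The same input always produces the same output, which is what keeps labels stable across repeated polls.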
Element contextual naming:
- Uses nearby meaningful elements when within a normalized 0.40 distance threshold.
- Uses one of nine absolute screen regions for distant or absent neighbors.
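
The two naming strategies can be sketched as follows. This README does not say how distance is normalized, so the example assumes it is the distance between element centers divided by the viewport diagonal; the 3x3 grid labels and all function names are likewise illustrative.

```python
import math

CONTEXT_DISTANCE_THRESHOLD = 0.40  # default value named in this README


def normalized_distance(a, b, viewport):
    """Distance between two (x, y) centers, scaled by the viewport diagonal."""
    width, height = viewport
    return math.hypot(a[0] - b[0], a[1] - b[1]) / math.hypot(width, height)


def screen_region(center, viewport):
    """Map a point to one of nine regions on a 3x3 grid."""
    width, height = viewport
    col = max(0, min(int(3 * center[0] / width), 2))
    row = max(0, min(int(3 * center[1] / height), 2))
    return f"{('top', 'middle', 'bottom')[row]} {('left', 'center', 'right')[col]}"


def context_for(center, neighbors, viewport):
    """neighbors is a list of (label, (x, y)); returns a context phrase."""
    scored = [
        (normalized_distance(center, pos, viewport), label)
        for label, pos in neighbors
    ]
    near = [pair for pair in scored if pair[0] <= CONTEXT_DISTANCE_THRESHOLD]
    if near:
        return f"near {min(near)[1]}"  # closest meaningful neighbor wins
    return f"in the {screen_region(center, viewport)}"
```

With a 1000x800 viewport, an element 100 pixels from a "Search" field is named relative to that field, while an element with no neighbors falls back to its grid region.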
Transition labeling:
- Uses Playwright Chromium to resolve and mark the locator in page HTML.
- Generates an element name, cleaned HTML snippet, and action description.
Logging:
- Console and rotating /app/logs/worker.log handlers.
- Application debug logging remains available.
- Neo4j debug/info output is suppressed; warnings and errors remain.
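
A logging setup with these properties can be approximated using only the standard library. In the sketch below, `setup_logging`, the format string, and the rotation sizes are assumptions; the `neo4j` logger name is the one used by the Neo4j Python driver.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

_configured = False


def setup_logging(log_dir="/app/logs", level=logging.DEBUG):
    """Attach console and rotating-file handlers to the root logger once."""
    global _configured
    root = logging.getLogger()
    if _configured:
        return root
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handlers = [
        logging.StreamHandler(),
        # Rotation limits are illustrative values, not the project's settings.
        RotatingFileHandler(Path(log_dir) / "worker.log", maxBytes=5_000_000, backupCount=3),
    ]
    for handler in handlers:
        handler.setFormatter(formatter)
        root.addHandler(handler)
    root.setLevel(level)
    # Suppress Neo4j driver debug/info output; warnings and errors still pass.
    logging.getLogger("neo4j").setLevel(logging.WARNING)
    _configured = True
    return root
```

Calling this before other project modules are imported keeps every logger created afterwards on the same handlers.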
Container support:
- Non-root production worker.
- Chromium and system dependencies installed.
- Persistent Compose volume for logs.
Automated coverage for query invariants, rollback behavior, async transitions, page analysis, contextual naming, logging, and enqueue failures.

Important Design Decisions

Neo4j is the source of truth for graph topology and labeling lifecycle.
Status lifecycle is NULL/PENDING -> QUEUED -> COMPLETED, with failures returning only the affected item to PENDING.
Claiming and status mutation happen in one Cypher query before ARQ dispatch.
A dynamic labeling_claim_id distinguishes records claimed by concurrent poll runs.
Neo4j elementId is the authoritative identifier for individual state and transition operations.
Graph-session boundaries remain mandatory for graph fetches, claims, and transition endpoint validation.
Labeling is deterministic and local; it does not call an external AI service.
Logging must be initialized before importing modules that create loggers.
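
To make the single-query claim concrete, here is one way such a statement could look. The actual Cypher lives in src/models/queries.py and is not reproduced in this README, so the query text, the `CLAIM_STATES` name, and the `claim_parameters` helper below are assumptions; only the property names, status values, and the `$claim_id` parameter come from the sections above.

```python
import uuid

# Hypothetical claim statement: select sessions with unlabeled states, then
# mark those states QUEUED and stamp them with this poll's claim token.
CLAIM_STATES = """
MATCH (s:State)
WHERE s.labeling_status IS NULL OR s.labeling_status = 'PENDING'
WITH DISTINCT s.session_id AS session_id
LIMIT $max_sessions
MATCH (s:State {session_id: session_id})
WHERE s.labeling_status IS NULL OR s.labeling_status = 'PENDING'
SET s.labeling_status = 'QUEUED',
    s.labeling_claim_id = $claim_id
RETURN session_id, collect(elementId(s)) AS state_ids
"""


def claim_parameters(max_sessions=5):
    """Build query parameters with a fresh claim token for this poll run."""
    return {"claim_id": str(uuid.uuid4()), "max_sessions": max_sessions}
```

Selecting and updating in one statement means the status change and the claim token are written in the same transaction, before any job is dispatched.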

Existing Constraints

Labeling operations and database access are asynchronous.
Playwright-dependent transition labeling must be awaited.
Missing transition HTML, locator metadata, locator matches, names, or actions are failures and must not be saved as completed.
Completed records must never be reclaimed or relabeled.
One failing item must not roll back successful or unrelated items.
ARQ enqueue failure must not leave records permanently QUEUED.
Neo4j indexes are recommended for State(session_id), State(labeling_status), composite state session/status lookup, and transition status.
max_sessions_per_poll and context_distance_threshold are settings; the current defaults are 5 and 0.40.

Coding Conventions In This Project

Use async functions for database, ARQ, and Playwright workflows.
Keep Cypher in src/models/queries.py.
Keep Neo4j access behind LabelingRepository.
Keep orchestration in src/tasks and semantic logic in src/services.
Use Pydantic models at service boundaries.
Use logging.getLogger(...); do not call basicConfig.
Use parameterized logging rather than interpolated strings where practical.
Raise explicit errors for invalid labeling inputs so callers can perform status rollback.
Tests use unittest, IsolatedAsyncioTestCase, and unittest.mock.
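
A test in this style might look like the sketch below, where the service raises on invalid input and the task-level caller performs the rollback. `build_labels`, `run_state_task`, and the repository method names are hypothetical and exist only to show the pattern with `IsolatedAsyncioTestCase` and `AsyncMock`.

```python
import unittest
from unittest.mock import AsyncMock


def build_labels(name):
    """Hypothetical service function: raise explicitly on invalid input."""
    if not name:
        raise ValueError("missing name")
    return {"name": name}


async def run_state_task(repo, state_id, name):
    """Hypothetical task: the caller rolls back when the service raises."""
    try:
        labels = build_labels(name)
    except ValueError:
        await repo.rollback_state(state_id)
        return False
    await repo.save_state(state_id, labels)
    return True


class RunStateTaskTests(unittest.IsolatedAsyncioTestCase):
    async def test_invalid_input_rolls_back(self):
        repo = AsyncMock()
        self.assertFalse(await run_state_task(repo, "state-1", ""))
        repo.rollback_state.assert_awaited_once_with("state-1")
        repo.save_state.assert_not_awaited()

    async def test_valid_input_is_saved(self):
        repo = AsyncMock()
        self.assertTrue(await run_state_task(repo, "state-1", "Login"))
        repo.save_state.assert_awaited_once_with("state-1", {"name": "Login"})
        repo.rollback_state.assert_not_awaited()
```

Mocking the repository keeps these tests independent of a running Neo4j instance.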

Things Future Features Must Be Compatible With

Preserve get_page_info(url, soup) -> {"name": ..., "description": ...}.
Preserve uppercase Neo4j status values and their lifecycle.
Preserve dynamic UUID claim ownership; never replace $claim_id with a fixed value.
Preserve queued-only completion and rollback guards.
Preserve per-item failure isolation.
Preserve Neo4j elementId identifiers for individual state and transition operations.
Preserve session scoping for graph-level operations and transition endpoint validation.
Preserve ARQ task names registered in WorkerSettings.
Preserve early logging initialization, Neo4j warning-level filtering, rotating file logging, and /app/logs persistence.
Production images must include Playwright Chromium and run as the non-root docgen user.
