test(e2e): add e2e test suite for knowledge and skills components by Iftach-Shoham · Pull Request #232 · cuga-project/cuga-agent

Iftach-Shoham · 2026-05-17T07:59:15Z

Summary

Closes #206.

Adds a new tests/e2e/ directory with full end-to-end test coverage for the knowledge and skills components, as requested in the issue. Tests are structured in two tiers per component. Tier 1 exercises the component API directly (no LLM and no graph, so it is fast and surgical). Tier 2 runs the real compiled CugaLiteGraph with a hermetic mock LLM and asserts on what actually reached the model.

What's covered

Knowledge (`test_knowledge_e2e.py`) — 12 tests

| Class | What it tests |
| --- | --- |
| `TestKnowledgeEngineLifecycle` | ingest → search → delete → re-ingest (stale chunk purge) |
| `TestKnowledgeAwareness` | `get_knowledge_summary()` output; `format_knowledge_context()` naming and special-char sanitization |
| `TestKnowledgeRagPath` | real `knowledge_search_knowledge` tool via `RealSearchKnowledgeToolProvider` calls `engine.search()` and returns ingested content; stale chunk not returned after replace |
| `TestKnowledgeCugaLiteIntegration` | agent-scoped knowledge in system prompt; session-scoped knowledge in system prompt; knowledge + skills coexistence (all assert that both filename and document content appear) |
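
The "stale chunk purge" invariant in the lifecycle row can be illustrated with a toy in-memory store. This is only a sketch of the property being asserted: `ToyKnowledgeStore` and its substring search are hypothetical stand-ins, not the `KnowledgeEngine` API or its vector search.

```python
class ToyKnowledgeStore:
    """Hypothetical in-memory stand-in; NOT the real KnowledgeEngine API."""

    def __init__(self) -> None:
        # Maps a filename to the list of text chunks ingested from it.
        self._chunks: dict[str, list[str]] = {}

    def ingest(self, filename: str, text: str) -> None:
        # Overwriting the entry drops every chunk from an earlier version,
        # which is the "stale chunk purge" behaviour the tests check for.
        self._chunks[filename] = [p.strip() for p in text.split("\n\n") if p.strip()]

    def delete(self, filename: str) -> None:
        self._chunks.pop(filename, None)

    def search(self, query: str) -> list[str]:
        # Naive case-insensitive substring match instead of embeddings.
        q = query.lower()
        return [c for chunks in self._chunks.values() for c in chunks if q in c.lower()]


def lifecycle_demo() -> tuple[list[str], list[str]]:
    store = ToyKnowledgeStore()
    store.ingest("policy.md", "Refunds take 30 days.")
    store.ingest("policy.md", "Refunds take 14 days.")  # re-ingest replaces the file
    return store.search("30 days"), store.search("14 days")
```

After the re-ingest, a query for the old text should return nothing while the new text is found, which is the shape of assertion the lifecycle test describes.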

Skills (`test_skills_e2e.py`) — 15 tests + 1 skip

| Class | What it tests |
| --- | --- |
| `TestSkillDiscovery` | `discover_skills()` finds single/multiple skills, preserves description, parses pip and npm requirements |
| `TestSkillRegistry` | `load_skill()` body order, `uv pip install` / `npm install` commands, mixed deps (STEP 1 before STEP 2), normalization hint |
| `TestSkillToolsAndBlock` | `create_skill_tools()` returns `load_skill`; block contains names and descriptions |
| `TestSkillsCugaLiteIntegration` | skill name in system prompt; `load_skill` bound to model in native tool-calling mode |
| `TestSkillsBlockedPaths` | `test_skill_executed_via_sub_agent`: skipped, blocked on #199 |

Shared infrastructure (`conftest.py`)

CaptureChatModel — hermetic mock LLM: records every message list and tool binding, replays scripted responses, raises on queue exhaustion (catches unexpected graph loops)
KnowledgeToolProvider — stub knowledge_search_knowledge tool (triggers the startswith("knowledge_") detection in cuga_lite_graph.py:1493) without requiring a running MCP server
RealSearchKnowledgeToolProvider — real knowledge_search_knowledge tool backed by engine.search(), covers the RAG retrieval path
knowledge_engine fixture — isolated fastembed + sqlite-vec engine in tmp_path, monkeypatched off any real DB
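
As a rough illustration of the record/replay/raise behaviour described for CaptureChatModel, here is a minimal framework-free sketch. `ScriptedCaptureModel` and its `invoke` / `bind_tools` methods are assumptions made for this example; the real class in `conftest.py` may be built on a chat-model base class and differ in detail.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScriptedCaptureModel:
    """Hypothetical stand-in for a capture-style mock LLM (not the PR's class)."""

    responses: deque = field(default_factory=deque)  # scripted replies, consumed in order
    calls: list = field(default_factory=list)        # every message list received
    bound_tools: list = field(default_factory=list)  # tool names from each bind_tools call

    def bind_tools(self, tools: list[Any]) -> "ScriptedCaptureModel":
        # Record tool names so a test can assert on what was bound to the model.
        self.bound_tools.append([getattr(t, "name", str(t)) for t in tools])
        return self

    def invoke(self, messages: list[dict]) -> str:
        # Copy the list so later mutation by the caller cannot change the record.
        self.calls.append(list(messages))
        if not self.responses:
            # An empty queue means the caller asked for more turns than scripted,
            # for example because of an unexpected loop.
            raise RuntimeError("scripted responses exhausted: unexpected extra LLM call")
        return self.responses.popleft()
```

A test would script one reply per expected model turn, run the code under test, and then inspect `calls[0]` for the system prompt. Raising on an empty queue turns a runaway loop into an immediate failure rather than a hang.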

Test plan

`uv run pytest tests/e2e/test_knowledge_e2e.py -v`: 12 passed
`uv run pytest tests/e2e/test_skills_e2e.py -v`: 15 passed, 1 skipped (expected; blocked on #199, "[Feature] Skills should be able to spawn sub-agents at runtime via use_sub_agents")
No real LLM, no API keys, no network required: fully hermetic

Summary by CodeRabbit

Tests
- Added end-to-end tests for knowledge ingestion, search, deletion, re-ingest, summaries, and RAG retrieval via tool boundaries.
- Added end-to-end tests for skills discovery, requirement parsing, load-order, registry/tool behavior, and graph integration asserting system prompts include discovered skills.
- Added shared e2e fixtures/helpers for isolated engine setup, task polling, skill file creation, and a mock chat model to capture system prompts.

Implements the full test coverage requested in issue #206.

**What's covered**

Knowledge (`tests/e2e/test_knowledge_e2e.py`):
- Tier 1 – KnowledgeEngine lifecycle: ingest → search → delete → replace cycle
- Tier 1 – Awareness path: get_knowledge_summary() output and format_knowledge_context() collection naming / sanitization
- Tier 1 – RAG retrieval path: RealSearchKnowledgeToolProvider calls engine.search() directly, asserts ingested content is returned by the tool
- Tier 2 – CugaLiteGraph integration: agent-scoped and session-scoped knowledge summaries injected into the system prompt; knowledge + skills coexistence

Skills (`tests/e2e/test_skills_e2e.py`):
- Tier 1 – Discovery: discover_skills() finds all skills, preserves description, parses pip and npm requirements
- Tier 1 – Registry: load_skill() body ordering, uv/npm install commands, mixed deps, normalization hint
- Tier 1 – Tools/block: create_skill_tools() returns load_skill; format_available_skills_block() lists names and descriptions
- Tier 2 – CugaLiteGraph integration: skills block appears in system prompt; load_skill bound to model in native tool-calling mode
- Blocked: test_skill_executed_via_sub_agent skipped pending #199

**Shared infrastructure (`tests/e2e/conftest.py`)**
- CaptureChatModel: hermetic mock LLM that records inputs and replays scripted responses
- KnowledgeToolProvider: stub search tool (triggers awareness detection path)
- RealSearchKnowledgeToolProvider: real search tool backed by KnowledgeEngine (covers RAG path)
- knowledge_engine fixture: isolated fastembed + sqlite-vec engine in tmp_path

Closes #206

coderabbitai · 2026-05-17T07:59:30Z

📝 Walkthrough

Walkthrough

Adds an end-to-end test harness (mock chat model, tool providers, helpers, fixture) and comprehensive E2E tests for knowledge ingest/search and skills discovery/integration, exercising both component behaviors and CugaLite graph prompt injection.

Changes

Knowledge and Skills E2E Tests

| Layer / File(s) | Summary |
| --- | --- |
| E2E test harness: mock LLM and tool providers `tests/e2e/conftest.py` | Adds `CaptureChatModel` to record/replay LLM messages and tool-provider classes (`MinimalToolProvider`, `KnowledgeToolProvider`, `RealSearchKnowledgeToolProvider`) for stubbed and real knowledge search tooling. |
| E2E test utilities and fixture `tests/e2e/conftest.py` | Adds `write_skill`, `poll_task`, `extract_system_content`, and an async `knowledge_engine` pytest fixture that provisions an isolated sqlite-backed KnowledgeEngine and performs warmup/teardown. |
| Knowledge component end-to-end tests `tests/e2e/test_knowledge_e2e.py` | Tier 1 tests for ingest/search/delete/replace, knowledge-summary and collection formatting; RAG/tool-boundary tests via `RealSearchKnowledgeToolProvider`; Tier 2 CugaLite integration tests asserting agent- and session-scoped knowledge (and combined knowledge+skill) are injected into system prompts. |
| Skills component end-to-end tests `tests/e2e/test_skills_e2e.py` | Tier 1 discovery and registry tests for `.cuga` skills, requirement parsing (pip/npm), `load_skill` output and formatting; Tier 2 CugaLite integration asserting skill names and native `load_skill` bindings appear in captured system prompts; includes a skipped placeholder test. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through tests with tiny paws,

Mocked replies and neat applause,
Ingested notes and skills that sing,
Prompts arranged for everything,
A cheerful rabbit signs the clause.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title clearly summarizes the main change: adding end-to-end tests for knowledge and skills components, matching the changeset's three test files and shared fixtures. |
| Linked Issues check | ✅ Passed | The PR fully addresses all coding requirements from issue `#206`: hermetic e2e test suites for knowledge and skills, knowledge lifecycle testing with isolated fixture, RAG retrieval validation, skill discovery/loading verification, shared test infrastructure, and handling of blocked features. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope of issue `#206`: test infrastructure (conftest.py) and two test modules (knowledge and skills e2e tests) with no unrelated modifications. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/e2e/test_knowledge_e2e.py (2)

4-7: ⚡ Quick win

Replace EN DASH characters to clear Ruff warnings.

Line 4, Line 6, Line 53, Line 107, Line 150, and Line 204 use –, which triggers RUF002/RUF003. Replace with plain - to keep lint clean.

Proposed diff

-  Tier 1 – component-level (no LLM, no graph): exercises KnowledgeEngine's public
+  Tier 1 - component-level (no LLM, no graph): exercises KnowledgeEngine's public
-  Tier 2 – graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts
+  Tier 2 - graph-level (CaptureChatModel): runs CugaLite with a mock LLM and asserts
-# Tier 1 – KnowledgeEngine lifecycle
+# Tier 1 - KnowledgeEngine lifecycle
-# Tier 1 – Knowledge awareness (prompt-injection path)
+# Tier 1 - Knowledge awareness (prompt-injection path)
-# Tier 1 – RAG retrieval path (tool boundary)
+# Tier 1 - RAG retrieval path (tool boundary)
-# Tier 2 – CugaLite graph integration
+# Tier 2 - CugaLite graph integration

Also applies to: 53-53, 107-107, 150-150, 204-204

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_knowledge_e2e.py` around lines 4 - 7, Replace all EN DASH
characters (–) with ASCII hyphens (-) in the test docstrings/descriptions to
silence Ruff RUF002/RUF003 warnings: update the strings such as "Tier 1 –
component-level (no LLM, no graph)" and "Tier 2 – graph-level
(CaptureChatModel)" and any other occurrences on the noted lines so they use "-"
instead of "–". Search for the literal "–" in tests/e2e/test_knowledge_e2e.py
(e.g., the Tier lines and the other flagged lines) and perform a simple
character substitution, then run the linter to confirm warnings are cleared.
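
The character substitution described above could be scripted along these lines. This is a minimal sketch: the helper names `replace_en_dashes` and `fix_file` are made up for illustration and are not part of the PR or of Ruff.

```python
from pathlib import Path

EN_DASH = "\u2013"  # the character Ruff flags under RUF002/RUF003


def replace_en_dashes(text: str) -> str:
    """Return text with every EN DASH replaced by an ASCII hyphen."""
    return text.replace(EN_DASH, "-")


def fix_file(path: Path) -> int:
    """Rewrite the file in place and return how many EN DASH characters it had."""
    original = path.read_text(encoding="utf-8")
    count = original.count(EN_DASH)
    if count:
        # Only touch the file when there is something to change.
        path.write_text(replace_en_dashes(original), encoding="utf-8")
    return count
```

Calling `fix_file(Path("tests/e2e/test_knowledge_e2e.py"))` and then re-running the linter would be one way to confirm that the warnings are gone.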

172-175: ⚡ Quick win

Avoid positional tool selection in RAG tests.

Using tools[0] makes these tests order-dependent. Select the knowledge tool by name to prevent brittle failures if provider ordering changes.

Proposed diff

         provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION)
         tools = await provider.get_all_tools()
-        knowledge_tool = tools[0]
+        knowledge_tool = next(
+            (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"),
+            None,
+        )
+        assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"
@@
         provider = RealSearchKnowledgeToolProvider(engine, _COLLECTION)
         tools = await provider.get_all_tools()
-        knowledge_tool = tools[0]
+        knowledge_tool = next(
+            (t for t in tools if getattr(t, "name", "") == "knowledge_search_knowledge"),
+            None,
+        )
+        assert knowledge_tool is not None, "knowledge_search_knowledge tool not found"

Also applies to: 193-195

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/test_knowledge_e2e.py` around lines 172 - 175, The test currently
selects a knowledge tool by position (tools[0]) which is order-dependent;
instead, locate the correct tool returned by
RealSearchKnowledgeToolProvider.get_all_tools() by matching its name/identifier
(e.g., filter for tool.name or tool.id that matches the expected knowledge tool)
and assign that result to knowledge_tool; make the same change for the other
occurrence that uses positional selection so both tests pick the tool by
explicit name rather than by index.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/conftest.py`:
- Around line 236-238: The teardown should guarantee engine.shutdown() runs even
if await engine.aclose() raises: wrap the await engine.aclose() call in a
try/finally (or call engine.shutdown() from a finally block) so that
engine.shutdown() is always invoked; if you need to preserve the original
exception, re-raise it after shutdown or log the aclose() error before
re-raising so resources are not leaked (refer to the engine.aclose() and
engine.shutdown() calls).

---

Nitpick comments:
In `@tests/e2e/test_knowledge_e2e.py`:
- Around line 4-7: Replace all EN DASH characters (–) with ASCII hyphens (-) in
the test docstrings/descriptions to silence Ruff RUF002/RUF003 warnings: update
the strings such as "Tier 1 – component-level (no LLM, no graph)" and "Tier 2 –
graph-level (CaptureChatModel)" and any other occurrences on the noted lines so
they use "-" instead of "–". Search for the literal "–" in
tests/e2e/test_knowledge_e2e.py (e.g., the Tier lines and the other flagged
lines) and perform a simple character substitution, then run the linter to
confirm warnings are cleared.
- Around line 172-175: The test currently selects a knowledge tool by position
(tools[0]) which is order-dependent; instead, locate the correct tool returned
by RealSearchKnowledgeToolProvider.get_all_tools() by matching its
name/identifier (e.g., filter for tool.name or tool.id that matches the expected
knowledge tool) and assign that result to knowledge_tool; make the same change
for the other occurrence that uses positional selection so both tests pick the
tool by explicit name rather than by index.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 364c908c-72b0-4424-b9c4-177311503a5e

📥 Commits

Reviewing files that changed from the base of the PR and between 5a38799 and bc13f53.

📒 Files selected for processing (4)

tests/e2e/__init__.py
tests/e2e/conftest.py
tests/e2e/test_knowledge_e2e.py
tests/e2e/test_skills_e2e.py

- conftest.py: wrap engine.aclose() in try/finally so shutdown() is always called even if aclose() raises (resource leak fix)
- test_knowledge_e2e.py: replace EN dash characters with ASCII hyphens in docstrings/comments to silence Ruff RUF002/RUF003
- test_knowledge_e2e.py: select knowledge_search_knowledge tool by name instead of positional index tools[0] to avoid order-dependent failures

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/test_skills_e2e.py`:
- Line 99: Replace hardcoded "/tmp" string literals in the test fixtures with a
temporary directory fixture; locate instances like
source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py and change them
to use the pytest tmp_path (or tempfile.gettempdir/tmp_path_factory) to
construct paths (e.g., str(tmp_path / name / "SKILL.md")), and update the other
occurrences on the same file (lines referencing "/tmp/...") similarly so all
test paths are built from the tmp_path/tempfile API instead of a hardcoded
"/tmp".


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba2ee880-4eb2-40ad-af65-317306217a57

📥 Commits

Reviewing files that changed from the base of the PR and between f433c8b and 7e2734f.

📒 Files selected for processing (3)

tests/e2e/conftest.py
tests/e2e/test_knowledge_e2e.py
tests/e2e/test_skills_e2e.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/e2e/test_knowledge_e2e.py
tests/e2e/conftest.py

coderabbitai · 2026-05-17T11:23:41Z

+            name=name,
+            description=f"{name} description",
+            body=body,
+            source=f"/tmp/{name}/SKILL.md",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace hardcoded /tmp literals to avoid Ruff S108 CI failures.

Line 99, Line 164, Line 176, Line 177, and Line 187 use hardcoded /tmp/... paths in test fixtures. Ruff flags these as errors (S108), which can block CI even though these are test-only strings.

Proposed patch

- source=f"/tmp/{name}/SKILL.md",
+ source=f"skills/{name}/SKILL.md",

- source="/tmp/SKILL.md",
+ source="skills/SKILL.md",

- SkillEntry("alpha", "Alpha skill", "## Body", "/tmp/a/SKILL.md", ()),
- SkillEntry("beta", "Beta skill", "## Body", "/tmp/b/SKILL.md", ()),
+ SkillEntry("alpha", "Alpha skill", "## Body", "skills/a/SKILL.md", ()),
+ SkillEntry("beta", "Beta skill", "## Body", "skills/b/SKILL.md", ()),

- entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "/tmp/g/SKILL.md", ())
+ entry = SkillEntry("gamma", "Gamma makes reports", "## Body", "skills/g/SKILL.md", ())

Also applies to: 164-164, 176-177, 187-187

🧰 Tools

🪛 Ruff (0.15.12)

[error] 99-99: Probable insecure usage of temporary file or directory: "/tmp/"

(S108)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/test_skills_e2e.py` at line 99, Replace hardcoded "/tmp" string literals in the test fixtures with a temporary directory fixture; locate instances like source=f"/tmp/{name}/SKILL.md" in tests/e2e/test_skills_e2e.py and change them to use the pytest tmp_path (or tempfile.gettempdir/tmp_path_factory) to construct paths (e.g., str(tmp_path / name / "SKILL.md")), and update the other occurrences on the same file (lines referencing "/tmp/...") similarly so all test paths are built from the tmp_path/tempfile API instead of a hardcoded "/tmp".
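
The proposed patch swaps in relative `skills/...` strings, while the prompt suggests building paths from a temporary directory instead. The second option might look like the sketch below. The `SkillEntry` dataclass here is a hypothetical stand-in that mirrors the five positional arguments seen in the diff, and its field names and the `make_entry` helper are assumptions, not the project's definitions.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SkillEntry:
    """Hypothetical stand-in; field names are assumed, not taken from the project."""

    name: str
    description: str
    body: str
    source: str
    requirements: tuple = ()


def make_entry(base_dir: Path, name: str, body: str = "## Body") -> SkillEntry:
    # The source path is derived from a caller-supplied directory
    # (pytest's tmp_path in a real test), so no temp-dir literal appears.
    return SkillEntry(name, f"{name} description", body, str(base_dir / name / "SKILL.md"))


def demo_source_suffix() -> str:
    # Outside pytest, tempfile provides an equivalent throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        entry = make_entry(Path(tmp), "alpha")
        return Path(entry.source).relative_to(tmp).as_posix()
```

Inside a test the call would simply be `make_entry(tmp_path, "alpha")`, with `tmp_path` injected by pytest.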

Iftach-Shoham requested a review from sami-marreed May 17, 2026 07:59

coderabbitai Bot reviewed May 17, 2026


Comment thread tests/e2e/conftest.py Outdated

Iftach Shoham added 2 commits May 17, 2026 14:16

style(e2e): apply ruff format to e2e test files

7e2734f

coderabbitai Bot reviewed May 17, 2026


sami-marreed marked this pull request as draft May 17, 2026 12:35

Iftach-Shoham wants to merge 3 commits into main from test/206-e2e-test-for-knowledge-and-skills
