GSOC-Week1-Module_A #920

Open
ParthAggarwal16 wants to merge 10 commits into OWASP:main from ParthAggarwal16:week_1-2-harvester-validation-and-docs
Conversation

@ParthAggarwal16

Contributor

Summary

This PR adds the initial repository configuration loading and validation layer for the harvester pipeline.

Added

  • YAML-based repository configuration loading
  • Pydantic schema validation for repository configs
  • Semantic repository validation layer
  • Validation fixtures and pytest coverage
  • Sample harvester repository configuration
  • Repository validation utilities and exports
  • Exclude pattern support scaffold

Validation Coverage

Config Loader Validation

  • Missing repository ID
  • Invalid chunk size
  • Invalid YAML syntax
  • Missing config file
  • Empty owner validation
  • Invalid chunking strategy
  • Invalid polling mode

Repository Semantic Validation

  • Duplicate repository IDs
  • Duplicate repositories
  • Duplicate include paths

Test Plan

Executed:

    python3 -m pytest application/tests/harvester_test -v
    python3 -m coverage report
    python3 -m black application/utils/harvester application/tests/harvester_test

All tests passing.

Notes for Reviewers

  • Validation is intentionally split into:

    • schema validation (config_loader.py)
    • semantic/business-rule validation (repos_validator.py)
  • Repository uniqueness checks are case-insensitive.

  • exclude_patterns.txt is currently scaffolded for future integration into harvesting logic.
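
A minimal sketch of what the semantic stage with case-insensitive uniqueness could look like. This is not the code in repos_validator.py: the `Repo` dataclass and the `check_duplicates` helper are hypothetical stand-ins, and the error messages are assumptions apart from the "Duplicate repository id" prefix.

```python
from dataclasses import dataclass, field


class RepositoryValidationError(ValueError):
    """Raised when a cross-repository (semantic) rule is violated."""


@dataclass
class Repo:
    # Hypothetical stand-in for the PR's RepositoryConfig model.
    id: str
    owner: str
    repo: str
    include: list[str] = field(default_factory=list)


def check_duplicates(repos: list[Repo]) -> None:
    """Reject duplicate ids, owner/repo pairs, and include paths, ignoring case."""
    seen_ids: set[str] = set()
    seen_pairs: set[tuple[str, str]] = set()
    for r in repos:
        # casefold() makes "ASVS" and "asvs" compare equal.
        key = r.id.casefold()
        if key in seen_ids:
            raise RepositoryValidationError(f"Duplicate repository id: {r.id}")
        seen_ids.add(key)

        pair = (r.owner.casefold(), r.repo.casefold())
        if pair in seen_pairs:
            raise RepositoryValidationError(
                f"Duplicate repository: {r.owner}/{r.repo}"
            )
        seen_pairs.add(pair)

        # Include paths are checked per repository.
        folded = [p.casefold() for p in r.include]
        if len(set(folded)) != len(folded):
            raise RepositoryValidationError(f"Duplicate include path in {r.id}")
```

Keeping these checks out of the schema layer means each model only validates its own fields, while rules that span several repositories live in one place.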

ParthAggarwal16#3 (comment)
this comment has some good coderabbit diagrams if you guys wanna have a look at it

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: c3fe4d73-4c31-47c9-9959-49e8bc5913eb

📥 Commits

Reviewing files that changed from the base of the PR and between c7ecbf0 and 0be6df8.

📒 Files selected for processing (2)
  • application/tests/harvester_test/test_repos_validator.py
  • application/utils/harvester/__init__.py
✅ Files skipped from review due to trivial changes (1)
  • application/utils/harvester/__init__.py

Summary by CodeRabbit

  • Tests

    • Added comprehensive tests covering configuration loading and validation, including many fixture scenarios for valid and invalid repository configs (missing/invalid fields, syntax errors, duplicates, and edge cases).
  • New Features

    • Added YAML-based repository configuration support with strict schema validation (path rules, chunking constraints, polling settings) and clear config-loading errors.
    • Added repository-level semantic validation to catch duplicate IDs, duplicate repositories, and duplicate include paths.
    • Added default repository list and exclude-patterns for harvesting.

Walkthrough

Adds a harvester configuration subsystem: Pydantic schemas, YAML loader with error mapping, semantic validators for duplicates, package exports, example repo configs and exclude patterns, and pytest fixtures/tests covering valid and many invalid cases.
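
The "error mapping" idea can be sketched roughly as follows. This is an illustration only, not the PR's config_loader.py: it parses JSON from the standard library as a stand-in for YAML so that it runs without extra dependencies, and `load_config_sketch` and its messages are hypothetical. The missing-file case raises FileNotFoundError, which matches what the test suite in this PR expects.

```python
import json
from pathlib import Path


class ConfigLoaderError(Exception):
    """Single exception type for every 'the config is unusable' failure."""


def load_config_sketch(path) -> dict:
    """Load a config file and map low-level errors onto ConfigLoaderError."""
    path = Path(path)
    if not path.is_file():
        # A missing file is reported as-is rather than wrapped.
        raise FileNotFoundError(f"Config file not found: {path}")

    try:
        # Stand-in parser: the PR parses YAML, JSON is used here for portability.
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # 'from exc' keeps the parser error reachable via __cause__.
        raise ConfigLoaderError(f"Invalid syntax in {path}") from exc

    if not isinstance(data, dict) or not data.get("repositories"):
        raise ConfigLoaderError(f"{path} must define a non-empty 'repositories' list")
    return data
```

Chaining with `from exc` lets callers catch one exception type while the original parser message stays available for debugging.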

Changes

Repository Configuration Harvester Framework

| Layer | File(s) | Summary |
| --- | --- | --- |
| Schema and data contracts | application/utils/harvester/schemas.py | PathRules, ChunkingConfig, PollingConfig, RepositoryConfig, and ReposFile define the configuration structure with strict field validation, token size constraints (overlap_tokens < max_tokens), and required non-empty repository lists. |
| Configuration loading and error handling | application/utils/harvester/config_loader.py | ConfigLoaderError exception and load_repo_config() function parse YAML files with path validation and convert YAML syntax errors and Pydantic validation failures into consistent exception types with chained causes. |
| Semantic duplicate and constraint validation | application/utils/harvester/repos_validator.py | RepositoryValidationError exception and validate_repositories() function perform case-insensitive checks for duplicate repository IDs, duplicate owner/repo pairs, and duplicate include paths. |
| Public API package exports | application/utils/harvester/__init__.py | Re-exports schema models, config loading functions/errors, and validation functions/errors via explicit __all__ to define the public API surface. |
| Test fixture configurations | application/tests/harvester_test/__init__.py, application/tests/harvester_test/fixtures/* | YAML test fixtures covering valid configs and invalid scenarios: empty/missing fields, duplicate IDs/paths/repositories, invalid numeric values, invalid enums, and malformed YAML syntax. |
| Config loader test cases | application/tests/harvester_test/test_config_loader.py | Test module covering load_repo_config() with valid-config success case and failure cases for missing required fields, invalid YAML, unsupported enum values, invalid numeric constraints, and missing files. |
| Repository validator test cases | application/tests/harvester_test/test_repos_validator.py | Test module covering validate_repositories() with failure cases for duplicate repository IDs, duplicate repositories, and duplicate include paths. |
| Example configurations and patterns | application/utils/harvester/repos.yaml, application/utils/harvester/exclude_patterns.txt | Sample repos.yaml defining two OWASP repositories with markdown chunking and incremental polling; exclude_patterns.txt defines glob patterns for build artifacts and static file types. |
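
To illustrate the token size constraint mentioned for the schema layer (overlap_tokens < max_tokens), here is a small sketch. The PR implements this with Pydantic models; the version below swaps in a standard-library dataclass so it runs without dependencies. The class name `ChunkingSketch`, the default of 0 for `overlap_tokens`, and the error messages are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingSketch:
    # Hypothetical stand-in for the PR's ChunkingConfig (which uses Pydantic).
    strategy: str
    max_tokens: int
    overlap_tokens: int = 0  # default value is an assumption

    def __post_init__(self) -> None:
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be a positive integer")
        if self.overlap_tokens < 0:
            raise ValueError("overlap_tokens must not be negative")
        # Cross-field rule: the overlap must be strictly smaller than the chunk.
        if self.overlap_tokens >= self.max_tokens:
            raise ValueError("overlap_tokens must be smaller than max_tokens")
```

For example, `ChunkingSketch("markdown", 1200, 100)` is accepted, while `ChunkingSketch("markdown", 100, 100)` raises ValueError because the overlap would consume the whole chunk.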

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'GSOC-Week1-Module_A' is vague and generic, using non-descriptive identifiers that do not convey the actual changes made (repository configuration loading and validation). | Consider a more descriptive title like 'Add repository configuration loading and validation layer' that clearly indicates the main purpose of the changeset. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, clearly detailing the addition of configuration loading, validation infrastructure, test coverage, and implementation approach. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ParthAggarwal16

Contributor Author

ParthAggarwal16#3 (comment)
also this one has the entire HLA for more context

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🧹 Nitpick comments (2)
application/tests/harvester_test/test_config_loader.py (1)

43-45: ⚡ Quick win

Use Path object for consistency.

Line 45 passes a string literal while all other tests use Path objects from FIXTURES_DIR. For consistency and clarity, consider using FIXTURES_DIR / "does_not_exist.yaml" or another Path-based approach.

♻️ Proposed fix for consistency
 def test_missing_config_file():
-    with pytest.raises(FileNotFoundError):
-        load_repo_config("does_not_exist.yaml")
+    config_path = FIXTURES_DIR / "does_not_exist.yaml"
+    with pytest.raises(FileNotFoundError):
+        load_repo_config(config_path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/harvester_test/test_config_loader.py` around lines 43 - 45,
Replace the string literal path in test_missing_config_file with a Path-based
fixture to match the rest of the tests: call load_repo_config with FIXTURES_DIR
/ "does_not_exist.yaml" instead of "does_not_exist.yaml" so the test uses a Path
object (refer to test_missing_config_file, load_repo_config and FIXTURES_DIR).
application/tests/harvester_test/test_repos_validator.py (1)

14-45: ⚡ Quick win

Consider testing case-insensitive duplicate detection.

The validator uses casefold() for case-insensitive comparison (per the relevant code snippet), but there's no test verifying that repository IDs like "ASVS" and "asvs" are correctly identified as duplicates. Adding such a test would verify this important edge case.

🔤 Proposed case-insensitive duplicate test

Create a fixture duplicate_repo_ids_case_insensitive.yaml:

repositories:
  - id: ASVS
    type: github
    owner: OWASP
    repo: ASVS
    paths:
      include:
        - "**/*.md"
    chunking:
      strategy: markdown
      max_tokens: 1200
    polling:
      mode: incremental
      interval_minutes: 60

  - id: asvs
    type: github
    owner: OWASP
    repo: CheatSheetSeries
    paths:
      include:
        - "**/*.md"
    chunking:
      strategy: markdown
      max_tokens: 1200
    polling:
      mode: incremental
      interval_minutes: 60

Then add the test:

def test_duplicate_repository_ids_case_insensitive():
    config_path = FIXTURES_DIR / "duplicate_repo_ids_case_insensitive.yaml"
    config = load_repo_config(config_path)
    
    with pytest.raises(
        RepositoryValidationError,
        match="Duplicate repository id",
    ):
        validate_repositories(config)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/tests/harvester_test/test_repos_validator.py` around lines 14 -
45, Add a case-insensitive duplicate-id test: create the fixture
duplicate_repo_ids_case_insensitive.yaml with two repositories whose ids differ
only by case (e.g., "ASVS" and "asvs"), then add a test function
test_duplicate_repository_ids_case_insensitive() that loads the fixture via
load_repo_config and asserts validate_repositories(config) raises
RepositoryValidationError with a "Duplicate repository id" match; this ensures
the validate_repositories code path that uses casefold() is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/tests/harvester_test/test_repos_validator.py`:
- Around line 14-45: Add a new success-case test to
application/tests/harvester_test/test_repos_validator.py that ensures
validate_repositories(config) does not raise for a known-good config: create a
test function (e.g., test_validate_valid_repositories) that loads FIXTURES_DIR /
"valid_repos.yaml" via load_repo_config and calls validate_repositories(config)
(no exception assertion needed); reference existing helpers load_repo_config and
the function validate_repositories so the test mirrors the failure tests' style.

In `@application/utils/harvester/__init__.py`:
- Around line 17-27: The __all__ export list is unsorted; to meet linting
standards, sort the existing symbols ("ChunkingConfig", "ConfigLoaderError",
"PathRules", "PollingConfig", "RepositoryConfig", "ReposFile",
"load_repo_config", "RepositoryValidationError", "validate_repositories") into
ascending alphabetical order and replace the current __all__ definition with
that sorted list.
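
"Alphabetical order" is ambiguous for a list that mixes CamelCase and snake_case names, so a quick sketch of two plausible orderings may help. Which one the project's linter actually expects is not stated in this thread; the snippet only shows what Python's built-in `sorted` produces for the symbols listed above.

```python
# Symbols taken from the review comment above.
exports = [
    "ChunkingConfig",
    "ConfigLoaderError",
    "PathRules",
    "PollingConfig",
    "RepositoryConfig",
    "ReposFile",
    "load_repo_config",
    "RepositoryValidationError",
    "validate_repositories",
]

# Plain code-point order: every uppercase-initial name sorts before lowercase ones.
by_code_point = sorted(exports)

# Case-insensitive order: functions interleave with the classes.
case_insensitive = sorted(exports, key=str.casefold)

print(by_code_point)
print(case_insensitive)
```

In both orderings "ReposFile" lands before "RepositoryConfig", because the names share the prefix "Repos" and "F"/"f" sorts before "i".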

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9f37ad90-960e-469e-bfbe-c6dda95e7b54

📥 Commits

Reviewing files that changed from the base of the PR and between e93ce92 and c7ecbf0.

📒 Files selected for processing (21)
  • application/tests/harvester_test/__init__.py
  • application/tests/harvester_test/fixtures/duplicate_include_paths.yaml
  • application/tests/harvester_test/fixtures/duplicate_repo_ids.yaml
  • application/tests/harvester_test/fixtures/duplicate_repositories.yaml
  • application/tests/harvester_test/fixtures/empty_include_paths.yaml
  • application/tests/harvester_test/fixtures/empty_owner.yaml
  • application/tests/harvester_test/fixtures/invalid_chunk_size.yaml
  • application/tests/harvester_test/fixtures/invalid_chunking_strategy.yaml
  • application/tests/harvester_test/fixtures/invalid_missing_id.yaml
  • application/tests/harvester_test/fixtures/invalid_polling_interval.yaml
  • application/tests/harvester_test/fixtures/invalid_polling_mode.yaml
  • application/tests/harvester_test/fixtures/invalid_yaml.yaml
  • application/tests/harvester_test/fixtures/valid_repos.yaml
  • application/tests/harvester_test/test_config_loader.py
  • application/tests/harvester_test/test_repos_validator.py
  • application/utils/harvester/__init__.py
  • application/utils/harvester/config_loader.py
  • application/utils/harvester/exclude_patterns.txt
  • application/utils/harvester/repos.yaml
  • application/utils/harvester/repos_validator.py
  • application/utils/harvester/schemas.py

@ParthAggarwal16

Contributor Author

@coderabbitai
0be6df8
this commit should address it

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh api repos/OWASP/OpenCRE/commits/0be6df87c7642908c76dc915c48b54c30227acff --jq '.commit.message + "\n---\n" + (.files // [] | map("  " + .filename + " [+" + (.additions|tostring) + "/-" + (.deletions|tostring) + "]\n  patch: " + (.patch // "")) | join("\n"))'

Length of output: 1054



@ParthAggarwal16 Confirmed — commit 0be6df8 addresses both concerns:

  1. Missing happy-path test: test_validate_valid_repositories() now covers the success case for validate_repositories(), loading valid_repos.yaml and asserting no exception is raised.
  2. __all__ ordering: RepositoryValidationError is now in correct alphabetical position in application/utils/harvester/__init__.py.

Both look good. Thanks for the fix!

