Skip to content

feat(cli): ingest local user-provided content into searchable corpora#28

Open
Vadelo wants to merge 3 commits into
deandevz:mainfrom
Vadelo:feat/issue-24-user-content-ingest
Open

feat(cli): ingest local user-provided content into searchable corpora#28
Vadelo wants to merge 3 commits into
deandevz:mainfrom
Vadelo:feat/issue-24-user-content-ingest

Conversation

@Vadelo

@Vadelo Vadelo commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR reworks kctx ingest to align with the existing King Context retrieval architecture instead of maintaining a parallel ingestion/chunking path.

The feature is now intentionally narrower and more predictable:

  • Markdown only
  • one section per file
  • metadata generated through the existing scraper enrichment flow

What changed

  • reduced ingestion scope to .md files only
  • removed the custom Markdown chunking pipeline
  • removed synthetic metadata generation
  • reused the existing scraper enrichment pipeline for:
    • keywords
    • use cases
    • tags
    • priority
  • kept only the source metadata that is currently useful:
    • source_type
    • source_file
  • removed ingest chunk flags from the CLI
  • removed module-level path rebinding from _cmd_ingest
  • updated docs and tests to match the reduced contract

Why this approach

The first version solved the feature in a parallel way, but it duplicated behavior that the project already has in the scraper pipeline.

This rework keeps the feature aligned with the project’s stronger invariant:

  • retrieval quality should come from the same enrichment model already used elsewhere
  • ingestion should stay small, predictable, and easy to maintain
  • the CLI should not invent a second metadata/chunking strategy

Behavior

kctx ingest now:

  • accepts a Markdown file or directory
  • ignores hidden files and common heavy directories
  • creates one enriched section per Markdown file
  • writes the resulting JSON into:
    • .king-context/data/<name>.json
    • or .king-context/data/research/<name>.json
  • optionally indexes it immediately into the local store

Testing

Validated with:

  • python -m py_compile src/context_cli/ingest.py src/context_cli/cli.py src/context_cli/indexer.py src/king_context/scraper/__init__.py tests/test_context_cli/test_ingest.py tests/test_context_cli/test_cli.py
  • python -c "import context_cli.ingest; print('ingest-import-ok')"
  • python -m context_cli.cli ingest --help
  • pytest tests/test_context_cli/test_ingest.py tests/test_context_cli/test_cli.py -q

Result:

  • 31 passed

Scope

This PR is intentionally limited to Markdown ingestion.

Follow-up support for other formats such as txt, pdf, srt, or vtt should be added separately with their own parsing and failure-mode handling.

@Vadelo Vadelo changed the title feat/issue 24 user content ingest feat(cli): ingest local user-provided content into searchable corpora May 4, 2026

@deandevz deandevz left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Vadelo, thanks for taking this on. The direction here is right and the basic shape is the right plug-in point. I want to ship this feature, but the current implementation has a few things that need to be cut down or reworked before I can merge. Laying out the full picture below so you can rework it in one pass.

I also want to be upfront about one thing: the original issue (#24) was an idea ticket I wrote in a deliberately vague way. That gave you very little to ground decisions on, and a lot of the friction below comes from that. My fault, not yours.

What works and stays

  • Plug-in point on the export side is correct: write JSON to .king-context/data/[research/]<name>.json and call index_doc. The piece that needs to be added is the enrichment step before the export, see point 1.
  • Hidden file and binary ignore defaults are sensible.
  • The CLI ingestion summary (scanned / ignored / extensions) is useful.
  • Your added tests pass locally on my side.

What needs to change

1. Retrieval is broken because the synthetic metadata duplicates the scraper instead of reusing it

The CLI retrieval path (searcher.py, reader.py, grep.py, topics) reads from the JSON metadata indexes that index_doc builds: keywords.json, tags.json, use_cases.json. Those indexes are only as useful as the keywords / use_cases / tags written into each section. The FTS5 and embedding layers live in the MCP server side (king_context.db), not in the CLI flow, so they do not compensate when CLI metadata is weak.

Right now, the synthetic metadata for ingested sections is derived from the filename and title only:

keywords:  ['sample', 'intro']
use_cases: ['Reference user-provided content from sample.md']
tags:      ['user-content', 'markdown']
priority:  5

I tested locally with a tiny corpus (sample.md containing ## Intro\nThis is a sample note.):

  • kctx search "sample" finds it (the word is in the slug).
  • kctx search "note" returns "No results", even though the word is literally in the content. The CLI cannot find it because the metadata indexes do not contain it.

This is not a tunable problem. The fix is to reuse the existing enrichment path that already powers the scraper and research flows: king_context.scraper.enrich. That pipeline produces real keywords, use_cases, tags, and priority from content, has been used in production for the data/ and data/research/ corpora, and is the metadata contract the rest of the CLI was designed around.

Concretely:

  • Route ingested files through the existing enrichment entry point in king_context.scraper.enrich before exporting to JSON, so ingested sections get the same shape of metadata as scraped and researched sections.
  • Default behavior: enrichment on, matching king-scrape. A --no-enrich flag for offline ingest is fine if you want it, but not required.
  • Remove the synthetic _keywords_for_chunk, _use_case_for_file, _tags_for_extension, and the hardcoded priority: 5. They look like real metadata but are not, and they create a second metadata-generation method to maintain alongside the scraper one.

Building a parallel metadata path duplicates code that already works and splits the project into two ways of doing the same thing. Please reuse the proven path.

2. Section titles get dropped silently

_merge_small_sections merges any section under chunk_min_tokens (default 50) into the previous one, and discards the second title. Reproducible:

Input markdown:
## Intro
short content

## Section Two
short content

Output JSON:
1 section, title="Intro",
content="short content\n\nshort content"

For short personal notes this happens constantly. Your own tests pass chunk_min_tokens=1 to work around it, which suggests you noticed it. Please fix at the default level (most likely by removing the chunker entirely, see point 3).

3. Drop the custom chunker

_split_markdown_sections, _subdivide_content, _merge_small_sections, _split_paragraphs, _estimate_tokens reimplement chunking that already exists in the scraper pipeline. The simplest path is to emit one section per file. The enrichment step from point 1 then runs on that single section and produces the real metadata. If sub-file chunking is needed later, route through the scraper's chunker rather than maintaining a parallel one. Either way, the custom helpers go.

4. Cut metadata fields from five to one

You added source_type, source_file, source_format, source_collection, source_kind. None of these are read anywhere in searcher.py, reader.py, grep.py, or the topics flow. The existing pattern in research/export.py uses only source_type: "research". Please follow the same shape:

  • Keep source_type: "user-content" so the auto-routing in cli.py:_detect_source can pick it up.
  • Optional: source_file for traceability when the user wants to know where a section came from.
  • Drop source_format, source_collection, source_kind. They are derived from filename or doc name and add no retrieval value.

The block in indexer.py that propagates these five fields shrinks or goes away accordingly.

5. Drop --chunk-max-tokens and --chunk-min-tokens

If the chunker goes (point 3), these knobs go with it. They configure something only the contributor will tune.

6. Reduce this PR scope to .md only

Each of .txt, .pdf, .srt, .vtt has its own design surface that does not exist for .md. Trying to ship them all in one PR mixes problems that need different solutions:

  • .txt: looks simple but has its own edge cases (encoding detection, BOM handling, line-ending normalization, very large files). Worth its own focused pass once .md is solid.
  • .pdf: not the simple case it looks like. PDFs can contain embedded images, scanned pages where text is rendered as image, mixed structures with tables and equations, and content that breaks plain text extraction. Image storage and retrieval, OCR for scanned pages, and structure-aware parsing each warrant their own design call. Text-only PDFs work with pypdf, but the moment a user drops in a scanned PDF the result is empty or garbage. Shipping a PDF flag that silently fails on half of real PDFs is worse than not shipping it.
  • .srt and .vtt: raw text reading mixes timestamps into the section content. The exported JSON contains the WEBVTT header and lines like 00:00.000 --> 00:01.000 mixed with the spoken text, which pollutes what the agent reads and hurts token efficiency. Doing transcripts properly needs cue parsing, structured time metadata (start_time, end_time per section), and chunking by temporal topic block, which is its own enrich pass.

For this PR, please drop .txt, .pdf, .srt, and .vtt and ship the .md path only. Markdown is what the existing scraped corpora already use, so it slots into the current schema with the least friction. I will open follow-up issues for each of the other formats with proper scope, and you are welcome to take any of them on.

7. Remove the module-state rebinding in _cmd_ingest

ingest_mod.PROJECT_ROOT = PROJECT_ROOT
ingest_mod.STORE_DIR = STORE_DIR
ingest_mod.RESEARCH_STORE_DIR = RESEARCH_STORE_DIR

This exists only because the tests patch cli_mod but ingest_mod reads its own constants. Two cleaner approaches:

  • Pass store_dir and research_store_dir as arguments to ingest_path and resolve them in _cmd_ingest from the cli module constants.
  • Or have the ingest tests patch ingest_mod directly, so the module-level imports do their job.

Either way, runtime rebinding does not belong in the production path.

Suggested final shape

After the cuts, the PR should look roughly like:

  • ingest.py: around 80 to 120 lines covering file walk, ignore rules, .md reading, one-section-per-file emission, and a call into king_context.scraper.enrich to populate the section metadata.
  • cli.py: the _cmd_ingest and the ingest subparser, slimmed down.
  • indexer.py: minimal additive change, propagating only the one or two metadata fields that survive.
  • Tests for the reduced surface.
  • docs/CLI_GUIDE.md shortened to match.

That ends up a smaller diff and a cleaner contract to extend later.

Local checks I ran on this PR

  • pytest: 424 passed (ignoring two pre-existing collection errors unrelated to this PR, caused by a missing respx dev dep in my env).
  • pyproject.toml parse: ok.
  • Smoke kctx ingest ./notes: works, summary correct, JSON written, indexed.
  • Smoke kctx ingest transcript.vtt --source research: writes JSON with the timestamp pollution from point 6.
  • kctx search "<word from content>" against the ingested corpus: misses, as described in point 1.

Really appreciate the time you put into this. The core idea is one I want in the project and the call to plug into the existing JSON to indexer flow was correct. Once the points above are addressed I will rerun the suite, smoke test, and merge.

@Vadelo

Vadelo commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

Reworked this PR to align with the review feedback and remove the parallel retrieval path.

Main changes in this update:

  • reduced scope to .md only for now
  • removed the custom chunking / section-splitting pipeline entirely
  • switched ingestion to reuse the existing scraper enrichment flow
  • removed synthetic metadata generation
  • kept only the source metadata that is currently useful:
    • source_type
    • source_file
  • removed the chunk-related ingest flags from the CLI
  • removed the module-level state rebinding from _cmd_ingest
  • updated docs and tests to match the narrower contract
  • restored lazy scraper imports where needed so ingest/tests do not eagerly pull unrelated dependencies

The intent of this pass was to stop treating ingest as a separate retrieval architecture and instead make it a thin adapter into the existing King Context flow.

Validation I ran after the rework:

  • python -m py_compile src/context_cli/ingest.py src/context_cli/cli.py src/context_cli/indexer.py src/king_context/scraper/__init__.py tests/test_context_cli/test_ingest.py tests/test_context_cli/test_cli.py
  • python -c "import context_cli.ingest; print('ingest-import-ok')"
  • python -m context_cli.cli ingest --help
  • pytest tests/test_context_cli/test_ingest.py tests/test_context_cli/test_cli.py -q

Result:

  • 31 passed

I also added coverage for the specific failure patterns that were part of the original review concern:

  • one section per markdown file
  • no leftover chunk flags in the CLI help
  • no synthetic source metadata fields in indexed sections
  • post-ingest search still works through enriched metadata
  • partial enrichment failures surface explicitly instead of failing silently

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants