feat(cli): ingest local user-provided content into searchable corpora#28
feat(cli): ingest local user-provided content into searchable corpora#28Vadelo wants to merge 3 commits into
Conversation
deandevz
left a comment
There was a problem hiding this comment.
Hey @Vadelo, thanks for taking this on. The direction here is right and the basic shape is the right plug-in point. I want to ship this feature, but the current implementation has a few things that need to be cut down or reworked before I can merge. Laying out the full picture below so you can rework it in one pass.
I also want to be upfront about one thing: the original issue (#24) was an idea ticket I wrote in a deliberately vague way. That gave you very little to ground decisions on, and a lot of the friction below comes from that. My fault, not yours.
What works and stays
- Plug-in point on the export side is correct: write JSON to
.king-context/data/[research/]<name>.jsonand callindex_doc. The piece that needs to be added is the enrichment step before the export, see point 1. - Hidden file and binary ignore defaults are sensible.
- The CLI ingestion summary (scanned / ignored / extensions) is useful.
- Your added tests pass locally on my side.
What needs to change
1. Retrieval is broken because the synthetic metadata duplicates the scraper instead of reusing it
The CLI retrieval path (searcher.py, reader.py, grep.py, topics) reads from the JSON metadata indexes that index_doc builds: keywords.json, tags.json, use_cases.json. Those indexes are only as useful as the keywords / use_cases / tags written into each section. The FTS5 and embedding layers live in the MCP server side (king_context.db), not in the CLI flow, so they do not compensate when CLI metadata is weak.
Right now, the synthetic metadata for ingested sections is derived from the filename and title only:
keywords: ['sample', 'intro']
use_cases: ['Reference user-provided content from sample.md']
tags: ['user-content', 'markdown']
priority: 5
I tested locally with a tiny corpus (sample.md containing ## Intro\nThis is a sample note.):
kctx search "sample"finds it (the word is in the slug).kctx search "note"returns "No results", even though the word is literally in the content. The CLI cannot find it because the metadata indexes do not contain it.
This is not a tunable problem. The fix is to reuse the existing enrichment path that already powers the scraper and research flows: king_context.scraper.enrich. That pipeline produces real keywords, use_cases, tags, and priority from content, has been used in production for the data/ and data/research/ corpora, and is the metadata contract the rest of the CLI was designed around.
Concretely:
- Route ingested files through the existing enrichment entry point in
king_context.scraper.enrichbefore exporting to JSON, so ingested sections get the same shape of metadata as scraped and researched sections. - Default behavior: enrichment on, matching
king-scrape. A--no-enrichflag for offline ingest is fine if you want it, but not required. - Remove the synthetic
_keywords_for_chunk,_use_case_for_file,_tags_for_extension, and the hardcodedpriority: 5. They look like real metadata but are not, and they create a second metadata-generation method to maintain alongside the scraper one.
Building a parallel metadata path duplicates code that already works and splits the project into two ways of doing the same thing. Please reuse the proven path.
2. Section titles get dropped silently
_merge_small_sections merges any section under chunk_min_tokens (default 50) into the previous one, and discards the second title. Reproducible:
Input markdown:
## Intro
short content
## Section Two
short content
Output JSON:
1 section, title="Intro",
content="short content\n\nshort content"
For short personal notes this happens constantly. Your own tests pass chunk_min_tokens=1 to work around it, which suggests you noticed it. Please fix at the default level (most likely by removing the chunker entirely, see point 3).
3. Drop the custom chunker
_split_markdown_sections, _subdivide_content, _merge_small_sections, _split_paragraphs, _estimate_tokens reimplement chunking that already exists in the scraper pipeline. The simplest path is to emit one section per file. The enrichment step from point 1 then runs on that single section and produces the real metadata. If sub-file chunking is needed later, route through the scraper's chunker rather than maintaining a parallel one. Either way, the custom helpers go.
4. Cut metadata fields from five to one
You added source_type, source_file, source_format, source_collection, source_kind. None of these are read anywhere in searcher.py, reader.py, grep.py, or the topics flow. The existing pattern in research/export.py uses only source_type: "research". Please follow the same shape:
- Keep
source_type: "user-content"so the auto-routing incli.py:_detect_sourcecan pick it up. - Optional:
source_filefor traceability when the user wants to know where a section came from. - Drop
source_format,source_collection,source_kind. They are derived from filename or doc name and add no retrieval value.
The block in indexer.py that propagates these five fields shrinks or goes away accordingly.
5. Drop --chunk-max-tokens and --chunk-min-tokens
If the chunker goes (point 3), these knobs go with it. They configure something only the contributor will tune.
6. Reduce this PR scope to .md only
Each of .txt, .pdf, .srt, .vtt has its own design surface that does not exist for .md. Trying to ship them all in one PR mixes problems that need different solutions:
.txt: looks simple but has its own edge cases (encoding detection, BOM handling, line-ending normalization, very large files). Worth its own focused pass once.mdis solid..pdf: not the simple case it looks like. PDFs can contain embedded images, scanned pages where text is rendered as image, mixed structures with tables and equations, and content that breaks plain text extraction. Image storage and retrieval, OCR for scanned pages, and structure-aware parsing each warrant their own design call. Text-only PDFs work withpypdf, but the moment a user drops in a scanned PDF the result is empty or garbage. Shipping a PDF flag that silently fails on half of real PDFs is worse than not shipping it..srtand.vtt: raw text reading mixes timestamps into the section content. The exported JSON contains theWEBVTTheader and lines like00:00.000 --> 00:01.000mixed with the spoken text, which pollutes what the agent reads and hurts token efficiency. Doing transcripts properly needs cue parsing, structured time metadata (start_time,end_timeper section), and chunking by temporal topic block, which is its own enrich pass.
For this PR, please drop .txt, .pdf, .srt, and .vtt and ship the .md path only. Markdown is what the existing scraped corpora already use, so it slots into the current schema with the least friction. I will open follow-up issues for each of the other formats with proper scope, and you are welcome to take any of them on.
7. Remove the module-state rebinding in _cmd_ingest
ingest_mod.PROJECT_ROOT = PROJECT_ROOT
ingest_mod.STORE_DIR = STORE_DIR
ingest_mod.RESEARCH_STORE_DIR = RESEARCH_STORE_DIR
This exists only because the tests patch cli_mod but ingest_mod reads its own constants. Two cleaner approaches:
- Pass
store_dirandresearch_store_diras arguments toingest_pathand resolve them in_cmd_ingestfrom the cli module constants. - Or have the ingest tests patch
ingest_moddirectly, so the module-level imports do their job.
Either way, runtime rebinding does not belong in the production path.
Suggested final shape
After the cuts, the PR should look roughly like:
ingest.py: around 80 to 120 lines covering file walk, ignore rules,.mdreading, one-section-per-file emission, and a call intoking_context.scraper.enrichto populate the section metadata.cli.py: the_cmd_ingestand theingestsubparser, slimmed down.indexer.py: minimal additive change, propagating only the one or two metadata fields that survive.- Tests for the reduced surface.
docs/CLI_GUIDE.mdshortened to match.
That ends up a smaller diff and a cleaner contract to extend later.
Local checks I ran on this PR
pytest: 424 passed (ignoring two pre-existing collection errors unrelated to this PR, caused by a missingrespxdev dep in my env).pyproject.tomlparse: ok.- Smoke
kctx ingest ./notes: works, summary correct, JSON written, indexed. - Smoke
kctx ingest transcript.vtt --source research: writes JSON with the timestamp pollution from point 6. kctx search "<word from content>"against the ingested corpus: misses, as described in point 1.
Really appreciate the time you put into this. The core idea is one I want in the project and the call to plug into the existing JSON to indexer flow was correct. Once the points above are addressed I will rerun the suite, smoke test, and merge.
|
Reworked this PR to align with the review feedback and remove the parallel retrieval path. Main changes in this update:
The intent of this pass was to stop treating ingest as a separate retrieval architecture and instead make it a thin adapter into the existing King Context flow. Validation I ran after the rework:
Result:
I also added coverage for the specific failure patterns that were part of the original review concern:
|
Summary
This PR reworks
kctx ingestto align with the existing King Context retrieval architecture instead of maintaining a parallel ingestion/chunking path.The feature is now intentionally narrower and more predictable:
What changed
.mdfiles onlysource_typesource_file_cmd_ingestWhy this approach
The first version solved the feature in a parallel way, but it duplicated behavior that the project already has in the scraper pipeline.
This rework keeps the feature aligned with the project’s stronger invariant:
Behavior
kctx ingestnow:.king-context/data/<name>.json.king-context/data/research/<name>.jsonTesting
Validated with:
python -m py_compile src/context_cli/ingest.py src/context_cli/cli.py src/context_cli/indexer.py src/king_context/scraper/__init__.py tests/test_context_cli/test_ingest.py tests/test_context_cli/test_cli.pypython -c "import context_cli.ingest; print('ingest-import-ok')"python -m context_cli.cli ingest --helppytest tests/test_context_cli/test_ingest.py tests/test_context_cli/test_cli.py -qResult:
31 passedScope
This PR is intentionally limited to Markdown ingestion.
Follow-up support for other formats such as
txt,pdf,srt, orvttshould be added separately with their own parsing and failure-mode handling.