Run LLM cleanup on original article content before storing#609

Open
mircealungu wants to merge 2 commits into master from llm-cleanup-original-article-content

Conversation

@mircealungu
Member

Summary

  • Adds cleanup_article_content() on SimplificationService — a Haiku pass that strips leftover scraper noise (audio/share widgets, "Listen to the article" buttons, related-article link lists, newsletter blurbs, ad text, cookie banners) from the body returned by the readability server.
  • Wires it into Article.find_or_create() between language detection and Source.find_or_create(), so both the upload-with-text path and the readability path get cleaned. The cleaned versions are what land in Source.content and Article.htmlContent, so the reader sees a clean article on first open.
  • Synchronous on the send-to-Zeeguu path. Latency is acceptable; junk on first read isn't.
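The fail-soft contract described above can be sketched as follows. This is an illustrative stand-in, not the actual Zeeguu code: the `llm_call` parameter and the `ingest` wrapper are hypothetical, and only the fail-soft shape (any error returns `None`, caller keeps the original) comes from the PR.

```python
# Hedged sketch of the fail-soft cleanup pattern described in the PR.
# `llm_call` and `ingest` are hypothetical names for illustration.

def cleanup_article_content(text, llm_call):
    """Return cleaned text, or None on any failure (fail-soft)."""
    try:
        return llm_call(text)
    except Exception:
        # The ingestion path must never break on a flaky API.
        return None


def ingest(text, llm_call):
    cleaned = cleanup_article_content(text, llm_call)
    # Keep the original content whenever cleanup fails.
    return cleaned if cleaned is not None else text


def flaky(_):
    raise RuntimeError("API down")


print(ingest("article body", flaky))      # original kept: article body
print(ingest("article body", str.upper))  # cleaned variant used: ARTICLE BODY
```

The key property is that a `None` return is indistinguishable, from the reader's perspective, from cleanup never having run: the article still lands in the database, just uncleaned.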

Why

With the reading flow flipping to send-to-Zeeguu as the default and full-article-externally being torn out, the original-content view becomes the primary read path, so the leftover noise missed by readability and the regex-based cleanup_non_content_bits() is now what every learner sees. A real example that prompted this: an article opened with "Listen to the article" inline in the body, because the source page had a TTS-button caption that readability kept.

Design notes

  • Fail-soft: any LLM error returns None and the original content is kept. The ingestion path never breaks on a flaky API.
  • Guardrails to avoid hallucination/over-deletion:
    • skip bodies under 500 chars (not worth the latency, rarely have junk)
    • skip if ANTHROPIC_TEXT_SIMPLIFICATION_KEY is unset
    • reject the result if cleaned text is >110% or <50% of original length
  • Prompt is conservative: model is told to remove only clearly non-article fragments, not to translate / simplify / summarize / reorder.
  • Not addressed in this PR (deliberately, to keep scope small):
    • French crawl-path bundling (cleanup-then-simplify in one LLM call for the existing crawl-time pipeline)
    • Completeness signal + filter-from-recommendations for articles that rely on missing interactive content
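The guardrails listed above can be sketched as two small predicates. The constant names match those mentioned later in the review pass (CLEANUP_MIN_BODY_CHARS / MAX_GROWTH / MIN_RETENTION) and the thresholds (500 chars, 110%, 50%) come from this PR; the function names and exact signatures are illustrative.

```python
# Sketch of the cleanup guardrails described above. Thresholds are from
# the PR; function names are illustrative, not the actual Zeeguu API.
import os

CLEANUP_MIN_BODY_CHARS = 500
MAX_GROWTH = 1.1     # reject if cleaned text grows past 110% of original
MIN_RETENTION = 0.5  # reject if cleaned text shrinks below 50% of original


def should_attempt_cleanup(text):
    if len(text) < CLEANUP_MIN_BODY_CHARS:
        return False  # short bodies: rarely have junk, not worth the latency
    if not os.environ.get("ANTHROPIC_TEXT_SIMPLIFICATION_KEY"):
        return False  # no API key configured: skip silently
    return True


def accept_cleaned(original, cleaned):
    ratio = len(cleaned) / len(original)
    # Outside this band the model likely hallucinated (growth) or
    # over-deleted real content (shrinkage); keep the original instead.
    return MIN_RETENTION <= ratio <= MAX_GROWTH
```

Keeping the acceptance check as a pure length-ratio comparison means a bad LLM response can never make the stored article worse than the uncleaned input.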

Test plan

  • Send a noisy article (e.g. one with a "Listen to the article" button visible in the body) via the extension/share flow and confirm the reader view no longer shows the widget text
  • Send an article on a domain that already cleans well — confirm no over-deletion (no missing paragraphs) and that the guardrail keeps the original when cleanup goes wrong
  • Send a very short article (<500 chars body) and confirm cleanup is skipped (no extra latency)
  • With ANTHROPIC_TEXT_SIMPLIFICATION_KEY unset, confirm ingestion still works and the original content is kept
  • Watch logs for Anthropic cleanup ... lines on staging before merging

🤖 Generated with Claude Code

The readability server + regex-based cleanup leaves visible junk in the
extracted body — "Listen to the article" buttons, related-article links,
share widgets, newsletter blurbs. With the reading-flow flip toward
send-to-Zeeguu being the default, the original-content view becomes the
primary read path, so this junk is what every learner sees on first open.

Adds a Haiku-based cleanup pass in SimplificationService and wires it into
Article.find_or_create() between language detection and Source creation.
Synchronous on the send path — latency is acceptable, junk on first read
isn't. Fail-soft: on any LLM error the original content is kept.

Guardrails:
- skip if body < 500 chars (not worth latency)
- skip if ANTHROPIC_TEXT_SIMPLIFICATION_KEY missing
- reject results where cleaned length is >110% or <50% of original
  (catches hallucination and over-deletion)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 14, 2026

ArchLens detected architectural changes in the following views:

- Gate cleanup on source_upload_id being set: only the user-upload
  (send-to-Zeeguu) path pays the Haiku cost. Crawler ingestion is
  being phased out and shouldn't trigger per-article LLM calls.
- Send HTML only, derive plain text from cleaned HTML via
  BeautifulSoup. Halves input tokens and removes a parsing
  inconsistency (LLM-rewritten text could drift from cleaned HTML).
- Drop unused language_code param.
- Hoist `import json` to module top.
- Extract CLEANUP_MIN_BODY_CHARS / MAX_GROWTH / MIN_RETENTION as
  module constants (replaces magic 500 / 1.1 / 0.5 literals).
- Use removeprefix/removesuffix for fence stripping (the previous
  `raw.strip("`")` was over-greedy, stripping backticks anywhere).
- max_tokens 16000 -> 8000 (cap runaway output cost).
- timeout 60 -> 30 (matches other Anthropic calls in this file;
  fail-soft fallback makes shorter timeout safe).
- Trim docstring/comment redundancy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
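The fence-stripping point above is worth making concrete: `raw.strip("` + "`" + `")` removes every leading and trailing backtick, so a response whose body legitimately begins or ends with inline code gets mangled, and a language tag after the opening fence survives. A minimal sketch of the exact-prefix approach, assuming the model wraps its answer in ```` ```html ```` or bare ```` ``` ```` fences:

```python
# Sketch of exact fence stripping with removeprefix/removesuffix,
# assuming responses may be wrapped in ```html or bare ``` fences.

def strip_fences(raw):
    raw = raw.strip()
    raw = raw.removeprefix("```html").removeprefix("```")
    raw = raw.removesuffix("```")
    return raw.strip()


print(strip_fences("```html\n<p>Hi</p>\n```"))  # -> <p>Hi</p>
print("`inline`".strip("`"))                    # over-greedy: inline
print(strip_fences("`inline`"))                 # preserved: `inline`
```

`removeprefix`/`removesuffix` only peel an exact match, so content-level backticks pass through untouched (they require Python 3.9+).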
