Run LLM cleanup on original article content before storing #609
Open
mircealungu wants to merge 2 commits into
Conversation
The readability server + regex-based cleanup leaves visible junk in the extracted body: "Listen to the article" buttons, related-article links, share widgets, newsletter blurbs. With the reading-flow flip toward send-to-Zeeguu being the default, the original-content view becomes the primary read path, so this junk is what every learner sees on first open.

Adds a Haiku-based cleanup pass in SimplificationService and wires it into Article.find_or_create() between language detection and Source creation. Synchronous on the send path: latency is acceptable, junk on first read isn't. Fail-soft: on any LLM error the original content is kept.

Guardrails:
- skip if body < 500 chars (not worth the latency)
- skip if ANTHROPIC_TEXT_SIMPLIFICATION_KEY is missing
- reject results where cleaned length is >110% or <50% of original (catches hallucination and over-deletion)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Gate cleanup on source_upload_id being set: only the user-upload
(send-to-Zeeguu) path pays the Haiku cost. Crawler ingestion is
being phased out and shouldn't trigger per-article LLM calls.
- Send HTML only, derive plain text from cleaned HTML via
BeautifulSoup. Halves input tokens and removes a parsing
inconsistency (LLM-rewritten text could drift from cleaned HTML).
- Drop unused language_code param.
- Hoist `import json` to module top.
- Extract CLEANUP_MIN_BODY_CHARS / MAX_GROWTH / MIN_RETENTION as
module constants (replaces magic 500 / 1.1 / 0.5 literals).
- Use removeprefix/removesuffix for fence stripping (the previous
  `raw.strip("`")` was over-greedy: it removed every leading and
  trailing backtick, not just the fence markers).
- max_tokens 16000 -> 8000 (cap runaway output cost).
- timeout 60 -> 30 (matches other Anthropic calls in this file;
fail-soft fallback makes shorter timeout safe).
- Trim docstring/comment redundancy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
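The fence-stripping point above can be illustrated with a small sketch. This is a simplified stand-in, not the file's actual code; `strip_code_fence` and the `html` language tag are assumptions:

```python
def strip_code_fence(raw: str) -> str:
    """Remove a surrounding Markdown code fence from an LLM reply.

    Unlike raw.strip("`"), which removes every leading and trailing
    backtick (including ones that belong to the content when they sit
    next to the fence), removeprefix/removesuffix take off exactly one
    fence marker from each end.
    """
    text = raw.strip()
    # Order matters: try the tagged fence before the bare one.
    text = text.removeprefix("```html")
    text = text.removeprefix("```")
    text = text.removesuffix("```")
    return text.strip()
```

`removeprefix`/`removesuffix` require Python 3.9+, which the codebase presumably already targets.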

Summary
cleanup_article_content() on SimplificationService: a Haiku pass that strips leftover scraper noise (audio/share widgets, "Listen to the article" buttons, related-article link lists, newsletter blurbs, ad text, cookie banners) from the body returned by the readability server. It is wired into Article.find_or_create() between language detection and Source.find_or_create(), so both the upload-with-text path and the readability path get cleaned. The cleaned versions are what land in Source.content and Article.htmlContent, so the reader sees a clean article on first open.

Why
With the reading-flow flip toward send-to-Zeeguu as the default and full-article-externally being torn out, the original-content view becomes the primary read path, so the leftover noise that readability + the regex cleanup_non_content_bits() miss is now what every learner sees. A real example that prompted this: an article opened with "Listen to the article" appearing inline in the body because the source page had a TTS-button caption that readability kept.

Design notes
- Fail-soft: on any LLM error the method returns None and the original content is kept. The ingestion path never breaks on a flaky API.
- Skipped when ANTHROPIC_TEXT_SIMPLIFICATION_KEY is unset.

Test plan
- With ANTHROPIC_TEXT_SIMPLIFICATION_KEY unset, confirm ingestion still works and the original content is kept.
- Watch the "Anthropic cleanup ..." lines on staging before merging.

🤖 Generated with Claude Code