Run LLM cleanup on original article content before storing#609

Open
mircealungu wants to merge 2 commits into master from llm-cleanup-original-article-content

Conversation

@mircealungu
Member

Summary

  • Adds cleanup_article_content() on SimplificationService — a Haiku pass that strips leftover scraper noise (audio/share widgets, "Listen to the article" buttons, related-article link lists, newsletter blurbs, ad text, cookie banners) from the body returned by the readability server.
  • Wires it into Article.find_or_create() between language detection and Source.find_or_create(), so both the upload-with-text path and the readability path get cleaned. The cleaned versions are what land in Source.content and Article.htmlContent, so the reader sees a clean article on first open.
  • Synchronous on the send-to-Zeeguu path. Latency is acceptable; junk on first read isn't.
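The fail-soft contract described above can be sketched as follows. This is an illustrative stand-in, not the actual Zeeguu code: the `llm_call` parameter and the `ingest` wrapper are hypothetical, and only the fail-soft shape (any error returns `None`, caller keeps the original) comes from the PR.

```python
# Hedged sketch of the fail-soft cleanup pattern described in the PR.
# `llm_call` and `ingest` are hypothetical names for illustration.

def cleanup_article_content(text, llm_call):
    """Return cleaned text, or None on any failure (fail-soft)."""
    try:
        return llm_call(text)
    except Exception:
        # The ingestion path must never break on a flaky API.
        return None


def ingest(text, llm_call):
    cleaned = cleanup_article_content(text, llm_call)
    # Keep the original content whenever cleanup fails.
    return cleaned if cleaned is not None else text


def flaky(_):
    raise RuntimeError("API down")


print(ingest("article body", flaky))      # original kept: article body
print(ingest("article body", str.upper))  # cleaned variant used: ARTICLE BODY
```

The key property is that a `None` return is indistinguishable, from the reader's perspective, from cleanup never having run: the article still lands in the database, just uncleaned.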

Why

With the reading flow flipping to send-to-Zeeguu as the default and full-article-externally being torn out, the original-content view becomes the primary read path, so the leftover noise missed by readability and the regex-based cleanup_non_content_bits() is now what every learner sees. A real example that prompted this: an article opened with "Listen to the article" inline in the body, because the source page had a TTS-button caption that readability kept.

Design notes

  • Fail-soft: any LLM error returns None and the original content is kept. The ingestion path never breaks on a flaky API.
  • Guardrails to avoid hallucination/over-deletion:
    • skip bodies under 500 chars (not worth the latency, rarely have junk)
    • skip if ANTHROPIC_TEXT_SIMPLIFICATION_KEY is unset
    • reject the result if cleaned text is >110% or <50% of original length
  • Prompt is conservative: model is told to remove only clearly non-article fragments, not to translate / simplify / summarize / reorder.
  • Not addressed in this PR (deliberately, to keep scope small):
    • French crawl-path bundling (cleanup-then-simplify in one LLM call for the existing crawl-time pipeline)
    • Completeness signal + filter-from-recommendations for articles that rely on missing interactive content
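The guardrails listed above can be sketched as two small predicates. The constant names match those mentioned later in the review pass (CLEANUP_MIN_BODY_CHARS / MAX_GROWTH / MIN_RETENTION) and the thresholds (500 chars, 110%, 50%) come from this PR; the function names and exact signatures are illustrative.

```python
# Sketch of the cleanup guardrails described above. Thresholds are from
# the PR; function names are illustrative, not the actual Zeeguu API.
import os

CLEANUP_MIN_BODY_CHARS = 500
MAX_GROWTH = 1.1     # reject if cleaned text grows past 110% of original
MIN_RETENTION = 0.5  # reject if cleaned text shrinks below 50% of original


def should_attempt_cleanup(text):
    if len(text) < CLEANUP_MIN_BODY_CHARS:
        return False  # short bodies: rarely have junk, not worth the latency
    if not os.environ.get("ANTHROPIC_TEXT_SIMPLIFICATION_KEY"):
        return False  # no API key configured: skip silently
    return True


def accept_cleaned(original, cleaned):
    ratio = len(cleaned) / len(original)
    # Outside this band the model likely hallucinated (growth) or
    # over-deleted real content (shrinkage); keep the original instead.
    return MIN_RETENTION <= ratio <= MAX_GROWTH
```

Keeping the acceptance check as a pure length-ratio comparison means a bad LLM response can never make the stored article worse than the uncleaned input.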

Test plan

  • Send a noisy article (e.g. one with a "Listen to the article" button visible in the body) via the extension/share flow and confirm the reader view no longer shows the widget text
  • Send an article on a domain that already cleans well — confirm no over-deletion (no missing paragraphs) and that the guardrail keeps the original when cleanup goes wrong
  • Send a very short article (<500 chars body) and confirm cleanup is skipped (no extra latency)
  • With ANTHROPIC_TEXT_SIMPLIFICATION_KEY unset, confirm ingestion still works and the original content is kept
  • Watch logs for Anthropic cleanup ... lines on staging before merging

🤖 Generated with Claude Code

The readability server + regex-based cleanup leaves visible junk in the
extracted body — "Listen to the article" buttons, related-article links,
share widgets, newsletter blurbs. With the reading-flow flip toward
send-to-Zeeguu being the default, the original-content view becomes the
primary read path, so this junk is what every learner sees on first open.

Adds a Haiku-based cleanup pass in SimplificationService and wires it into
Article.find_or_create() between language detection and Source creation.
Synchronous on the send path — latency is acceptable, junk on first read
isn't. Fail-soft: on any LLM error the original content is kept.

Guardrails:
- skip if body < 500 chars (not worth latency)
- skip if ANTHROPIC_TEXT_SIMPLIFICATION_KEY missing
- reject results where cleaned length is >110% or <50% of original
  (catches hallucination and over-deletion)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 14, 2026

ArchLens detected architectural changes in the following views:

- Gate cleanup on source_upload_id being set: only the user-upload
  (send-to-Zeeguu) path pays the Haiku cost. Crawler ingestion is
  being phased out and shouldn't trigger per-article LLM calls.
- Send HTML only, derive plain text from cleaned HTML via
  BeautifulSoup. Halves input tokens and removes a parsing
  inconsistency (LLM-rewritten text could drift from cleaned HTML).
- Drop unused language_code param.
- Hoist `import json` to module top.
- Extract CLEANUP_MIN_BODY_CHARS / MAX_GROWTH / MIN_RETENTION as
  module constants (replaces magic 500 / 1.1 / 0.5 literals).
- Use removeprefix/removesuffix for fence stripping (the previous
  `raw.strip("`")` was over-greedy, stripping backticks anywhere).
- max_tokens 16000 -> 8000 (cap runaway output cost).
- timeout 60 -> 30 (matches other Anthropic calls in this file;
  fail-soft fallback makes shorter timeout safe).
- Trim docstring/comment redundancy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
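The fence-stripping point above is worth making concrete: `raw.strip("` + "`" + `")` removes every leading and trailing backtick, so a response whose body legitimately begins or ends with inline code gets mangled, and a language tag after the opening fence survives. A minimal sketch of the exact-prefix approach, assuming the model wraps its answer in ```` ```html ```` or bare ```` ``` ```` fences:

```python
# Sketch of exact fence stripping with removeprefix/removesuffix,
# assuming responses may be wrapped in ```html or bare ``` fences.

def strip_fences(raw):
    raw = raw.strip()
    raw = raw.removeprefix("```html").removeprefix("```")
    raw = raw.removesuffix("```")
    return raw.strip()


print(strip_fences("```html\n<p>Hi</p>\n```"))  # -> <p>Hi</p>
print("`inline`".strip("`"))                    # over-greedy: inline
print(strip_fences("`inline`"))                 # preserved: `inline`
```

`removeprefix`/`removesuffix` only peel an exact match, so content-level backticks pass through untouched (they require Python 3.9+).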
