
Improve extraction failure handling: classify expected blocks vs actionable failures + retry strategy #3

Description

@szverev

Problem

During analysis, article extraction fails for a wide range of reasons (paywalls, bot protection, dead links, DNS errors, TLS chain issues). Today the logs are informative, but the system does not clearly separate:

  • expected/blocked failures (LinkedIn 999, 403/402, 410, etc.)
  • transient/network failures worth retrying
  • actionable configuration issues (e.g., missing CA chain in Docker)

This makes it hard to maintain extraction quality and to prioritize fixes.

Goal

Add structured failure classification + an explicit retry policy that keeps TLS verification enabled.

Proposed approach

  • Introduce an error classifier that maps failures into categories:
    • blocked (anti-bot/paywall)
    • not_found (404/410)
    • network (ENOTFOUND/EAI_AGAIN/timeouts)
    • tls (UNABLE_TO_VERIFY_LEAF_SIGNATURE, etc.)
    • malformed_response (HPE_* parse errors)
  • Persist classification + metadata in the extraction retry queue (where applicable)
  • Retry policy by category:
    • network/tls: retry with exponential backoff up to N times
    • blocked/not_found: do not retry (or retry at most once)
  • Add a small report at end of analyze:
    • counts by category
    • top failing domains
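
A minimal sketch of the classifier described above, assuming each failure carries either an HTTP status or a Node.js-style error code. All names here (`FailureCategory`, `ExtractionFailure`, `classifyFailure`) are hypothetical, as is the `unknown` fallback category and the use of `ETIMEDOUT` to stand in for "timeouts"; the status and code lists would need to be checked against what the extractor actually sees.

```typescript
// Hypothetical types and names; not taken from the existing codebase.
type FailureCategory =
  | "blocked"
  | "not_found"
  | "network"
  | "tls"
  | "malformed_response"
  | "unknown"; // assumption: fallback for anything not covered by the issue

interface ExtractionFailure {
  status?: number; // HTTP status, if a response was received
  code?: string; // error code, if the request itself failed
}

// Statuses named in the issue as expected blocks (999 is the LinkedIn case).
const BLOCKED_STATUSES = new Set<number>([402, 403, 999]);
const NOT_FOUND_STATUSES = new Set<number>([404, 410]);

// ETIMEDOUT is an assumed representative of the "timeouts" bucket.
const NETWORK_CODES = new Set<string>(["ENOTFOUND", "EAI_AGAIN", "ETIMEDOUT"]);

// Only the code named in the issue; further TLS codes can be added here.
const TLS_CODES = new Set<string>(["UNABLE_TO_VERIFY_LEAF_SIGNATURE"]);

function classifyFailure(failure: ExtractionFailure): FailureCategory {
  const { status, code } = failure;
  if (status !== undefined) {
    if (BLOCKED_STATUSES.has(status)) return "blocked";
    if (NOT_FOUND_STATUSES.has(status)) return "not_found";
  }
  if (code !== undefined) {
    if (code.startsWith("HPE_")) return "malformed_response";
    if (TLS_CODES.has(code)) return "tls";
    if (NETWORK_CODES.has(code)) return "network";
  }
  return "unknown";
}
```

Keeping the mapping in plain sets means new statuses or codes can be added without touching the control flow, and the `unknown` bucket makes unclassified failures visible in the report instead of silently landing in another category.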

Acceptance criteria

  • Analyze output includes a summary of extraction failures by category
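  • Retry queue only retries categories where a retry can plausibly succeed (network, tls); blocked and not_found are not retried, or are retried at most once
  • TLS verification is never disabled, including on retries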
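
One way the retry criterion could be expressed is a per-category retry budget plus an exponential backoff function. This is a sketch under assumptions: `N`, `BASE_DELAY_MS`, `MAX_DELAY_MS`, and `nextRetryDelayMs` are hypothetical names, the value 3 for N is a placeholder (the issue leaves N open), and treating `malformed_response` as non-retryable is a guess since the issue does not specify it.

```typescript
// Hypothetical names and values; N is deliberately a placeholder.
type FailureCategory =
  | "blocked"
  | "not_found"
  | "network"
  | "tls"
  | "malformed_response";

const N = 3; // placeholder retry budget for retryable categories
const BASE_DELAY_MS = 1000; // assumed delay before the first retry
const MAX_DELAY_MS = 60000; // assumed cap so delays do not grow unbounded

const MAX_RETRIES: Record<FailureCategory, number> = {
  network: N,
  tls: N,
  blocked: 0,
  not_found: 0,
  malformed_response: 0, // assumption: not specified in the issue
};

/**
 * Returns the delay before the next retry, or null if the item should
 * not be retried. `retriesSoFar` is the number of retries already made.
 */
function nextRetryDelayMs(
  category: FailureCategory,
  retriesSoFar: number
): number | null {
  if (retriesSoFar >= MAX_RETRIES[category]) return null;
  // Exponential backoff: base, 2x base, 4x base, ... capped at MAX_DELAY_MS.
  return Math.min(BASE_DELAY_MS * Math.pow(2, retriesSoFar), MAX_DELAY_MS);
}
```

Nothing in this policy changes request options, so a `tls` failure is retried with certificate verification still enabled; the retry only helps if the underlying cause (for example a CA chain fix in the image) has been addressed in the meantime.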

Tasks

  • Implement classification
  • Update queue persistence schema/fields if needed
  • Add end-of-run report
  • Add docs on common failure categories
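
For the end-of-run report task, the aggregation itself could be quite small. The sketch below assumes each recorded failure has a URL and a category string; `FailureRecord`, `FailureReport`, and `summarizeFailures` are hypothetical names, and the default of five top domains is arbitrary.

```typescript
// Hypothetical shapes; the real queue records may carry more fields.
interface FailureRecord {
  url: string;
  category: string;
}

interface FailureReport {
  countsByCategory: Record<string, number>;
  topDomains: Array<[string, number]>; // [domain, failure count], descending
}

function summarizeFailures(
  failures: FailureRecord[],
  topN: number = 5
): FailureReport {
  const countsByCategory: Record<string, number> = {};
  const countsByDomain = new Map<string, number>();

  for (const failure of failures) {
    countsByCategory[failure.category] =
      (countsByCategory[failure.category] ?? 0) + 1;

    // Group unparseable URLs under a placeholder instead of throwing.
    let domain: string;
    try {
      domain = new URL(failure.url).hostname;
    } catch {
      domain = "(invalid url)";
    }
    countsByDomain.set(domain, (countsByDomain.get(domain) ?? 0) + 1);
  }

  const topDomains = Array.from(countsByDomain.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);

  return { countsByCategory, topDomains };
}
```

The returned object can be printed at the end of analyze or serialized alongside the run output, which covers both "counts by category" and "top failing domains" from a single pass over the failures.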
