Problem
During analysis, article extraction fails for a wide range of reasons (paywalls, bot protection, dead links, DNS errors, TLS chain issues). Today the logs are informative, but the system does not clearly separate:
- expected/blocked failures (LinkedIn 999, 403/402, 410, etc.)
- transient/network failures worth retrying
- actionable configuration issues (e.g., missing CA chain in Docker)
This makes it hard to maintain extraction quality and to prioritize fixes.
Goal
Add structured failure classification + an explicit retry policy that keeps TLS verification enabled.
Proposed approach
- Introduce an error classifier that maps failures into categories:
- blocked (anti-bot/paywall)
- not_found (404/410)
- network (ENOTFOUND/EAI_AGAIN/timeouts)
- tls (UNABLE_TO_VERIFY_LEAF_SIGNATURE, etc.)
- malformed_response (HPE_* parse errors)
- Persist classification + metadata in the extraction retry queue (where applicable)
- Retry policy by category:
- network/tls: retry with exponential backoff up to N times
- blocked/not_found: do not retry (or retry once max)
- Add a small report at end of analyze:
- counts by category
- top failing domains
Acceptance criteria
- Analyze output includes a summary of extraction failures by category
- Retry queue only retries categories that make sense
- No TLS verification disabling
Tasks
Problem
During analysis, article extraction fails for a wide range of reasons (paywalls, bot protection, dead links, DNS errors, TLS chain issues). Today the logs are informative, but the system does not clearly separate:
This makes it hard to maintain extraction quality and to prioritize fixes.
Goal
Add structured failure classification + an explicit retry policy that keeps TLS verification enabled.
Proposed approach
Acceptance criteria
Tasks