test: comprehensive email reader test suite (113 tests) + 3 bugfixes by LiamDGray · Pull Request #343 · StarTrail-org/LEANN

LiamDGray · 2026-05-30T22:38:10Z

Summary

Adds a comprehensive 113-test suite for the LEANN email reader, covering all helper functions, thread detection patterns, attachment extraction, document building, and EmlReader/EmlxReader integration — plus 3 bug fixes discovered during testing.

Areas Covered

Helpers: _payload_to_text, _strip_html, _decode_subject, _decode_addr (32 tests)
Thread Detection: All 5 reply/forward patterns — Gmail, Outlook, forward header block, underscore line, Forwarded Message (13 tests)
RTF Extraction: 6 tests across all 3 methods (\htmlrtf0, embedded HTML, generic strip)
Attachment Extraction: 14 tests covering PDF, DOCX, TXT, CSV, MD, JSON, XML, YAML, RTF
Document Building: 14 tests — simple, HTML, multipart, thread separation, RTF body fallback, calendar invites, encoded subjects, edge cases
EmlReader: 10 integration tests — single/multiple, max_count, hidden dirs, malformed files
EmlxReader: 5 integration tests — length prefix format, extension filtering
Edge Cases: 12 tests — unicode, CRLF, bulk 100, nested multipart, extensionless RTF attachments
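
As an illustration of the thread detection area listed above, here is a minimal sketch of how a reply/forward splitter might work. The regular expressions and the `split_quoted` name are assumptions made for this example; they are not the patterns used by `_split_quoted_thread`, only plausible stand-ins for the five marker styles the PR names.

```python
import re

# Hypothetical marker patterns, one per style named in the PR description.
QUOTE_MARKERS = [
    # Gmail-style reply header: "On <date> <sender> wrote:"
    re.compile(r"^On .+ wrote:\s*$", re.MULTILINE),
    # Outlook-style separator: "-----Original Message-----"
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.MULTILINE | re.IGNORECASE),
    # Forward header block: a "From:" line followed by "Sent:" or "Date:"
    re.compile(r"^From: .+\r?\n(?:Sent|Date): .+", re.MULTILINE),
    # Long underscore divider line
    re.compile(r"^_{10,}\s*$", re.MULTILINE),
    # "---------- Forwarded message ---------"
    re.compile(r"^-{2,}\s*Forwarded message\s*-{2,}\s*$", re.MULTILINE | re.IGNORECASE),
]


def split_quoted(body: str) -> tuple[str, str]:
    """Return (main_text, quoted_text), cutting at the earliest marker found."""
    starts = [m.start() for p in QUOTE_MARKERS if (m := p.search(body))]
    if not starts:
        return body.strip(), ""
    cut = min(starts)
    return body[:cut].strip(), body[cut:].strip()


body = (
    "Thanks, see you then.\n\n"
    "On Mon, Jan 1, 2024 at 10:00 AM Bob <bob@example.com> wrote:\n"
    "> Lunch at noon?\n"
)
main, quoted = split_quoted(body)
```

Cutting at the earliest match keeps the new text separate from everything quoted beneath it, which is what allows main content to be weighted ahead of replied text during indexing.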

Bug Fixes

_extract_rtf_text method 3 — was using \{[^}]*\} which deleted all text inside RTF groups. Replaced with direct brace removal + control word stripping to preserve actual content.
_extract_attachment_text — extensionless rtf-body attachments from PST exports couldn't be parsed. Added optional content_type parameter with RTF fallback.
_build_email_document — RTF body fallback now does direct _extract_rtf_text when extracted text is empty, handling PST export attachments without file extensions.
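
The first fix can be reproduced in isolation. The sketch below contrasts the old group-removal regex quoted above with a brace-and-control-word strip; the function names and the control-word pattern are assumptions for illustration, not the code in `_extract_rtf_text`, and real RTF (font tables, `\'xx` escapes) needs more handling than this.

```python
import re

RTF = r"{\rtf1\ansi {\b Hello} world\par}"


def strip_groups(rtf: str) -> str:
    # Old approach: delete everything from a "{" up to the next "}".
    # Because [^}] also matches "{", the text inside nested groups is lost.
    return re.sub(r"\{[^}]*\}", "", rtf)


def strip_braces_and_controls(rtf: str) -> str:
    # Remove control words such as \rtf1, \ansi, \b, \par, including an
    # optional numeric argument and a single delimiter space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", rtf)
    # Drop the braces themselves but keep what they enclose.
    text = re.sub(r"[{}]", "", text)
    # Collapse leftover whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(repr(strip_groups(RTF)))               # "Hello" is gone
print(repr(strip_braces_and_controls(RTF)))  # -> 'Hello world'
```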

Test Plan

python -m pytest tests/test_email_readers.py -q

→ 113 passed, all green.

- New EmlReader class parses .eml files from a directory
- Refactored shared email parsing helpers (_payload_to_text, _parse_email_message)
- Updated email_rag.py with --eml-path flag alongside existing --mail-path
- Fixed relative import in email_rag.py so it runs directly
- All existing EmlxReader functionality preserved

…eml reader

Core features:
- Reply/forward thread detection for Gmail, Outlook, and forwarded message formats
- Main content prioritized over quoted/replied text in RAG indexing
- Attachment text extraction for PDF (PyPDF2), Word (python-docx), TXT, RTF, CSV
- RTF body fallback for PST exports (TUSD emails): MS Exchange \htmlrtf0 markers
- RFC 2047 subject decoding for international characters
- BeautifulSoup HTML stripping for HTML-only emails

All features use lazy imports with graceful fallbacks when optional deps are missing.
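
For reference, RFC 2047 subject decoding can be done with the standard library alone. This is a sketch under the assumption that the PR's `_decode_subject` behaves similarly; the `decode_subject` function below is a hypothetical stand-in, not the project's helper.

```python
from email.header import decode_header, make_header


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a header value; fall back to the raw text."""
    try:
        return str(make_header(decode_header(raw)))
    except (LookupError, UnicodeDecodeError):
        # Unknown charset or undecodable bytes: return the header unchanged.
        return raw


subject = decode_subject("=?utf-8?q?Caf=C3=A9_menu?=")
plain = decode_subject("Weekly report")
```

Here `subject` is `"Café menu"` and `plain` is returned unchanged, since unencoded headers pass through `decode_header` untouched.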

Coverage includes:
- Helper functions: _payload_to_text, _strip_html, _decode_subject, _decode_addr, _split_quoted_thread, _extract_rtf_text, _extract_attachment_text, _collect_attachments
- Thread detection: all 5 patterns (Gmail, Outlook, forward header block, underscore line, Forwarded Message)
- Attachment extraction: PDF, DOCX, TXT, CSV, MD, JSON, XML, YAML, RTF
- Document building: simple, HTML, multipart, thread separation, RTF body fallback, calendar invites, encoded subjects, edge cases
- EmlReader: single/multiple files, max_count, hidden dirs, malformed files, nested directories, include_html, attachment extraction
- EmlxReader: length prefix format, extension filtering, attachments
- Edge cases: 100-bulk test, unicode, very long subjects, CRLF line endings, binary-only attachments, nested multipart, mixed .eml/.emlx in same directory

Production fixes:
- _extract_rtf_text method 3: replaced \{...\} group removal with direct brace stripping + control word removal, which preserves text inside RTF groups instead of deleting it entirely
- _extract_attachment_text: accepts optional content_type parameter for extensionless files
- _collect_attachments: passes content_type to extraction; stores raw bytes for direct RTF fallback
- _build_email_document: RTF body fallback tries direct extraction when extracted_text is empty (handles extensionless rtf-body attachments from PST exports)
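
The "length prefix format" exercised by the EmlxReader tests refers to Apple Mail's .emlx layout: the first line gives the byte count of the message, the message follows, and a property list trails it. A minimal sketch of parsing that layout is shown below; `parse_emlx_bytes` is a hypothetical helper written for this example and is not taken from the LEANN reader.

```python
import email
from email import policy
from email.message import EmailMessage


def parse_emlx_bytes(raw: bytes) -> EmailMessage:
    """Parse the contents of an .emlx file into an EmailMessage."""
    # The first line holds the byte length of the RFC 822 message that follows.
    first_line, _, rest = raw.partition(b"\n")
    length = int(first_line.strip())
    # Bytes past `length` are Apple Mail's trailing plist metadata; ignore them.
    message_bytes = rest[:length]
    return email.message_from_bytes(message_bytes, policy=policy.default)


# Build a tiny synthetic .emlx payload to exercise the parser.
msg_bytes = b"From: a@example.com\nSubject: Hello\n\nBody text\n"
raw = (
    str(len(msg_bytes)).encode()
    + b"\n"
    + msg_bytes
    + b'<?xml version="1.0"?><plist></plist>'
)
msg = parse_emlx_bytes(raw)
```

Slicing by the declared length, rather than searching for the start of the plist, avoids misparsing messages whose bodies happen to contain XML.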

LiamDGray added 3 commits May 30, 2026 12:31

LiamDGray closed this May 30, 2026

LiamDGray deleted the feat/comprehensive-eml-tests branch May 30, 2026 22:40

LiamDGray wants to merge 3 commits into StarTrail-org:main from LiamDGray:feat/comprehensive-eml-tests
