test: comprehensive email reader test suite (113 tests) + 3 bugfixes #343

Closed
LiamDGray wants to merge 3 commits into StarTrail-org:main from LiamDGray:feat/comprehensive-eml-tests

@LiamDGray

Summary

Adds a comprehensive 113-test suite for the LEANN email reader, covering all helper functions, thread detection patterns, attachment extraction, document building, and EmlReader/EmlxReader integration — plus 3 bug fixes discovered during testing.

Areas Covered

Helpers: _payload_to_text, _strip_html, _decode_subject, _decode_addr (32 tests)
Thread Detection: All 5 reply/forward patterns — Gmail, Outlook, forward header block, underscore line, Forwarded Message (13 tests)
RTF Extraction: 6 tests across all 3 methods (\htmlrtf0, embedded HTML, generic strip)
Attachment Extraction: 14 tests covering PDF, DOCX, TXT, CSV, MD, JSON, XML, YAML, RTF
Document Building: 14 tests — simple, HTML, multipart, thread separation, RTF body fallback, calendar invites, encoded subjects, edge cases
EmlReader: 10 integration tests — single/multiple, max_count, hidden dirs, malformed files
EmlxReader: 5 integration tests — length prefix format, extension filtering
Edge Cases: 12 tests — unicode, CRLF, bulk 100, nested multipart, extensionless RTF attachments
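
To illustrate the thread-detection area above, here is a minimal sketch of splitting an email body at the first reply/forward marker. The regular expressions and the `split_thread` name are assumptions made for illustration; the expressions the PR actually uses are not shown here, and its `_split_quoted_thread` helper may behave differently.

```python
import re

# Hypothetical patterns, one per marker family named in the PR description.
QUOTE_MARKERS = [
    # Gmail-style attribution line: "On <date> <sender> wrote:"
    re.compile(r"^On .+ wrote:\s*$", re.MULTILINE),
    # Outlook-style separator: "-----Original Message-----"
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.MULTILINE | re.IGNORECASE),
    # Forward header block: a "From:" line directly followed by "Sent:" or "Date:"
    re.compile(r"^From: .+\n(?:Sent|Date): .+$", re.MULTILINE),
    # Long underscore rule that some clients place above quoted text
    re.compile(r"^_{10,}\s*$", re.MULTILINE),
    # Gmail-style forward banner: "---------- Forwarded message ---------"
    re.compile(r"^-{2,}\s*Forwarded message\s*-{2,}\s*$", re.MULTILINE | re.IGNORECASE),
]


def split_thread(body: str) -> tuple[str, str]:
    """Return (main, quoted), split at the earliest marker; quoted is "" if none match."""
    starts = [m.start() for p in QUOTE_MARKERS if (m := p.search(body))]
    if not starts:
        return body.strip(), ""
    cut = min(starts)
    return body[:cut].strip(), body[cut:].strip()
```

Splitting at the earliest match keeps the author's new text separate from everything quoted below it, which is what allows main content to be prioritized during indexing.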

Bug Fixes

  1. _extract_rtf_text method 3 — was using \{[^}]*\} which deleted all text inside RTF groups. Replaced with direct brace removal + control word stripping to preserve actual content.
  2. _extract_attachment_text — extensionless rtf-body attachments from PST exports couldn't be parsed. Added optional content_type parameter with RTF fallback.
  3. _build_email_document — RTF body fallback now does direct _extract_rtf_text when extracted text is empty, handling PST export attachments without file extensions.
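
The first fix can be reproduced on a tiny RTF fragment. The sketch below compares the old group-removal regex quoted above with a brace-and-control-word strip in the spirit of the fix. The function names, the sample string, and the control-word regex are assumptions for illustration, not the code from `_extract_rtf_text`.

```python
import re

# Minimal RTF fragment (hypothetical sample): bold "Hello" followed by " world".
SAMPLE = r"{\rtf1\ansi {\b Hello} world\par}"


def strip_groups(rtf: str) -> str:
    """Old approach: delete every '{' ... first '}' span, including the text inside."""
    return re.sub(r"\{[^}]*\}", "", rtf)


def strip_braces_and_controls(rtf: str) -> str:
    """Fixed approach (sketch): remove control words and bare braces, keep the text."""
    # Control word = backslash, letters, optional signed number, optional trailing space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", rtf)
    text = text.replace("{", "").replace("}", "")
    return re.sub(r"\s+", " ", text).strip()


print(repr(strip_groups(SAMPLE)))               # "Hello" has been deleted
print(repr(strip_braces_and_controls(SAMPLE)))  # 'Hello world'
```

A strip this simple also keeps text from groups that are not body content (font tables, for example), so it is best suited as a last-resort method after more specific extraction has failed.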

Test Plan

python -m pytest tests/test_email_readers.py -q

→ 113 passed, all green.

LiamDGray added 3 commits May 30, 2026 12:31
- New EmlReader class parses .eml files from a directory
- Refactored shared email parsing helpers (_payload_to_text, _parse_email_message)
- Updated email_rag.py with --eml-path flag alongside existing --mail-path
- Fixed relative import in email_rag.py so it runs directly
- All existing EmlxReader functionality preserved
…eml reader

Core features:
- Reply/forward thread detection for Gmail, Outlook, and forwarded message formats
- Main content prioritized over quoted/replied text in RAG indexing
- Attachment text extraction for PDF (PyPDF2), Word (python-docx), TXT, RTF, CSV
- RTF body fallback for PST exports (TUSD emails) — MS Exchange \htmlrtf0 markers
- RFC 2047 subject decoding for international characters
- BeautifulSoup HTML stripping for HTML-only emails

All features use lazy imports with graceful fallbacks when optional deps are missing.
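
A lazy import with a graceful fallback might look like the sketch below, which uses HTML stripping as the example. The `strip_html` name and the regex fallback are assumptions for illustration; the fallback behaviour of the PR's `_strip_html` is not shown in this description.

```python
import re
from html import unescape


def strip_html(html: str) -> str:
    """Return visible text, using BeautifulSoup when installed and a regex otherwise."""
    try:
        # Imported inside the function so the module still loads without bs4.
        from bs4 import BeautifulSoup
    except ImportError:
        # Fallback (assumed): crude tag removal plus entity unescaping.
        text = unescape(re.sub(r"<[^>]+>", " ", html))
        return " ".join(text.split())
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return " ".join(text.split())
```

Deferring the import to call time means a missing optional dependency degrades output quality for that one feature rather than breaking the reader at import.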
Coverage includes:
- Helper functions: _payload_to_text, _strip_html, _decode_subject,
  _decode_addr, _split_quoted_thread, _extract_rtf_text,
  _extract_attachment_text, _collect_attachments
- Thread detection: all 5 patterns (Gmail, Outlook, forward header
  block, underscore line, Forwarded Message)
- Attachment extraction: PDF, DOCX, TXT, CSV, MD, JSON, XML, YAML, RTF
- Document building: simple, HTML, multipart, thread separation, RTF
  body fallback, calendar invites, encoded subjects, edge cases
- EmlReader: single/multiple files, max_count, hidden dirs, malformed
  files, nested directories, include_html, attachment extraction
- EmlxReader: length prefix format, extension filtering, attachments
- Edge cases: 100-bulk test, unicode, very long subjects, CRLF
  line endings, binary-only attachments, nested multipart, mixed
  .eml/.emlx in same directory

Production fixes:
- _extract_rtf_text method 3: replaced \{...\} group removal with
  direct brace stripping + control word removal — preserves text
  inside RTF groups instead of deleting it entirely
- _extract_attachment_text: accepts optional content_type parameter
  for extensionless files
- _collect_attachments: passes content_type to extraction; stores
  raw bytes for direct RTF fallback
- _build_email_document: RTF body fallback tries direct extraction
  when extracted_text is empty (handles extensionless rtf-body
  attachments from PST exports)
@LiamDGray LiamDGray closed this May 30, 2026
@LiamDGray LiamDGray deleted the feat/comprehensive-eml-tests branch May 30, 2026 22:40