test: comprehensive email reader test suite (113 tests) + 3 bugfixes #343
Closed
LiamDGray wants to merge 3 commits into
- New EmlReader class parses .eml files from a directory
- Refactored shared email parsing helpers (_payload_to_text, _parse_email_message)
- Updated email_rag.py with --eml-path flag alongside existing --mail-path
- Fixed relative import in email_rag.py so it runs directly
- All existing EmlxReader functionality preserved
…eml reader
Core features:
- Reply/forward thread detection for Gmail, Outlook, and forwarded message formats
- Main content prioritized over quoted/replied text in RAG indexing
- Attachment text extraction for PDF (PyPDF2), Word (python-docx), TXT, RTF, CSV
- RTF body fallback for PST exports (TUSD emails): MS Exchange \htmlrtf0 markers
- RFC 2047 subject decoding for international characters
- BeautifulSoup HTML stripping for HTML-only emails
All features use lazy imports with graceful fallbacks when optional deps are missing.
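As a rough illustration of the RFC 2047 subject decoding listed above, the sketch below uses only the standard library. The function name `decode_subject_sketch` is hypothetical and is not the PR's `_decode_subject`; falling back to the raw header on failure is an assumption about what "graceful" means here.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header


def decode_subject_sketch(raw_subject: str) -> str:
    """Decode an RFC 2047 encoded-word subject into readable text.

    Hypothetical helper for illustration only; the real reader may differ.
    """
    try:
        # decode_header splits the value into (bytes_or_str, charset) chunks;
        # make_header joins them back into a single Unicode header.
        return str(make_header(decode_header(raw_subject)))
    except (HeaderParseError, LookupError, UnicodeDecodeError, ValueError):
        # Assumption: on a malformed or unknown encoding, keep the raw text
        # instead of failing the whole document.
        return raw_subject


print(decode_subject_sketch("=?utf-8?b?Q2Fmw6k=?="))
```

Both base64 (`?b?`) and quoted-printable (`?q?`) encoded words go through the same call, so the reader would not need to distinguish them.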
Coverage includes:
- Helper functions: _payload_to_text, _strip_html, _decode_subject,
_decode_addr, _split_quoted_thread, _extract_rtf_text,
_extract_attachment_text, _collect_attachments
- Thread detection: all 5 patterns (Gmail, Outlook, forward header
block, underscore line, Forwarded Message)
- Attachment extraction: PDF, DOCX, TXT, CSV, MD, JSON, XML, YAML, RTF
- Document building: simple, HTML, multipart, thread separation, RTF
body fallback, calendar invites, encoded subjects, edge cases
- EmlReader: single/multiple files, max_count, hidden dirs, malformed
files, nested directories, include_html, attachment extraction
- EmlxReader: length prefix format, extension filtering, attachments
- Edge cases: 100-bulk test, unicode, very long subjects, CRLF
line endings, binary-only attachments, nested multipart, mixed
.eml/.emlx in same directory
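For readers unfamiliar with the five thread patterns named in the coverage list, here is a minimal sketch of how such a splitter could work. The regular expressions and the name `split_quoted_thread_sketch` are assumptions made for illustration; the PR's `_split_quoted_thread` may use different patterns.

```python
import re

# Assumed marker patterns, one per thread style named in the coverage list.
_QUOTE_MARKERS = [
    # Gmail-style attribution line: "On <date> <sender> wrote:"
    re.compile(r"^On .+ wrote:[ \t]*$", re.MULTILINE),
    # Outlook-style separator
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}[ \t]*$",
               re.MULTILINE | re.IGNORECASE),
    # Forward header block: "From:" followed directly by "Sent:" or "Date:"
    re.compile(r"^From: .+\n(?:Sent|Date): .+$", re.MULTILINE),
    # Long underscore rule used by some clients
    re.compile(r"^_{10,}[ \t]*$", re.MULTILINE),
    # "---------- Forwarded message ----------"
    re.compile(r"^-{2,}\s*Forwarded message\s*-{2,}[ \t]*$",
               re.MULTILINE | re.IGNORECASE),
]


def split_quoted_thread_sketch(body: str) -> tuple[str, str]:
    """Return (main_content, quoted_thread), cutting at the earliest marker."""
    cut = len(body)
    for pattern in _QUOTE_MARKERS:
        match = pattern.search(body)
        if match:
            cut = min(cut, match.start())
    return body[:cut].strip(), body[cut:].strip()


reply = (
    "Thanks, see you then.\n\n"
    "On Mon, Jan 1, 2024 at 9:00 AM Alice <a@example.com> wrote:\n"
    "> Shall we meet?\n"
)
main, quoted = split_quoted_thread_sketch(reply)
print(main)
```

Cutting at the earliest marker keeps the new text as the main content, which is what lets an indexer prioritize it over quoted replies.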
Production fixes:
- _extract_rtf_text method 3: replaced \{...\} group removal with
direct brace stripping + control word removal — preserves text
inside RTF groups instead of deleting it entirely
- _extract_attachment_text: accepts optional content_type parameter
for extensionless files
- _collect_attachments: passes content_type to extraction; stores
raw bytes for direct RTF fallback
- _build_email_document: RTF body fallback tries direct extraction
when extracted_text is empty (handles extensionless rtf-body
attachments from PST exports)
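The difference between the old and new behaviour of `_extract_rtf_text` method 3 can be sketched as follows. Both functions are simplified stand-ins with hypothetical names, and the control-word regex is an assumption; neither handles hex escapes or other RTF details.

```python
import re

# Assumed pattern for an RTF control word such as \rtf1, \ansi or \b,
# including the single optional space that terminates it.
_CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")


def strip_groups_old(rtf: str) -> str:
    """Old approach: delete whole {...} groups, losing the text inside them."""
    text = re.sub(r"\{[^}]*\}", "", rtf)
    text = _CONTROL_WORD.sub("", text)
    return " ".join(text.split())


def strip_braces_new(rtf: str) -> str:
    """New approach: drop control words and bare braces, keep group text."""
    text = _CONTROL_WORD.sub("", rtf)
    text = text.replace("{", "").replace("}", "")
    return " ".join(text.split())


sample = r"{\rtf1\ansi {\b Hello} world}"
print(strip_groups_old(sample))
print(strip_braces_new(sample))
```

On this sample the old approach drops "Hello" because it sits inside a group, while the brace-stripping version keeps it.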
Summary
Adds a comprehensive 113-test suite for the LEANN email reader, covering all helper functions, thread detection patterns, attachment extraction, document building, and EmlReader/EmlxReader integration — plus 3 bug fixes discovered during testing.
Areas Covered
Helpers:
_payload_to_text, _strip_html, _decode_subject, _decode_addr (32 tests)
Thread Detection: 13 tests covering all 5 reply/forward patterns: Gmail, Outlook, forward header block, underscore line, Forwarded Message
RTF Extraction: 6 tests across all 3 methods (\htmlrtf0, embedded HTML, generic strip)
Attachment Extraction: 14 tests covering PDF, DOCX, TXT, CSV, MD, JSON, XML, YAML, RTF
Document Building: 14 tests — simple, HTML, multipart, thread separation, RTF body fallback, calendar invites, encoded subjects, edge cases
EmlReader: 10 integration tests — single/multiple, max_count, hidden dirs, malformed files
EmlxReader: 5 integration tests — length prefix format, extension filtering
Edge Cases: 12 tests — unicode, CRLF, bulk 100, nested multipart, extensionless RTF attachments
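The "length prefix format" exercised by the EmlxReader tests can be illustrated with a small parser. This sketch assumes an .emlx file starts with a line holding the decimal byte count of the message, followed by the message itself and then trailing metadata that is ignored; `parse_emlx_bytes` is a hypothetical name, not the PR's implementation.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def parse_emlx_bytes(raw: bytes) -> EmailMessage:
    """Parse .emlx content, assuming a leading decimal byte-count line."""
    prefix, _, rest = raw.partition(b"\n")
    # Assumption: the first line is the message length in bytes,
    # possibly padded with whitespace.
    length = int(prefix.decode("ascii").strip())
    # Anything after the counted bytes (e.g. a trailing plist) is ignored.
    return message_from_bytes(rest[:length], policy=policy.default)


message = b"Subject: Hi\r\n\r\nBody\r\n"
trailer = b'<?xml version="1.0"?><plist></plist>'
raw = str(len(message)).encode("ascii") + b"\n" + message + trailer
parsed = parse_emlx_bytes(raw)
print(parsed["Subject"])
```

Slicing by the declared length keeps the trailing metadata out of the indexed body text.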
Bug Fixes
_extract_rtf_text method 3: was using \{[^}]*\}, which deleted all text inside RTF groups. Replaced with direct brace removal + control word stripping to preserve actual content.
_extract_attachment_text: extensionless rtf-body attachments from PST exports couldn't be parsed. Added optional content_type parameter with RTF fallback.
_build_email_document: RTF body fallback now calls _extract_rtf_text directly when extracted text is empty, handling PST export attachments without file extensions.
Test Plan
→ 113 passed, all green.