gh-128110: Fix rfc2047 handling in email parser address headers by medmunds · Pull Request #130749 · python/cpython

medmunds · 2025-03-01T22:37:02Z

RFC 2047 Section 6.2 requires that "any 'linear-white-space' that separates a pair of adjacent 'encoded-word's is ignored." The modern header value parser correctly implements that for unstructured headers, but had missed a case in structured headers. This could cause a parsed address header to include extraneous spaces in a display-name.
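For readers who want to see the symptom, here is a minimal sketch using only the public email API. The header values and the address are made up for illustration; the same pair of adjacent encoded-words is placed in an unstructured header and in a display-name so the two results can be compared.

```python
from email import policy
from email.parser import Parser

# Hypothetical message: identical adjacent encoded-words in an unstructured
# header (Subject) and in the display-name of an address header (From).
raw = (
    "Subject: =?utf-8?q?Hello?= =?utf-8?q?World?=\n"
    "From: =?utf-8?q?Hello?= =?utf-8?q?World?= <user@example.com>\n"
    "\n"
)
msg = Parser(policy=policy.default).parsestr(raw)

subject = str(msg["Subject"])
display_name = msg["From"].addresses[0].display_name

# Unstructured header: the whitespace between the encoded-words is ignored.
print(repr(subject))
# Address header: a parser affected by this issue keeps a space here.
print(repr(display_name))
```

If the two printed values differ only by a space between the decoded words, the interpreter in use is affected.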

Fixed in get_atom() by converting a trailing CFWSList token after an encoded-word to an EWWhiteSpaceTerminal if another encoded-word follows.
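A small sketch of why the token swap matters, assuming the private `email._header_value_parser` module keeps its current names (it is not public API and may change between Python versions). It compares the decoded value of an ordinary whitespace terminal with the encoded-word variant, and shows what `get_atom()` returns for the first of two adjacent encoded-words.

```python
# Private module: names and behavior may change between Python versions.
from email import _header_value_parser as hvp

plain_ws = hvp.WhiteSpaceTerminal(" ", "fws")
ew_ws = hvp.EWWhiteSpaceTerminal(" ", "fws")

# Ordinary folding whitespace contributes a space to the decoded value;
# the encoded-word variant contributes nothing.
print(repr(plain_ws.value), repr(ew_ws.value))

# get_atom() parses one atom and returns it with the unparsed remainder.
atom, rest = hvp.get_atom("=?utf-8?q?Hello?= =?utf-8?q?World?=")
token_types = [tok.token_type for tok in atom]
print(token_types, repr(rest))
```

The second token of the atom is the trailing whitespace; which token type it carries depends on whether the change described above is present.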

Deliberately left similar code in get_dotatom() unmodified. A dotatom can only appear within an addr-spec. RFC 2047 Section 5 prohibits use of an encoded-word in any portion of an addr-spec, so its appearance in a dotatom is invalid. Adding (and testing) special white-space handling in an invalid dotatom seems an unnecessary complication.
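To illustrate the display-name versus addr-spec distinction from the generating side, here is a sketch with the public API. The name and address are hypothetical; the point is only that a non-ASCII display-name is serialized as an encoded-word while the addr-spec is written literally.

```python
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
# Hypothetical recipient with a non-ASCII display-name ("Zo\u00eb").
msg["To"] = Address(display_name="Zo\u00eb", username="user", domain="example.com")

# Serialize with the default policy (utf8=False), which forces RFC 2047
# encoding for the non-ASCII part of the header.
wire = msg.as_string()
print(wire)
```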

Fixes gh-128110

Suggest label: topic-email

github-actions · 2026-04-22T00:22:48Z

This PR is stale because it has been open for 30 days with no activity.

bitdancer

The fact that you put the tests in get_phrase points to get_phrase really being the locus of the bug. That's where the ews can end up next to each other. Here is a fix to get_phrase that passes all your tests. The if is complex, but that's because the circumstances where this situation comes up are very specific.

@@ -1473,6 +1473,16 @@ def get_phrase(value):
         else:
             try:
                 token, value = get_word(value)
+                if (token[0].token_type == 'encoded-word'
+                        and phrase
+                        and phrase[-1].token_type == 'atom'
+                        and len(phrase[-1]) > 1
+                        and phrase[-1][-2].token_type == 'encoded-word'
+                        and phrase[-1][-1].token_type == 'cfws'
+                        and not phrase[-1][-1].comments
+                    ):
+                    # linear ws between ews needs special handling...
+                    phrase[-1][-1] = EWWhiteSpaceTerminal(phrase[-1], 'fws')
             except errors.HeaderParseError:
                 if value[0] in CFWS_LEADER:
                     token, value = get_cfws(value)

This depends on the fact that "subsequent" atoms will never have leading whitespace, because that has already been consumed. I don't think it's worth adding extra code for the possibility of leading whitespace, because the parser won't produce it. It's a bit of parser fragility in the face of code changes, but I think that's a minor concern given the parser design (which is that it consumes whitespace greedily).
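The condition in the suggested diff can be exercised in isolation. The sketch below is not the patch itself: `bare_ws_between_encoded_words` and `check` are hypothetical helpers that copy the `if` from the diff and feed it atoms produced by the private `get_atom()` function, whose name and behavior may change between Python versions.

```python
# Private module: names and behavior may change between Python versions.
from email import _header_value_parser as hvp

def bare_ws_between_encoded_words(prev_token, next_word):
    """Hypothetical helper mirroring the condition in the suggested diff."""
    return bool(
        next_word[0].token_type == "encoded-word"
        and prev_token.token_type == "atom"
        and len(prev_token) > 1
        and prev_token[-2].token_type == "encoded-word"
        and prev_token[-1].token_type == "cfws"
        and not prev_token[-1].comments
    )

def check(text):
    """Parse two consecutive atoms from text and apply the predicate."""
    first, rest = hvp.get_atom(text)
    second, _ = hvp.get_atom(rest)
    return bare_ws_between_encoded_words(first, second)

# A comment between the encoded-words must not be discarded.
with_comment = check("=?utf-8?q?Hello?= (note) =?utf-8?q?World?=")
# Plain atoms keep their separating whitespace.
plain_atoms = check("Hello World")
# Bare whitespace between encoded-words: expected to match on a parser
# that does not already rewrite the whitespace inside get_atom().
adjacent_ews = check("=?utf-8?q?Hello?= =?utf-8?q?World?=")

print(with_comment, plain_atoms, adjacent_ews)
```

The second atom in each case starts directly at the next word, which is the greedy whitespace consumption the comment above relies on.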

bedevere-app · 2026-05-05T21:41:48Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

medmunds requested a review from a team as a code owner March 1, 2025 22:37

bedevere-app Bot added the awaiting review label Mar 1, 2025

bedevere-app Bot mentioned this pull request Mar 1, 2025

email.parser can insert extraneous spaces when parsing rfc2047 headers with policy.default #128110

Open

ZeroIntensity added the topic-email label Mar 2, 2025

caje731 approved these changes Mar 12, 2025

bedevere-app Bot added awaiting core review and removed awaiting review labels Mar 12, 2025

github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 22, 2026

medmunds mentioned this pull request Apr 30, 2026

gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address #122477

Merged

bitdancer requested changes May 5, 2026

bedevere-app Bot added awaiting changes and removed awaiting core review labels May 5, 2026

medmunds wants to merge 1 commit into python:main from medmunds:fix-issue-128110
