gh-128110: Fix rfc2047 handling in email parser address headers #130749
medmunds wants to merge 1 commit into python:main
RFC 2047 Section 6.2 requires that "any 'linear-white-space' that separates a pair of adjacent 'encoded-word's is ignored." The modern header value parser correctly implements that for unstructured headers, but had missed a case in structured headers. This could cause a parsed address header to include extraneous spaces in a display-name.
Fixed in get_atom() by converting a trailing CFWSList token after an encoded-word to an EWWhiteSpaceTerminal if another encoded-word follows.
Deliberately left similar code in get_dotatom() unmodified. A dotatom can only appear within an addr-spec. RFC 2047 Section 5 prohibits use of an encoded-word in any portion of an addr-spec, so its appearance in a dotatom is invalid. Adding (and testing) special white-space handling in an invalid dotatom seems an unnecessary complication.
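For anyone wanting to reproduce this, here is a minimal sketch using only the public email API. The message text and the example.com address are made up for illustration. The Subject line shows the unstructured-header behavior described above; the display name printed for From depends on whether the interpreter includes a fix for this issue, so no particular value is claimed for it here.

```python
from email import message_from_string, policy

# Hypothetical message: the same two adjacent encoded-words appear in an
# unstructured header (Subject) and in a structured address header (From).
raw = (
    "From: =?utf-8?q?Hello?= =?utf-8?q?World?= <user@example.com>\n"
    "Subject: =?utf-8?q?Hello?= =?utf-8?q?World?=\n"
    "\n"
    "body\n"
)

msg = message_from_string(raw, policy=policy.default)

# Unstructured header: the white space between the encoded-words is ignored.
subject = str(msg["Subject"])
print(repr(subject))

# Structured header: without a fix, the display name may keep a space
# between the two decoded words.
address = msg["From"].addresses[0]
display_name = address.display_name
print(repr(display_name))
print(address.addr_spec)
```

Comparing the two printed values on a given Python build shows whether the structured-header path still differs from the unstructured one.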
This PR is stale because it has been open for 30 days with no activity.
bitdancer left a comment
The fact that you put the tests in get_phrase suggests that get_phrase is really the locus of the bug: that's where the encoded-words can end up next to each other. Here is a fix to get_phrase that passes all your tests. The if is complex, but that's because the circumstances in which this situation comes up are very specific.
@@ -1473,6 +1473,16 @@ def get_phrase(value):
         else:
             try:
                 token, value = get_word(value)
+                if (token[0].token_type == 'encoded-word'
+                        and phrase
+                        and phrase[-1].token_type == 'atom'
+                        and len(phrase[-1]) > 1
+                        and phrase[-1][-2].token_type == 'encoded-word'
+                        and phrase[-1][-1].token_type == 'cfws'
+                        and not phrase[-1][-1].comments
+                        ):
+                    # linear ws between ews needs special handling...
+                    phrase[-1][-1] = EWWhiteSpaceTerminal(phrase[-1], 'fws')
             except errors.HeaderParseError:
                 if value[0] in CFWS_LEADER:
                     token, value = get_cfws(value)
This depends on the fact that "subsequent" atoms will never have leading whitespace, because that has already been consumed. I don't think it's worth adding extra code for the possibility of leading whitespace, because the parser won't produce it. It's a bit of parser fragility in the face of code changes, but I think that's a minor concern given the parser design, which is that it consumes whitespace greedily.
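To see the token layout that the condition above inspects, here is a rough sketch that calls the parser's private get_phrase() directly. email._header_value_parser is a CPython implementation detail, not a public API, so this is only for exploration; the input string is made up. The type of the trailing token in the first atom depends on whether a fix is present in the interpreter, so the sketch prints the shape rather than asserting it.

```python
# Private CPython module: an implementation detail that may change.
from email import _header_value_parser as hvp

# Hypothetical input: two adjacent encoded-words followed by an angle-addr.
value = "=?utf-8?q?Hello?= =?utf-8?q?World?= <user@example.com>"

phrase, rest = hvp.get_phrase(value)

# Each word in the phrase is an atom. List the token types inside each atom
# to show where the white space after an encoded-word ends up.
shape = [(word.token_type, [tok.token_type for tok in word]) for word in phrase]
for word_type, child_types in shape:
    print(word_type, child_types)

# Whatever get_phrase() did not consume is returned for the caller to parse.
print(repr(rest))
```

Because white space is consumed greedily as the trailing token of the preceding atom, the second atom starts directly with its encoded-word, which is the property the suggested check relies on.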
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
Fixes gh-128110
Suggest label: topic-email