Skip to content

Use Content-Type charset when decoding download filenames (#3135)#4184

Open
linesight wants to merge 4 commits into
chromiumembedded:masterfrom
linesight:fix-3135-chinese-filename-charset
Open

Use Content-Type charset when decoding download filenames (#3135)#4184
linesight wants to merge 4 commits into
chromiumembedded:masterfrom
linesight:fix-3135-chinese-filename-charset

Conversation

@linesight

@linesight linesight commented May 31, 2026

Copy link
Copy Markdown

Fixes #3135

When a server's Content-Disposition carries non-ASCII filename bytes without an RFC 5987 filename*, HttpContentDisposition::DecodeWord needs a charset hint. CEF currently passes an empty string, so DecodeWord falls back through IsStringUTF8() to SysNativeMBToWide() — the platform locale. On macOS and Linux that produces garbled filenames whenever the server's encoding doesn't match the user's locale, most commonly Chinese sites that emit GBK filenames with Content-Type: ...; charset=gb2312.

The fix pulls the charset off the in-flight response headers and passes it as referrer_charset.

When no charset is declared, GetCharset is a no-op and behavior is identical to before.

linesight added 3 commits May 31, 2026 08:46
…ame decoding (fixes issue chromiumembedded#3135)

On macOS and Linux, when a server sends a non-UTF-8 Chinese filename in
Content-Disposition (e.g., GBK-encoded) without an RFC 5987 filename*
parameter, Chromium's DecodeWord() cannot decode it because it has no
charset hint and falls back to SysNativeMBToWide(). On macOS this assumes
UTF-8 input, producing garbled output.

Fix by extracting the charset from the HTTP Content-Type response header
(e.g., "text/html; charset=gb2312") and passing it as referrer_charset
to net::GenerateFileName(), which uses it via CharsetConverter to properly
decode the filename.
- Remove debug LOG(ERROR) from production code
- Rename test constants for clarity (kTestChineseDomain etc.)
- Add destroyed_ guard, request_context_ member, thread assertions
- Add content verification and factory unregister in test cleanup
- Remove GBK test case that cannot work through CefString API
  (CefString assumes UTF-8 and corrupts raw non-UTF-8 header bytes)
…overage

- ChineseFilenameRawUTF8: IsStringUTF8() is platform-independent; valid
  UTF-8 bytes are auto-detected regardless of system locale
- ChineseFilenameContentTypeCharset: clarify this exercises the charset
  extraction path but would pass even without the fix (UTF-8 auto-detection)
- Document that a true GBK regression guard is blocked by CefString API

@magreenblatt magreenblatt left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

  • ChineseFilenameContentTypeCharset: clarify this exercises the charset
    extraction path but would pass even without the fix (UTF-8 auto-detection)

As you mention, the test coverage in this PR doesn't actually cover the fix. The easiest option is to extend the existing CefTestServer API with a new method that takes all header data a single void* buffer.

  A new method on CefTestServerConnection:
  
  virtual void SendHttpResponseWithRawHeaders(const void* header_data,
                                              size_t header_data_size,
                                              const void* response_data,
                                              size_t response_data_size) = 0;

  header_data is the complete raw HTTP response header block including the status line, e.g.:

  HTTP/1.1 200 OK\r\n
  Content-Type: application/vnd.ms-excel; charset=gb2312\r\n
  Content-Disposition: attachment; filename=<raw GBK bytes>\r\n

  response_data is the response body. Both are void* + size_t pairs, which the CEF translator already handles (same pattern as
  the existing data/data_size parameters on SendHttp200Response and SendHttpResponse).

  The implementation in CefTestServerConnectionImpl uses the existing RawHttpResponse class from
  net/test/embedded_test_server/http_response.h, which takes a headers string and contents string and sends them directly to
  the HttpResponseDelegate via SendRawResponseHeaders() — no CefString conversion anywhere. It follows the same thread-bouncing
  pattern as the existing SendBasicHttpResponse private method.

  This enables the GBK download filename test: the test constructs a std::string containing raw GBK bytes in the
  Content-Disposition header and charset=gb2312 in the Content-Type header, passes it as header_data. The bytes reach
  Chromium's network stack intact, GetCharset() returns gb2312, the fix passes it to net::GenerateFileName(), and DecodeWord
  decodes the GBK filename correctly.

Comment thread libcef/browser/download_manager_delegate_impl.cc
Pass a real charset to net::GenerateFileName() instead of an empty string
so filenames in legacy encodings like GBK decode properly. Use the charset
from the Content-Type header when the server sends one, and otherwise fall
back to the profile's default charset (intl.charset_default), same as
Chrome's own download delegate does.

My original test didn't really test the fix: it sent UTF-8 bytes that get
auto-detected anyway, so it passed with or without the change. Replaced it
with one that sends raw GBK bytes through a real CefTestServer. That needed
a new SendHttpResponseWithRawHeaders() on CefTestServerConnection, because
the existing CefString-based header API mangles non-UTF-8 bytes before they
reach the wire.

Without the fix the filename comes through as "order.xls"; with it you get
the expected Chinese name. Ran it 100x to make sure it isn't flaky.
@linesight linesight requested a review from magreenblatt June 5, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CefDownloadHandler Download Chinese name file garbled

2 participants