fix: default read_html() encoding to UTF-8 #476

Merged
jeroen merged 3 commits into main from copilot/fix-read-html-double-encoding
Jun 2, 2026

Conversation

Copilot AI commented Jun 2, 2026

read_html(url) double-encoded UTF-8 characters on Windows (codepage 65001) because htmlReadMemory received NULL as the encoding hint, causing libxml2 to fall back to the system locale — producing garbage like "Ã\u0084pfel" instead of "Äpfel". Regression vs. xml2 1.3.6, introduced via the Rtools libxml2 upgrade.
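
The double-encoding described above can be reproduced in base R without libxml2. The sketch below is illustrative only: it assumes the wrong decode treats each UTF-8 byte as a single Latin-1 character, which matches the "Ã\u0084pfel" output quoted in this PR but is not taken from the libxml2 source. All variable names are hypothetical.

```r
# The intended text; a \u escape makes R store it as UTF-8.
original <- "\u00c4pfel"
utf8_bytes <- charToRaw(enc2utf8(original))   # c3 84 70 66 65 6c

# Latin-1 maps every byte to the code point with the same value, so
# decoding the UTF-8 bytes as Latin-1 is the same as turning each byte
# into its own character.
double_encoded <- intToUtf8(as.integer(utf8_bytes))

# "Ä" (one character, two bytes) has become two characters.
nchar(original)
nchar(double_encoded)
identical(double_encoded, "\u00c3\u0084pfel")

# Reversing the step recovers the original: map each character back to
# a byte and mark the result as UTF-8.
repaired <- rawToChar(as.raw(utf8ToInt(double_encoded)))
Encoding(repaired) <- "UTF-8"
identical(repaired, original)
```

The reversal only works when every character of the damaged string is below U+0100, so it is a diagnostic aid rather than a general repair.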

Changes

  • R/xml_parse.R: Change encoding default from "" to "UTF-8" in read_html(), read_html.default(), and read_html.response(). Documents that declare <meta charset> are unaffected — libxml2's HTML parser still re-parses with the declared encoding.
  • tests/testthat/test-xml_parse.R: Add regression test: round-trips "\u00c4pfel" through charToRaw() → read_html() → xml_text() without explicit encoding.
  • NEWS.md: Changelog entry.

# Before (broken on Windows codepage 65001):
xml_text(xml_find_first(read_html(url), "//div"))
#> [1] "Ã\u0084pfel"   # double-encoded

# After:
xml_text(xml_find_first(read_html(url), "//div"))
#> [1] "Äpfel"          # correct
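
On an installation that predates this change, passing the encoding explicitly should have the same effect as the new default, since the fix only changes the default value of the existing encoding argument. The following is a minimal sketch under that assumption, not the test added in this PR; the HTML fixture and variable names are hypothetical, and it parses raw bytes rather than a URL so that it runs offline.

```r
library(xml2)

# Hypothetical fixture: UTF-8 bytes with no <meta charset> declaration,
# so the parser has nothing but the encoding hint to go on.
html_bytes <- charToRaw("<html><body><div>\u00c4pfel</div></body></html>")

# Supplying the encoding explicitly avoids relying on the default.
doc <- read_html(html_bytes, encoding = "UTF-8")
text <- xml_text(xml_find_first(doc, "//div"))

# TRUE when the text survived the round trip intact.
identical(text, "\u00c4pfel")
```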

Copilot AI changed the title from "[WIP] Fix double encoding issue in read_html function" to "fix: default read_html() encoding to UTF-8" Jun 2, 2026
Copilot finished work on behalf of jeroen June 2, 2026 16:49
Copilot AI requested a review from jeroen June 2, 2026 16:49
jeroen marked this pull request as ready for review June 2, 2026 17:00
jeroen merged commit aff40d3 into main Jun 2, 2026
23 checks passed
jeroen deleted the copilot/fix-read-html-double-encoding branch June 2, 2026 17:00

Development

Successfully merging this pull request may close these issues.

read_html(url) double-encodes UTF-8 on Windows (codepage 65001)

2 participants