read_html() doesn't report parsing failure on very very long lines #440

@hadley

library(xml2)

path <- tempfile()

long <- paste0("start", strrep("x", 12e6), "end")
nchar(long)
#> [1] 12000008

cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

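# read_html() returns without any warning or error
html <- read_html(path)
# read_xml() on the same file warns and then fails
xml <- read_xml(path)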
#> Warning in read_xml.character(path): xmlSAX2Characters: huge text nod [2]
#> Error in read_xml.character(path): Extra content at the end of the document [5]

Created on 2024-02-27 with reprex v2.1.0

From tidyverse/rvest#399
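
One way to see whether read_html() silently dropped content is to compare the length of the parsed script text with the known input length. The following is a minimal sketch, not a confirmed fix: it rebuilds the file from the reprex, and it assumes that passing the `HUGE` parser option (listed among xml2's parse options) lifts libxml2's text node size limit for the HTML parser as well. The variable names `script` and `html_huge` are illustrative only.

```r
library(xml2)

# Rebuild the test file from the reprex above
path <- tempfile()
long <- paste0("start", strrep("x", 12e6), "end")
cat(
  "<html><body>\n<script type=\"application/json\">",
  long,
  "</script>\n</body></html>\n",
  file = path,
  sep = ""
)

# Parse with the default options and look up the script node
html <- read_html(path)
script <- xml_find_first(html, "//script")

# If parsing stopped early, the recovered text will not match the input length
nchar(xml_text(script)) == nchar(long)

# Possible workaround (assumption: HUGE relaxes the parser's hard-coded limits)
html_huge <- read_html(
  path,
  options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
)
nchar(xml_text(xml_find_first(html_huge, "//script"))) == nchar(long)
```

If the first comparison is not TRUE while the second is, the default parse lost content without reporting it, which is the behavior this issue describes.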

Labels: bug (an unexpected problem or unintended behavior)
