
Objects can be saved and read back like so:

# Save as file
saveRDS(iris, "mydata.RDS")

# Read back in 
readRDS("mydata.RDS")

But this doesn't seem to work for objects created with xml2::read_html().

Example

library(rvest)
someobject <- read_html("https://stackoverflow.com/")
saveRDS(someobject, "someobject.RDS")

This creates a file, but reading it back fails:

readRDS("someobject.RDS")
Error in doc_is_html(x$doc) : external pointer is not valid

What's going on and what's the simplest way of saving an html object so that it can be loaded back in with minimal code/fuss?

neilfws
stevec

4 Answers


To answer "what's going on": saveRDS() serializes the object being saved. Here, someobject is a list with elements someobject$doc and someobject$node, both of type externalptr (external pointer), which means they reference a C data structure held in memory by the current R session. Serialization does not save that underlying structure, so the pointers read back from the file no longer point at anything. Hence the error "external pointer is not valid".
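To see this structure directly, here is a minimal sketch that parses a small inline HTML string (an assumption made so the example needs no network access) and inspects the storage type of each element:

```r
library(xml2)

# Parse a small inline document instead of a live URL (assumption for illustration)
doc <- read_html("<html><body><p>Hello</p></body></html>")

# Drop the S3 class to look at the underlying list, then report each element's type
ptr_types <- vapply(unclass(doc), typeof, character(1))
print(ptr_types)
```

Both elements should report externalptr, which is why saving the object itself does not preserve the parsed document.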

You could convert someobject to a character string with as.character() and pass that to saveRDS():

saveRDS(as.character(someobject), "someobject.RDS")

Then recreate the object using readRDS and read_html:

someobject <- read_html(readRDS("someobject.RDS"))

But it's easier to use write_html() as others suggested.
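A minimal sketch of the write_html() route, again using a small inline document and a temporary file (both are assumptions for illustration, not part of the original question):

```r
library(xml2)

# Small inline document so the example runs without network access
doc <- read_html("<html><body><p>Hello</p></body></html>")

# Write the document to disk as HTML text rather than as a serialized R object
path <- tempfile(fileext = ".html")
write_html(doc, path)

# Parse the file again to get a fresh, valid xml_document
restored <- read_html(path)
restored_text <- xml_text(xml_find_first(restored, "//p"))
print(restored_text)
```

Because the file holds plain HTML, it can also be opened in a browser or read by other tools.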

Some discussion in this GitHub issue thread.

neilfws

We can use write_xml() and read_html() from the xml2 package:

before <- xml2::read_html("https://stackoverflow.com/")
xml2::write_xml(before, "someobject1.xml")
after <- xml2::read_html("someobject1.xml")

However, identical() returns FALSE:

identical(before, after)
#[1] FALSE

but querying both of them seems to return the same result:

library(rvest)
before %>%  html_nodes("div")
after %>% html_nodes("div")
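One likely reason identical() returns FALSE is that it compares the external pointers themselves, and two separately parsed documents live at different memory addresses. A minimal sketch, assuming a small inline HTML string in place of the live page, that compares the serialized text instead:

```r
library(xml2)

# The same inline HTML parsed twice (assumption for illustration)
html_string <- "<html><body><div>one</div><div>two</div></body></html>"
a <- read_html(html_string)
b <- read_html(html_string)

# Compares the pointer-holding objects, so separately parsed documents differ
same_object <- identical(a, b)

# Compares the documents' serialized text instead
same_content <- identical(as.character(a), as.character(b))

print(c(same_object = same_object, same_content = same_content))
```

This only shows that identical() is not a useful content check for these objects. A document re-read from a file can still differ from the original in whitespace.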
Ronak Shah

As far as I can tell, the XML and RDS methods are off from the original by the same number of characters. I ran a comparison, and the differences between the original and the reloaded versions appear to be in the body node.

library(rvest)
library(stringr)

url <- "https://stackoverflow.com/"
html <- read_html(url)
# Body text of the original, freshly parsed document
html_node(html, "body") %>% html_text() %>% unlist() -> OBT
nchar(OBT)

28879

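xml2::write_xml(html, "someobject1.xml")
after1 <- xml2::read_html("someobject1.xml")
# Body text of the document reloaded from the XML file
html_node(after1, "body") %>% html_text() %>% unlist() -> BT1
nchar(BT1)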

28893

html %>% toString %>% saveRDS(., "someobject.RDS")
after2 <- readRDS("someobject.RDS") %>% read_html
# Body text of the document reloaded from the RDS file
html_node(after2, "body") %>% html_text() %>% unlist() -> BT2
nchar(BT2)

28893

This shows that the two reloaded objects have the same number of characters, and that both are slightly longer than the original. If we remove all "\n" characters from each text, the counts should be the same.

BT1 %>% str_remove_all(.,"\n") %>% nchar(.)

27733

BT2 %>% str_remove_all(.,"\n") %>% nchar(.) 

27733

OBT %>% str_remove_all(.,"\n") %>% nchar(.) 

27733

SignorCasa
  • Nice investigation. Can you provide the results of the code in the bottom block? – stevec Feb 18 '21 at 03:25
  • I added the results, and I added the URL because the results depend on it. – SignorCasa Feb 18 '21 at 03:37
  • Interesting. Looks like `write_xml()` might add those line breaks. You might be able to do [this](https://stackoverflow.com/a/58898098/5783745) to see if there's a pattern to where the `\n` characters are being inserted? – stevec Feb 18 '21 at 03:39
  • I had a quick look, and it seems to have to do with the layout. For example, at the bottom of the page is a table with four columns: Stack Overflow, Products, Company and Stack Exchange Network. If you click the bottom entry of the fourth column (Others) and then click Technology, every one of the column headers (except the leftmost one) will have an added "\n". – SignorCasa Feb 18 '21 at 04:37

Use toString() to convert the xml_document object to a character string before saving, like so:

library(rvest)
someobject <- read_html("https://stackoverflow.com/")

someobject  %>% toString %>% saveRDS(., "someobject.RDS")
newobject <- readRDS("someobject.RDS") %>% read_html

Note that these objects are not perfectly identical (I am not sure why).

identical(someobject, newobject)
# [1] FALSE
stevec