The read_html function returns an xml_document, which I would like to save and later load again to parse.

The problem is that after loading the xml_document, it no longer contains any HTML.

library(rvest)
library(magrittr)
doc <- read_html("http://www.example.com/")
doc %>% html_node("h1") %>% html_text

I get: [1] "Example Domain"

But when I first save the xml_document object doc and then load it again, it seems that everything has been cleared.

save(doc, file=paste0(getwd(), "/example.RData"))
rm(doc)

load(file=paste0(getwd(), "/example.RData"))
doc %>% html_node("h1") %>% html_text

I get: Error: No matches

Or when I run doc, I get: {xml_document} an empty xml_document.

It is also the case that when I run doc after having loaded it, I get a message that RStudio has stopped working.

I have tried it on two different Windows machines and got the same problem.

sessionInfo()

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5     rvest_0.3.1.9000 xml2_0.1.2      

loaded via a namespace (and not attached):
[1] httr_1.1.0  R6_2.1.2    tools_3.3.0 Rcpp_0.12.5
rafa.pereira
dimitris_ps

3 Answers

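I have found a workaround. It is not very efficient, but it does the job.

The idea is to save the xml_document as a character string and parse it again with read_html after loading. The xml_document object itself only holds external pointers to data managed by libxml2, and those pointers are not preserved by save() and load().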

library(rvest)
library(magrittr)
doc <- read_html("http://www.example.com/")

# convert it to character
doc %<>% as("character")

save(doc, file=paste0(getwd(), "/example.RData"))
rm(doc)

load(file=paste0(getwd(), "/example.RData"))
doc %>% read_html %>% html_node("h1") %>% html_text
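
A possible alternative to the string round trip, sketched here under the assumption that a reasonably recent xml2 is installed (write_html() may not be available in the xml2 0.1.2 shown in the question): write the document to an HTML file and parse that file again. The inline HTML string and the temporary file path are placeholders chosen for illustration so that the sketch runs offline.

```r
library(xml2)
library(rvest)

# Inline HTML stands in for a downloaded page (assumption, keeps the sketch offline)
doc <- read_html("<html><body><h1>Example Domain</h1></body></html>")

# Write the parsed document to disk as HTML text
path <- tempfile(fileext = ".html")
write_html(doc, path)
rm(doc)

# Parse the file again to get a fresh, usable xml_document
doc <- read_html(path)
heading <- html_text(html_node(doc, "h1"))
print(heading)
```

The file on disk is ordinary HTML, so it can also be opened in a browser or read by other tools, which an .RData file does not allow.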
dimitris_ps

I wrote some ad hoc functions to accomplish this task. They improve on the previous answer in two ways: they also work for lists of rvest objects, and they use RDS instead of RData files, so the object can be given any name when it is read back in.

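library(rvest)
library(magrittr)  # %<>%
library(purrr)     # map()
library(readr)     # write_rds(), read_rds()

write_rvest = function(x, path, ...) {
  # convert to character
  # a single xml_document is itself a list internally, so check its class
  # first instead of relying on is.list() alone
  if (inherits(x, "xml_node")) {
    x %<>% as.character
  } else {
    x %<>% map(as.character)
  }

  # save (the file argument is passed by position)
  write_rds(x, path, ...)
}

read_rvest = function(path) {
  # load from file
  x = read_rds(path)

  # parse the stored strings back into documents
  if (is.list(x)) {
    x %<>% map(read_html)
  } else {
    x %<>% read_html
  }

  x
}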

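Tests for equality work but fail for identity. Nevertheless, the restored objects work and have the same size in bytes. Identity most likely fails because each xml_document wraps external pointers to its own copy of the parsed document in memory, and identical() compares those pointers rather than the content they point to.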
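
The difference between comparing the objects and comparing their content can be checked with a small sketch. It assumes xml2 and rvest are installed and uses an inline HTML string as a stand-in for a real page. The variable names are made up for illustration.

```r
library(xml2)
library(rvest)

# Parse a document, then rebuild a second one from its string form
original <- read_html("<html><body><h1>Example Domain</h1></body></html>")
restored <- read_html(as.character(original))

# Compare the pointer-bearing objects directly
same_object <- identical(original, restored)

# Compare what the documents actually contain
same_heading <- identical(
  html_text(html_node(original, "h1")),
  html_text(html_node(restored, "h1"))
)

print(c(same_object = same_object, same_heading = same_heading))
```

Comparing extracted content (or the as.character() output) is therefore a more useful round-trip check than calling identical() on the documents themselves.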

CoderGuy123

Here is the same wrapper function as Deleet's, written in base R code.

library(rvest)

write_rvest = function(x, file, ...) {
  # convert to character
  # a single xml_document is itself a list internally, so check its class first
  if (inherits(x, "xml_node")) {
    x = as.character(x)
  } else {
    x = Map(as.character, x)
  }

  # save
  saveRDS(x, file = file, ...)
}

read_rvest = function(file) {
  # load from file
  x = readRDS(file)

  # parse the stored strings back into documents
  if (is.list(x)) {
    x = Map(read_html, x)
  } else {
    x = read_html(x)
  }

  x
}
andschar