R: rvest extracting innerHTML

Question

Using rvest in R to scrape a web-page, I'd like to extract the equivalent of innerHTML from a node, in particular to change line-breaks into newlines before applying html_text.

Example of desired functionality:

library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")

This should produce the following output:

[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With rvest 0.2 this can be achieved through toString.XMLNode

# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>% 
  html_node(".pp") %>% 
  toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With the newer rvest 0.2.0.900 this does not work anymore.

# run under rvest 0.2.0.900
library(XML)
html_node(doc,".pp") %>% 
  toString.XMLNode
[1] "{xml_node}\n<p>\n[1] <br/>"

The desired functionality is available in the write_xml function of the xml2 package, on which rvest now depends. If only write_xml could return its output as a string instead of insisting on writing to a file (a textConnection is not accepted either).

As a workaround I can temporarily write to a file:

# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css) {
  file <- tempfile()
  on.exit(unlink(file))  # remove the temp file even if an error occurs
  html_node(x, css) %>% xml2::write_xml(file)
  readLines(file, warn = FALSE)
}
html_innerHTML(doc, ".pp") 
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With this I can then, for example, transform the line-break tags into newline characters:

html_innerHTML(doc, ".pp") %>% 
  gsub("<br\\s*/?\\s*>","\n", .) %>%
  read_html %>%
  html_text
[1] "First Line\nSecond Line"

Is there a better way to do this with existing functions from, e.g., rvest, xml2, XML, or other packages? In particular, I'd like to avoid writing to the hard disk.

Seems like filing an issue on github might be more productive... — hadley, May 08 '15 at 20:40
For follow-up, this was added as an issue and [eventually resolved](https://github.com/hadley/rvest/issues/87). The answer is simply to use `as.character`. — r2evans, Apr 11 '16 at 00:53

score 2 · Answer 1 · answered Mar 05 '18 at 15:52

As @r2evans noted, as.character(doc) is the solution.
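A minimal sketch of how as.character can replace the temp-file workaround from the question, assuming an rvest/xml2 version in which as.character on a node returns its serialized HTML. The helper name node_html_text is hypothetical; the regular expression is the one from the question.

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# Hypothetical helper: serialize the node in memory, turn <br> tags into
# newlines, then re-parse the string and extract its text.
# Nothing is written to disk.
node_html_text <- function(x, css) {
  node_html <- as.character(html_node(x, css))
  node_html <- gsub("<br\\s*/?\\s*>", "\n", node_html)
  html_text(read_html(node_html))
}

result <- node_html_text(doc, ".pp")
```

Here result should hold the two lines separated by a newline, the same value the question's file-based pipeline produces.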

Regarding your last code snippet, which extracts the <br>-separated text from the node while converting <br> to a newline, there is a workaround in the currently unresolved rvest issue #175, comment #2:

The simplified version for this problem:

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# r2evans's solution:
as.character(rvest::html_node(doc, xpath="//p"))
## [1] "<p class=\"pp\">First Line<br>Second Line</p>"

# rentrop@github's solution, simplified:
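innerHTML <- function(x, trim = FALSE, collapse = "\n"){
    # collect every text node below x and join the pieces with `collapse`
    text_nodes <- xml2::xml_find_all(x, ".//text()")
    paste(xml2::xml_text(text_nodes, trim = trim), collapse = collapse)
}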
innerHTML(doc)
## [1] "First Line\nSecond Line"
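Called on the whole document, the function above joins every text node on the page. A sketch of a variant scoped to one node, assuming a CSS selector is passed in; the helper name node_lines and the second paragraph in the sample HTML are made up for illustration:

```r
doc <- xml2::read_html(
  '<html><p class="pp">First Line<br />Second Line</p><p class="other">Unrelated</p>'
)

# Hypothetical helper: select one node first, then join only its text nodes.
node_lines <- function(x, css, collapse = "\n") {
  node <- rvest::html_node(x, css)
  text_nodes <- xml2::xml_find_all(node, ".//text()")
  paste(xml2::xml_text(text_nodes), collapse = collapse)
}

scoped <- node_lines(doc, ".pp")
```

Note that this puts the separator at every text-node boundary, not only at <br> tags, so inline markup such as <b> inside the paragraph would also introduce newlines.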

score 0 · Answer 2 · answered Jan 29 '20 at 23:16

Here is the solution using rvest 0.3.5:

doc <- xml2::read_html('<html><p class="pp">First Line<br />Second Line</p>')

nodes <- rvest::html_nodes(doc, css = '.pp')
# {xml_nodeset (1)}
# [1] <p class="pp">First Line<br>Second Line</p>

rvest::html_text(nodes)
# [1] "First LineSecond Line"

bhakyuz

This misses the question, which wasn't to get the node's text, but to be able to get the text with the linebreak accounted for – camille Dec 20 '20 at 15:25
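
For readers on newer versions: as far as I know, rvest 1.0.0 and later provide html_text2(), which is documented as converting <br> to "\n". A minimal sketch, assuming such a version is installed (the function is not available in rvest 0.3.5 as used in the answer above):

```r
library(rvest)

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# html_text2() is assumed to be available (rvest >= 1.0.0).
# Unlike html_text(), it keeps the line break as a newline.
with_breaks <- html_text2(html_node(doc, ".pp"))
```

If that holds for your installation, no regular expression or re-parsing step is needed.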
