Using rvest in R to scrape a web page, I'd like to extract the equivalent of innerHTML from a node, in particular so that I can change line breaks (<br>) into newlines before applying html_text.

Example of desired functionality:

library(rvest)
doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
innerHTML(doc, ".pp")

This should produce the following output:

[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With rvest 0.2 this can be achieved through toString.XMLNode:

# run under rvest 0.2
library(XML)
html('<html><p class="pp">First Line<br />Second Line</p>') %>% 
  html_node(".pp") %>% 
  toString.XMLNode
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With the newer rvest 0.2.0.900 this no longer works:

# run under rvest 0.2.0.900
library(XML)
html_node(doc,".pp") %>% 
  toString.XMLNode
[1] "{xml_node}\n<p>\n[1] <br/>"

The desired functionality is available in the write_xml function of the xml2 package, on which rvest now depends. Unfortunately, write_xml insists on writing to a file and cannot return its output as a value (a textConnection is not accepted either).

As a workaround I can temporarily write to a file:

# extract innerHTML, workaround: write/read to/from temp file
html_innerHTML <- function(x, css, xpath) {
  file <- tempfile()
  html_node(x,css) %>% write_xml(file)
  txt <- readLines(file, warn=FALSE)
  unlink(file)
  txt
}
html_innerHTML(doc, ".pp") 
[1] "<p class=\"pp\">First Line<br>Second Line</p>"

With this I can then, for example, transform the line-break tags into newline characters:

html_innerHTML(doc, ".pp") %>% 
  gsub("<br\\s*/?\\s*>","\n", .) %>%
  read_html %>%
  html_text
[1] "First Line\nSecond Line"

Is there a better way to do this with existing functions from, e.g., rvest, xml2, XML or other packages? In particular, I'd like to avoid writing to the hard disk.

javrucebo
  • Seems like filing an issue on github might be more productive... – hadley May 08 '15 at 20:40
  • For follow-up, this was added as an issue and [eventually resolved](https://github.com/hadley/rvest/issues/87). The answer is simply to use `as.character`. – r2evans Apr 11 '16 at 00:53

2 Answers

As @r2evans noted, as.character(doc) is the solution.

Regarding your last code snippet, which extracts the <br>-separated text from the node while converting <br> to newlines, there is a workaround in comment #2 of the currently unresolved rvest issue #175.

A simplified version for this problem:

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')

# r2evans's solution:
as.character(rvest::html_node(doc, xpath="//p"))
##[1] "<p class=\"pp\">First Line<br>Second Line</p>"

# rentrop@github's solution, simplified:
innerHTML <- function(x, trim = FALSE, collapse = "\n"){
    paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
}
innerHTML(doc)
## [1] "First Line\nSecond Line"
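
Combining the two ideas also covers the original request without touching the disk. The following is only a sketch: it assumes an rvest version in which as.character works on nodes (per the linked issue), it reuses the regular expression from the question, and the helper name html_text_br is made up for illustration.

```r
library(rvest)

# Hypothetical helper (not part of rvest or xml2): serialise the node in memory
# with as.character(), turn <br> tags into newlines, then re-parse the string
# and extract its text.
html_text_br <- function(x, css) {
  html <- as.character(html_node(x, css))     # in-memory stand-in for write_xml()
  html <- gsub("<br\\s*/?\\s*>", "\n", html)  # same pattern as in the question
  html_text(read_html(html))
}

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
result <- html_text_br(doc, ".pp")
```

On the example document, result should match the "First Line\nSecond Line" that the temp-file version in the question returns.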
akraf

Here is the solution using rvest 0.3.5:

doc <- xml2::read_html('<html><p class="pp">First Line<br />Second Line</p>')

nodes <- rvest::html_nodes(doc, css = '.pp')
# {xml_nodeset (1)}
# [1] <p class="pp">First Line<br>Second Line</p>

rvest::html_text(nodes)
# [1] "First LineSecond Line"
bhakyuz
  • This misses the question, which wasn't to get the node's text, but to be able to get the text with the linebreak accounted for – camille Dec 20 '20 at 15:25
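
For later readers: rvest 1.0.0 and newer ship html_text2(), which is documented to convert <br> into "\n" itself, so the round trip through the node's HTML may not be needed at all. A minimal sketch, assuming such a version is installed:

```r
library(rvest)  # assumption: rvest >= 1.0.0, which provides html_text2() and html_element()

doc <- read_html('<html><p class="pp">First Line<br />Second Line</p>')
node <- html_element(doc, ".pp")  # html_element() is the newer name for html_node()

# html_text2() approximates how a browser lays out text, so <br> becomes a newline
text_with_breaks <- html_text2(node)
```

Note that html_text2() also collapses other whitespace the way a browser would, so it is not a drop-in replacement for html_text() in every case.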