I'm trying to use the rvest package to scrape data from a web page. In simplified form, the HTML looks like this:

<div class="style">
   <input id="a" value="123">
   <input id="b">
</div>

I want to get the value 123 from the first input. I tried the following R code:

library(rvest)
url<-"xxx"
page<-read_html(url)   # parse the page first; html_nodes() needs a document, not a URL string
output<-html_nodes(page, ".style input")

This will return a list of input tags:

[[1]]
<input id="a" value="123">
[[2]]
<input id="b">

Next I tried using html_node to reference the first input tag by id:

html_node(output, "#a")

Here it returned a list of nulls instead of the input tag I want.

[[1]]
NULL
[[2]]
NULL

My question is, how can I reference the input tag using its id?

Vegebird

3 Answers

You can use xpath:

require(rvest)
text <- '<div class="style">
   <input id="a" value="123">
   <input id="b">
</div>'

h <- read_html(text)

h %>% 
  html_nodes(xpath = '//*[@id="a"]') %>%
  html_attr("value")

The easiest way to get CSS and XPath selectors is to use http://selectorgadget.com/. For a specific attribute like yours, you can use Chrome's developer tools to copy the element's XPath.

andschar
Rentrop

This will work just fine with straight CSS selectors:

library(rvest)

doc <- '<div class="style">
   <input id="a" value="123">
   <input id="b">
</div>'

pg <- read_html(doc)
html_attr(html_nodes(pg, "div > input:first-of-type"), "value")

## [1] "123"
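
The positional selector above depends on the target being the first input. As a minimal sketch, assuming the id stays stable, the id can be combined with the class in one CSS selector instead. The names `by_position` and `by_id` are only illustrative.

```r
library(rvest)

doc <- '<div class="style">
   <input id="a" value="123">
   <input id="b">
</div>'

pg <- read_html(doc)

# Positional: the first <input> that is a direct child of a <div>
by_position <- html_attr(html_nodes(pg, "div > input:first-of-type"), "value")

# By id: should keep working if the inputs are reordered
by_id <- html_attr(html_node(pg, ".style input#a"), "value")

print(by_position)
print(by_id)
```
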
hrbrmstr

Adding an answer because none of the others shows the simple CSS selector shorthand for selecting by id, #your_id_name:

h %>% 
  html_node('#a') %>%
  html_attr('value')

which outputs "123" as desired.

Same setup as the others:

require(rvest)
text <- '<div class="style">
   <input id="a" value="123">
   <input id="b">
</div>'

h <- read_html(text)
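
If you already have a node set like the question's `output`, one possible approach (a sketch, not the only way) is to filter it by its `id` attribute. As far as I can tell, `html_node(output, "#a")` looks for matches inside each input, and inputs have no children, which would explain the NULLs. The names `ids` and `target` are just for illustration.

```r
library(rvest)

text <- '<div class="style">
   <input id="a" value="123">
   <input id="b">
</div>'

h <- read_html(text)

# The node set from the question: both <input> elements
output <- html_nodes(h, ".style input")

# Read the id of every node in the set, then keep only the one we want
ids <- html_attr(output, "id")
target <- output[ids == "a"]

# Extract the value attribute from the remaining node
value <- html_attr(target, "value")
print(value)
```
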
arvi1000