
I know there are a lot of questions similar to this, but I haven't found one that asks exactly this (please forgive me if I am wrong). I am trying to scrape a website for weather data, and I was successful at doing so for one of the web pages. However, I would like to loop the process. I have looked at similar questions, but I don't believe they solve my problem.

Only the end of the URL changes slightly, from

  http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=avgt

to

  http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=pcpn

and so on. How could I loop through them even though they aren't increasing numerically?

Code:

library(rvest)
library(dplyr)

nj_weather_data <- read_html("http://climate.rutgers.edu/stateclim_v1/nclimdiv/")
### Get info you want from web page ###
hurr <- html_nodes(nj_weather_data, "#climdiv_table")
### Extract info and turn into dataframe ###
precip_table <- as.data.frame(html_table(hurr)) %>%
  select(-Rank)
  • You could extract (or copy/paste) values from the table with statistics (e.g. `maxt` from `onclick="submitForm('maxt');"`) and construct a link based on that. You can use that link to scrape the table. – Roman Luštrik Aug 29 '18 at 19:24
  • @RomanLuštrik Do you mind providing an example or a link to one? I'm a little confused by what you mean. – NBE Aug 29 '18 at 19:27

1 Answer

Assuming you want average T, minimum T, precipitation, and so on: look at the way the URL changes when you click on one of the statistics in the table above the temperature table. Those clicks go through JavaScript, so to follow them directly you would have to load the page in some sort of (headless) browser such as PhantomJS.

Another way is to just take the name of each statistic's page, append it to the URL, and load the data.

library(rvest)

# notice the %s at the end - this is replaced by elements of cs in sprintf
# statement below
x <- "http://climate.rutgers.edu/stateclim_v1/nclimdiv/index.php?stn=NJ00&elem=%s"
cs <- c("mint", "avgt", "pcpn", "hdd", "cdd")

# you could paste together new url using paste, too
customstat <- sprintf(x, cs) # %s is replaced with mint, avgt...

# prepare empty object for results
out <- vector("list", length(customstat))
names(out) <- cs

# get each table and insert it into the output
for (i in seq_along(customstat)) {
  out[[i]] <- read_html(customstat[i]) %>%
    html_nodes("#climdiv_table") %>%
    html_table() %>%
    .[[1]]
}

> str(out)
List of 5
 $ mint:'data.frame':   131 obs. of  15 variables:
  ..$ Rank  : logi [1:131] NA NA NA NA NA NA ...
  ..$ Year  : chr [1:131] "1895" "1896" "1897" "1898" ...
  ..$ Jan   : chr [1:131] "18.1" "18.6" "18.7" "23.2" ...
  ..$ Feb   : chr [1:131] "11.7" "20.7" "22.5" "22.1" ...

You can now glue the tables together (e.g. using do.call(rbind, out)) or do whatever is required for your analysis.
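The row-binding step can be sketched with toy data. The `stat` column below is an assumption about how you might keep track of which statistic each row came from; it is not part of the original answer, and the toy data frames stand in for the scraped tables.

```r
# Hypothetical sketch: combine a named list of tables (shaped like `out`
# above) into one data frame, tagging each row with the statistic name.
out <- list(
  mint = data.frame(Year = c("1895", "1896"), Jan = c("18.1", "18.6")),
  pcpn = data.frame(Year = c("1895", "1896"), Jan = c("3.58", "2.55"))
)

# prepend a column identifying the statistic, then row-bind everything
tagged <- Map(function(tbl, nm) cbind(stat = nm, tbl), out, names(out))
combined <- do.call(rbind, tagged)
```

If the per-statistic tables all share the same columns (as they do here), `combined` ends up in a long format that is easy to filter by statistic later.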

Roman Luštrik
  • Thanks for your answer! Quick question though: where does the `%s` come from? – NBE Aug 29 '18 at 20:01
  • @KWANGER The URL is something I constructed by adding that. `%s` is a placeholder that gets filled in further down in the code; see `?sprintf` (I also commented the code a bit, hope that helps). – Roman Luštrik Aug 30 '18 at 12:28
  • Ok, that makes sense! I appreciate your response! – NBE Aug 30 '18 at 13:13