
I am trying to download some traffic data from pems.dot.ca.gov, following this topic.

rm(list = ls())
library(rvest)
library(xml2)
library(httr)

url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"

# start a browsing session, fill in the login form, and submit it
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          'username' = 'omitted',
                          'password' = 'omitted')
resp <- submit_form(pgsession, filled_form)
resp_2 <- resp$response  # the underlying httr response object
cont <- resp_2$content   # raw bytes of the response body

I checked the class() of these items and found that resp is a 'session', resp_2 is a 'response', and cont is 'raw'. My question is: how can I extract the HTML content correctly so that I can proceed with XPath to pick out the actual data I want from this page? My intuition is that I should parse resp_2, which is a response, but I just cannot make it work. Your help is highly appreciated!

user3768495
  • Have you looked at Selector Gadget? I find it useful for finding specific parts of a web page that I want to extract. http://selectorgadget.com/ . It works well with `html_nodes` and `html_text` within `rvest` – Warner Jul 31 '16 at 18:32
  • I just looked into the gadget and it seems cool. But my question is not about how to select stuff from HTML; it's about how to convert the response or the raw into HTML. Thanks for your answer anyway! – user3768495 Jul 31 '16 at 18:38
  • It appears the site requires a username and password to get past the opening screen. Your code above uses "omitted", which is not a valid combination. If you can post an example of the actual page which you are interested in, it would be more helpful. – Dave2e Jul 31 '16 at 19:12
  • @Dave2e I used my login credentials in my code. I just didn't show them here in stackoverflow:) sorry about making the code not reproducible. I hope someone can give me hints on how to deal with the response or raw. Thank you! – user3768495 Jul 31 '16 at 22:18

2 Answers


This should do it:

# parse the body of the httr response into an html/xml document
pg <- content(resp$response)

# grab the results table via its CSS selector and convert it to a data frame
html_nodes(pg, "table.inlayTable") %>% 
  html_table() -> tab

head(tab[[1]])
##                 X1      X2           X3           X4
## 1                          Data Quality Data Quality
## 2             Hour 8 Lanes   % Observed  % Estimated
## 3 05/24/2013 00:00   1,311           50            0
## 4 05/24/2013 01:00     729           50            0
## 5 05/24/2013 02:00     399           50            0
## 6 05/24/2013 03:00     487           50            0

(you'll obviously need to modify the column names)
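As a rough sketch of that cleanup (the column names below are my own guesses from the printed output, not anything the page defines):

df <- tab[[1]][-(1:2), ]   # drop the two header rows baked into the table
names(df) <- c("hour", "lanes_8", "pct_observed", "pct_estimated")
df$lanes_8 <- as.numeric(gsub(",", "", df$lanes_8))   # "1,311" -> 1311
head(df)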

hrbrmstr
  • This is exactly what I need! Thank you @hrbrmstr, for answering this question and for getting your own login credentials :) – user3768495 Aug 01 '16 at 06:18
  • How did you know about the 'table.inlayTable' setting? It is really cool! When I Google this phrase, only two results were returned! Amazing that you know about it! – user3768495 Aug 01 '16 at 06:41
  • I guessed you needed the table on that page and that's the CSS selector for it. If you're going to scrape things from the web you need to read up on either CSS selectors or XPath selectors and get familiar with browser "Developer Tools" – hrbrmstr Aug 01 '16 at 11:46

You need httr::content, which parses a response into its content; in this case that content is HTML, which rvest can work with directly:

resp_2 %>% content()
## {xml_document}
## <html style="height: 100%">
## [1] <head>\n  <!-- public -->\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/ ...
## [2] <body class="yui-skin-sam public">\n  <div id="maincontainer" style="height: 100%">\n\n ...
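From the parsed document you can go on with XPath, as the question intended. A minimal sketch, assuming the `inlayTable` class from the other answer still matches the table on your page:

pg <- resp_2 %>% content()

# the same table, addressed by a CSS selector and by an equivalent XPath expression
tbl_css   <- html_nodes(pg, "table.inlayTable")
tbl_xpath <- html_nodes(pg, xpath = "//table[contains(@class, 'inlayTable')]")

html_table(tbl_xpath[[1]])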
alistaire
  • Hi @alistaire, I think that's what I need. Thank you! How do I print the whole section to the console so I can take a close look at it? – user3768495 Aug 01 '16 at 00:02
  • `harvest::html_structure` can give you a quick look at the DOM, if you need. – alistaire Aug 01 '16 at 00:05
  • You can also parse as text and use `cat` to print (since it will be a long single string, the default print method will truncate): `resp_2 %>% content(as = 'text') %>% cat()`. While that's a nice way to see what you have, the default parsed version is better for extracting the parts you want (though you could get back to it by calling `read_html` on the text). – alistaire Aug 01 '16 at 02:23
  • And the first comment above was supposed to say `rvest::html_structure`, obviously. Autocorrect, sorry. – alistaire Aug 01 '16 at 02:26