
I am trying to download some traffic data from pems.dot.ca.gov, following this topic.

rm(list = ls())
library(rvest)
library(xml2)
library(httr)

url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"

# start a browsing session, fill in the login form, and submit it
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          'username' = 'omitted',
                          'password' = 'omitted')
resp <- submit_form(pgsession, filled_form)
resp_2 <- resp$response  # the underlying httr response object
cont <- resp_2$content   # raw bytes of the response body

I checked the class() of these items and found that resp is a 'session', resp_2 is a 'response', and cont is 'raw'. My question is: how can I extract the HTML content correctly so that I can proceed with XPath to pick out the actual data I want from this page? My intuition is that I should parse resp_2, which is a response, but I just cannot make it work. Your help is highly appreciated!

user3768495
  • Have you looked at Selector Gadget? I find it useful for finding specific parts of a web page that I want to extract. http://selectorgadget.com/ . It works well with `html_nodes` and `html_text` within `rvest` – Warner Jul 31 '16 at 18:32
  • I just looked into the gadget and it seems cool. But my question is not about how to select stuff from HTML; it's about how to convert the response or the raw into HTML. Thanks for your answer anyway! – user3768495 Jul 31 '16 at 18:38
  • It appears the site requires a username and password to get past the opening screen. Your code above uses "omitted", which is not a valid combination. If you can post an example of the actual page which you are interested in, it would be more helpful. – Dave2e Jul 31 '16 at 19:12
  • @Dave2e I used my login credentials in my code. I just didn't show them here in stackoverflow:) sorry about making the code not reproducible. I hope someone can give me hints on how to deal with the response or raw. Thank you! – user3768495 Jul 31 '16 at 22:18

2 Answers


This should do it:

# parse the body of the httr response into an html/xml document
pg <- content(resp$response)

# grab the results table via its CSS selector and convert it to a data frame
html_nodes(pg, "table.inlayTable") %>% 
  html_table() -> tab

head(tab[[1]])
##                 X1      X2           X3           X4
## 1                          Data Quality Data Quality
## 2             Hour 8 Lanes   % Observed  % Estimated
## 3 05/24/2013 00:00   1,311           50            0
## 4 05/24/2013 01:00     729           50            0
## 5 05/24/2013 02:00     399           50            0
## 6 05/24/2013 03:00     487           50            0

(you'll obviously need to modify the column names)
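As a rough sketch of that cleanup (the column names below are my own guesses from the printed output, not anything the page defines):

df <- tab[[1]][-(1:2), ]   # drop the two header rows baked into the table
names(df) <- c("hour", "lanes_8", "pct_observed", "pct_estimated")
df$lanes_8 <- as.numeric(gsub(",", "", df$lanes_8))   # "1,311" -> 1311
head(df)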

hrbrmstr
  • This is exactly what I need! Thank you @hrbrmstr, for answering this question and for getting your own login credentials :) – user3768495 Aug 01 '16 at 06:18
  • How did you know about the 'table.inlayTable' setting? It is really cool! When I Google this phrase, only two results were returned! Amazing that you know about it! – user3768495 Aug 01 '16 at 06:41
  • I guessed you needed the table on that page and that's the CSS selector for it. If you're going to scrape things from the web you need to read up on either CSS selectors or XPath selectors and get familiar with browser "Developer Tools" – hrbrmstr Aug 01 '16 at 11:46

You need httr::content, which parses a response into its content; in this case that content is HTML, which rvest can work with directly:

resp_2 %>% content()
## {xml_document}
## <html style="height: 100%">
## [1] <head>\n  <!-- public -->\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/ ...
## [2] <body class="yui-skin-sam public">\n  <div id="maincontainer" style="height: 100%">\n\n ...
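From the parsed document you can go on with XPath, as the question intended. A minimal sketch, assuming the `inlayTable` class from the other answer still matches the table on your page:

pg <- resp_2 %>% content()

# the same table, addressed by a CSS selector and by an equivalent XPath expression
tbl_css   <- html_nodes(pg, "table.inlayTable")
tbl_xpath <- html_nodes(pg, xpath = "//table[contains(@class, 'inlayTable')]")

html_table(tbl_xpath[[1]])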
alistaire
  • Hi @alistaire, I think that's what I need. Thank you! How do I print the whole section to the console so I can take a close look at it? – user3768495 Aug 01 '16 at 00:02
  • `harvest::html_structure` can give you a quick look at the DOM, if you need. – alistaire Aug 01 '16 at 00:05
  • You can also parse as text and use `cat` to print (since it will be a long single string, the default print method will truncate): `resp_2 %>% content(as = 'text') %>% cat()`. While that's a nice way to see what you have, the default parsed version is better for extracting the parts you want (though you could get back to it by calling `read_html` on the text). – alistaire Aug 01 '16 at 02:23
  • And the first comment above was supposed to say `rvest::html_structure`, obviously. Autocorrect, sorry. – alistaire Aug 01 '16 at 02:26