
I developed some code to scrape traffic data based on this topic. I need to scrape many pages after logging in, but right now my code seems to log in to the site again for each URL. How can I 'reuse' the session to avoid repeated logins so that, hopefully, the code runs faster? Here's the pseudo-code:

library(rvest)

generateURL <- function(siteID){return(siteURL)}

scrapeContent <- function(siteURL, session, filled_form){return(content)}

mainPageURL <- 'http://pems.dot.ca.gov/'
pgsession <- html_session(mainPageURL)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, username = 'myUserName', password = 'myPW')

siteIDList <- c(1, 2, 3)
vectorOfContent <- vector(mode = 'list', length = 3) # to store all the content

i <- 1
for (siteID in siteIDList) {
    url <- generateURL(siteID)
    content <- scrapeContent(url, pgsession, filled_form) # this seems to log in again on every call
    vectorOfContent[[i]] <- content
    i <- i + 1
}

I read the rvest documentation but couldn't find details on this. My question: how can I 'reuse' the session to avoid logging in repeatedly? Thanks!


1 Answer


You can do something like this:

library(rvest)

pgsession <- html_session(mainPageURL)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, username = 'myUserName', password = 'myPW')
s <- submit_form(pgsession, filled_form) # s is your logged-in session

vectorOfContent <- vector(mode = 'list', length = 3)

for (siteID in siteIDList) {
  url <- generateURL(siteID)
  # jump_to() navigates within the existing session (no new login),
  # and read_html() parses the returned page
  vectorOfContent[[siteID]] <- s %>% jump_to(url) %>% read_html()
}
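
If you then want to pull specific values out of each parsed page, you can post-process vectorOfContent with the usual rvest selectors. A minimal sketch, where '.traffic-table td' is a made-up selector standing in for whatever the real page uses:

extractValues <- function(page) {
  # html_nodes()/html_text() pull the matching cells out of a parsed page;
  # replace the selector with one that matches the actual PeMS markup
  page %>% html_nodes('.traffic-table td') %>% html_text(trim = TRUE)
}

allValues <- lapply(vectorOfContent, extractValues)

As a side note, newer versions of rvest (1.0+) renamed these session helpers: html_session() became session(), jump_to() became session_jump_to(), submit_form() became session_submit(), and set_values() was replaced by html_form_set(). The pattern is the same either way: log in once, then navigate within that one session.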