
I developed some code to scrape traffic data based on this topic. I need to scrape many pages after logging in, but right now my code seems to log in to the site again for each URL. How can I 'reuse' the session to avoid repeated logins so that, hopefully, the code runs faster? Here's the pseudo-code:

library(rvest)

generateURL <- function(siteID){return(siteURL)}

scrapeContent <- function(siteURL, session, filled_form){return(content)}

mainPageURL <- 'http://pems.dot.ca.gov/'
pgsession <- html_session(mainPageURL)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, username = 'myUserName', password = 'myPW')

siteIDList <- c(1, 2, 3)
vectorOfContent <- vector(mode = 'list', length = 3) # to store all the content

i <- 1
for (siteID in siteIDList) {
    url <- generateURL(siteID)
    content <- scrapeContent(url, pgsession, filled_form) # this seems to log in again on every call
    vectorOfContent[[i]] <- content
    i <- i + 1
}

I read the rvest documentation but couldn't find details on this. My question: how can I 'reuse' the session to avoid logging in repeatedly? Thanks!


1 Answer


You can do something like this:

library(rvest)

pgsession <- html_session(mainPageURL)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, username = 'myUserName', password = 'myPW')
s <- submit_form(pgsession, filled_form) # s is your logged-in session

vectorOfContent <- vector(mode = 'list', length = 3)

for (siteID in siteIDList) {
  url <- generateURL(siteID)
  # jump_to() navigates within the existing session (no new login),
  # and read_html() parses the returned page
  vectorOfContent[[siteID]] <- s %>% jump_to(url) %>% read_html()
}
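
If you then want to pull specific values out of each parsed page, you can post-process vectorOfContent with the usual rvest selectors. A minimal sketch, where '.traffic-table td' is a made-up selector standing in for whatever the real page uses:

extractValues <- function(page) {
  # html_nodes()/html_text() pull the matching cells out of a parsed page;
  # replace the selector with one that matches the actual PeMS markup
  page %>% html_nodes('.traffic-table td') %>% html_text(trim = TRUE)
}

allValues <- lapply(vectorOfContent, extractValues)

As a side note, newer versions of rvest (1.0+) renamed these session helpers: html_session() became session(), jump_to() became session_jump_to(), submit_form() became session_submit(), and set_values() was replaced by html_form_set(). The pattern is the same either way: log in once, then navigate within that one session.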