I am trying to scrape a page on a website that requires a login and am consistently getting a 403 error.

I adapted the code from these two posts for my site: "Using rvest or httr to log in to non-standard forms on a webpage" and "How to reuse a session to avoid repeated login when scraping with rvest?"

library(rvest)
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1")
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session

When the code is run, I get this message:

Submitting with 'NULL'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode,  :
  Forbidden (HTTP 403).

I have also run the code with a custom user agent, as R.S. suggested in the comments, but I receive the same error as above.

library(rvest)
library(httr)
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session

If you pull the page up without logging in, it shows a small part of the data table at the bottom, right below the text "Earnings Events Available: 65".

Once logged in, the page shows all 65 events and the full table, which is what I want to download. I have all the code necessary for that in place but am stuck on the login part.

Thank you for your help.

mks212
  • Shouldn't `submit_form(pgsession, pgform)` be `submit_form(pgsession, filled_form)`? – Chirayu Chamoli Oct 25 '16 at 10:29
  • Have you tried setting or altering the user agent? Edit: you definitely need to call submit_form with filled_form, as @Chirayu says – R.S. Oct 25 '16 at 11:25
  • @ChirayuChamoli, I have fixed the error you pointed out and updated the error message I receive. Thanks for pointing out my first bug. – mks212 Oct 25 '16 at 14:31
  • @R.S., yes I did per your suggestion using methods described in this post, http://stackoverflow.com/questions/31406503/whats-my-user-agent-when-i-parse-website-with-rvest-package-in-r – mks212 Oct 25 '16 at 14:52
  • I wonder if it might be because of hidden fields in that form, though I am not sure. BTW, have you tried Selenium (through [RSelenium](https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html))? I find it quite dependable where user interaction is involved. – R.S. Oct 25 '16 at 15:20
  • I am working on it with RSelenium as you suggested but stumbled onto a different issue, http://stackoverflow.com/questions/40251904/log-in-to-website-using-rselenium-phantomjs-in-r-multiple-instances-of-class – mks212 Oct 26 '16 at 01:26

2 Answers

Using R.S.'s suggestion, I used RSelenium to log in successfully.

A quick note for fellow Mac users on using either Chrome or PhantomJS: I am running El Capitan and had trouble getting the Mac to recognize the paths to both binaries. Moving the binaries to /usr/local/bin fixed that, and they ran without an issue.

Below is the code to do so:

library(RSelenium)
RSelenium::startServer() # start the Selenium server
remDr <- remoteDriver(browserName = "chrome")
remDr$open()

# Go to the login page and fill in the credentials
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
# key='enter' presses Enter after typing the password, which submits the form
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

# Now logged in: open the page to scrape
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)

This can also be done with PhantomJS:

library(RSelenium)

pJS <- phantom() # start phantomjs

appURL <- 'https://www.optionslam.com/accounts/login/'
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)
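
Once the browser is on the logged-in page, the table still has to be pulled out of it. Here is a minimal sketch of one way to do that, assuming the data sits in ordinary <table> elements (I have not checked this against the site's markup). tables_from_source is a hypothetical helper name, and the sample HTML below is made up only to show the shape of the result.

```r
library(rvest)

# Hypothetical helper: parse a page-source string and return
# every <table> it contains as a data frame.
tables_from_source <- function(page_source) {
  page <- read_html(page_source)
  html_table(html_nodes(page, "table"))
}

# Made-up sample markup standing in for the real page source.
sample_source <- "<html><body><table>
  <tr><th>Date</th><th>Move</th></tr>
  <tr><td>placeholder-date</td><td>placeholder-move</td></tr>
</table></body></html>"

tables <- tables_from_source(sample_source)
```

With the live session from above, the page source comes from the driver: tables_from_source(remDr$getPageSource()[[1]]).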
mks212

Here is how to solve the problem in the original use case with rvest:

   library(rvest)
   library(httr)
   uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"

   pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))

   pgform <- html_form(pgsession)[[1]]

   filled_form <- set_values(pgform,
                             username = 'un',
                             password = 'ps')

   s <- submit_form(pgsession, filled_form, submit = NULL, config(referer = pgsession$url)) # s is your logged in session

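The request requires knowledge of the page you've come from (the Referer header; the misspelling of "referrer" is the standard HTTP header name), so the referer is passed explicitly when submitting the form: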

config(referer = pgsession$url)
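
For readers on rvest 1.0 or later, where html_session(), set_values() and submit_form() were renamed to session(), html_form_set() and session_submit(), the same steps might look like the sketch below. It is untested against the site. login_with_referer is a hypothetical wrapper name, and the field names username and password are taken from the code above.

```r
library(rvest)
library(httr)

# Hypothetical wrapper around the steps above, using the rvest >= 1.0 names.
# Returns the session produced by submitting the login form.
login_with_referer <- function(url, username, password, ua) {
  # Open a session with a browser-like user agent
  pgsession <- session(url, user_agent(ua))

  # Assumes the login form is the first form on the page, as in the code above
  pgform <- html_form(pgsession)[[1]]
  filled_form <- html_form_set(pgform, username = username, password = password)

  # Send the Referer header along with the form submission
  session_submit(pgsession, filled_form, submit = NULL,
                 config(referer = pgsession$url))
}
```

It would be called as s <- login_with_referer("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", "un", "ps", uastring), after which s plays the same role as in the original code.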
Ross Ireland