I am new to R and rvest. I am trying to use them to get information from a website (www.medicinescomplete.com) that allows sign-in using the Athens academic login system. In a browser, clicking the Athens login button transfers you to an Athens login form. After you submit the user credentials, the form redirects the browser back to the original site, now logged in.

I used the submit_form() function to submit the credentials to the Athens form, and this returns a 200 code. However, R does not follow the redirect as a browser would, and if I use the jump_to() command to return to the original site it is not logged in. I suspect that the redirect link returned by the sign-in page contains the login credentials I need, but I do not know how to find the link and send it using rvest.

Has anyone worked out how to log in via Athens using rvest, or does anyone have an idea of how to make it follow an automatic redirect?

The code I have used to get this far is (login credentials changed):

library(rvest)
library(magrittr)

url <- "https://www.medicinescomplete.com/about/"
mcsession <- html_session(url)
mcsession <- jump_to(mcsession, "/mc/athens.htm?uri=https%3A%2F%2Fwww.medicinescomplete.com%2Fabout%2F")
athensform <- html_form(mcsession)[[1]]
athensform <- set_values(athensform, ath_uname = "xxx", ath_passwd = "yyy")
submit_form(mcsession, athensform)
jump_to(mcsession, "https://www.medicinescomplete.com/mc/bnf/current/")

I get a 200 code for the submit_form() step but a 403 Forbidden code for the final jump_to() line.

I then piped the submit_form step into html() and printed it. From what I could make out it was a successful login but in the body of the main page there is a line referring to redirecting back to the original site. The html for the whole page is too long to post but the relevant bit seems to be:

<div style="padding: 8px;" id="logindiv">
                        <form method="POST" action="https://www.medicinescomplete.com/mc/athens">
                            Please wait while we transfer you. <br><noscript>JavaScript disabled, please<input type="submit" value="click here" style="border:none;background:none;text-decoration:underline;color:#E27B2F;">

And I wonder if this following bit refers to some login key:

<input type="hidden" name="TARGET" value="https://www.medicinescomplete.com/about/" style="display:none"><input type="hidden" name="RelayState" value="https://www.medicinescomplete.com/about/" style="display:none"><input type="hidden" name="SAMLResponse" value="PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6cHJvdG9jb2wiIHhtbG5zOnNhbWwyPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6YXNzZXJ0aW9uIiBEZXN...

Aha! Further down the page there is this:

<script>
window.onload = function() { document.forms[0].submit(); }
</script>

I think the window is meant to automatically submit another form that performs the post to the original medicinescomplete.com site to authenticate using the hidden field as a login credential. However, on trying to use the submit_form() on this page I don't seem to get any further! I have added the following line to try and work out what is going on:

> submit_form(mcsession, athensform) %>% html_form() %>% str()

And this gives the following output:

Submitting with 'submit'
List of 1
 $ :List of 5
  ..$ name   : chr "<unnamed>"
  ..$ method : chr "POST"
  ..$ url    : chr "https://www.medicinescomplete.com/mc/athens"
  ..$ enctype: chr "form"
  ..$ fields :List of 4
  .. ..$ NULL        :List of 7
  .. .. ..$ name    : NULL
  .. .. ..$ type    : chr "submit"
  .. .. ..$ value   : chr "click here"
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..$ TARGET      :List of 7
  .. .. ..$ name    : chr "TARGET"
  .. .. ..$ type    : chr "hidden"
  .. .. ..$ value   : chr "https://www.medicinescomplete.com/about/"
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..$ RelayState  :List of 7
  .. .. ..$ name    : chr "RelayState"
  .. .. ..$ type    : chr "hidden"
  .. .. ..$ value   : chr "https://www.medicinescomplete.com/about/"
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..$ SAMLResponse:List of 7
  .. .. ..$ name    : chr "SAMLResponse"
  .. .. ..$ type    : chr "hidden"
  .. .. ..$ value   : chr "PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6cHJvdG9jb2wiIHhtbG5zOnNhbWwyPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA"| __truncated__
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..- attr(*, "class")= chr "fields"
  ..- attr(*, "class")= chr "form"
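
As a rough sketch of what the browser's JavaScript does with this form, the hidden fields can be collected and sent as an ordinary form-encoded POST. The HTML string below is a shortened stand-in for the intermediate page (the `SAMLResponse` value is a placeholder), and `post_saml()` is a hypothetical helper that is defined but not called here. Whether the cookies set by that POST carry over to the rvest session is an assumption that would need checking against the live site.

```r
library(xml2)

# Shortened stand-in for the intermediate page; the real SAMLResponse is a
# long base64 string, replaced here by a placeholder.
page_html <- '<html><body>
<form method="POST" action="https://www.medicinescomplete.com/mc/athens">
<input type="submit" value="click here">
<input type="hidden" name="TARGET" value="https://www.medicinescomplete.com/about/">
<input type="hidden" name="RelayState" value="https://www.medicinescomplete.com/about/">
<input type="hidden" name="SAMLResponse" value="PLACEHOLDER">
</form></body></html>'

doc <- read_html(page_html)
form_node <- xml_find_first(doc, ".//form")
hidden <- xml_find_all(form_node, ".//input[@type='hidden']")

# Named list of hidden field values, keyed by each input's name attribute
saml_fields <- as.list(setNames(xml_attr(hidden, "value"),
                                xml_attr(hidden, "name")))
post_url <- xml_attr(form_node, "action")

# Hypothetical helper: replays the POST that the page's JavaScript would make.
# Not called here because it needs a live, freshly issued SAMLResponse.
post_saml <- function(url, fields) {
  httr::POST(url, body = fields, encode = "form")
}
```

With a live session, `post_saml(post_url, saml_fields)` would be the step to try, using the fields parsed from the real response rather than the placeholder.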

I feel like the information in this form should allow me to log in to the original site but I don't quite understand how! Unfortunately when I try the submit_form() function again with this form it doesn't seem to work. I tried this:

submit_form(mcsession, athensform) %>% html_form() %>% submit_form(mcsession, .) %>% html()

And got this:

Submitting with 'submit'
Submitting with ''
Error in if (!(submit %in% names(submits))) { : 
  argument is of length zero
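
A likely cause of this particular error is that html_form() returns a list of forms, even when the page contains only one, while submit_form() expects a single form object. The offline sketch below illustrates the difference on a made-up page (the `example.org` URL and the `token` field are placeholders, not part of the real site). Picking out the first form should at least get past the "argument is of length zero" error, though it does not guarantee that the login then succeeds.

```r
library(rvest)

# Made-up page containing a single form, for illustration only
page <- read_html('<html><body>
<form method="POST" action="https://example.org/login">
<input type="hidden" name="token" value="abc">
<input type="submit" value="click here">
</form></body></html>')

forms <- html_form(page)   # a list of forms, one element per <form> tag
relay_form <- forms[[1]]   # a single form, which is what submit_form() expects

# Applied to the pipeline in the question (untested against the live site):
#   relay_session <- submit_form(mcsession, athensform)
#   relay_form    <- html_form(relay_session)[[1]]
#   submit_form(relay_session, relay_form)
```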
iProcrastinate
  • Some code might help. The question 'how to make a great reproducible example' is well worth a read. – r.bot Apr 01 '15 at 11:59
  • Have you tried using `getURL()` from the `RCurl` package, q.v. [this SO article](http://stackoverflow.com/questions/15168970/log-into-a-website-to-grab-the-data-using-rcurl) – Tim Biegeleisen Apr 01 '15 at 12:03
  • Functions in `rvest` and `httr` follow redirects by default, so I'd be curious what's in the response that would be preventing either from doing so here. As @user2633645 said, we have no code to examine. Another alternative is to script a session in RSelenium to grab the same data. – hrbrmstr Apr 01 '15 at 12:14
  • RSelenium appears to have gotten past it. Thanks @hrbrmstr! – iProcrastinate Apr 01 '15 at 16:13
  • @iProcrastinate I'd be interested in seeing your RSelenium code for this. Could you share it here? – Peter Verbeet Apr 04 '15 at 07:42
  • `library(RSelenium)` `startServer()` `remDr – iProcrastinate Nov 03 '15 at 17:50

1 Answer

It's very likely tied to this issue, which prevents httr from issuing the correct GET query on redirect.

It is a little hard to guess though, because you're missing a reproducible example or the complete verbose output of your query.

A workaround is to prevent the redirect with:

rvest::submit_form(...,
                   httr::config(followlocation = FALSE))
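
Spelled out a little further, the workaround might look like the sketch below. `submit_without_redirect()` is a hypothetical wrapper introduced only for illustration, and it is defined but not called because it needs a live session. One detail worth noting is that the third positional argument of submit_form() is `submit`, so the config has to be passed after an explicitly named `submit = NULL` for it to land in `...`.

```r
library(rvest)
library(httr)

# curl option telling httr not to follow HTTP redirects automatically
no_redirect <- httr::config(followlocation = FALSE)

# Hypothetical wrapper around the call from the answer above.
submit_without_redirect <- function(session, form) {
  # `submit = NULL` is named so that `no_redirect` is forwarded through `...`
  # to the underlying httr request instead of being taken as the submit button.
  rvest::submit_form(session, form, submit = NULL, no_redirect)
}

# Usage with the objects from the question (needs network access):
#   resp <- submit_without_redirect(mcsession, athensform)
```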
Antoine Lizée