
I am trying to scrape news articles like this one: https://www.lefigaro.fr/vox/societe/luc-ferry-une-convention-climat-pluraliste-20191016 . I have a paid subscription, and I am trying to log in to the page with my own credentials so I can scrape the full content of the article. However, even though I manage to fill in the generic login form, R only retrieves the free paragraphs. When I log in via Chrome, I do see the full text.

library(rvest)

# Address of the generic login page
login <- "https://connect.lefigaro.fr/login?client=horizon_web&redirect_uri=https://www.lefigaro.fr/"

# Create a web session with my credentials
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, email = "*******", password = "*******")
submit_form(pgsession, filled_form) # this seems to work

url <- "https://www.lefigaro.fr/vox/societe/luc-ferry-une-convention-climat-pluraliste-20191016"
p <- read_html(url)
title <- p %>% html_nodes(".fig-headline--premium") %>% html_text(trim = TRUE) # title
time <- p %>% html_nodes("time") %>% html_text(trim = TRUE) # date
time <- time[[1]]
body <- toString(p %>% html_nodes(".fig-paragraph") %>% html_text(trim = TRUE))
body # I do not get the full text that I see in my browser as a subscriber

  • It looks like you created a proper session. Now, instead of `read_html(url)`, try the `jump_to(url)` function. I believe that when you call `read_html()` you are resetting your "browser" connection and losing the login information created in the session. – Dave2e Apr 02 '21 at 13:03
  • Thanks @Dave2e ! When I try p – scarlett rouge Apr 02 '21 at 15:56
  • Sorry, you need to pass the session information. Try this `jump_to(pgsession, url)` and if this does work, then try: `resp – Dave2e Apr 02 '21 at 16:16
  • @Dave2e this works! thanks so much! – scarlett rouge Apr 04 '21 at 11:05
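
Putting the comments together: the fix is to fetch the article through the logged-in session rather than with a fresh `read_html()` call, so the login cookies travel with the request. A minimal sketch under the question's assumptions (the same credentials placeholders and CSS selectors; note that `html_session()`, `set_values()`, `submit_form()` and `jump_to()` are the pre-1.0 rvest names used in the question):

    library(rvest)

    # Log in once and keep the session object
    login <- "https://connect.lefigaro.fr/login?client=horizon_web&redirect_uri=https://www.lefigaro.fr/"
    pgsession <- html_session(login)
    pgform <- html_form(pgsession)[[1]]
    filled_form <- set_values(pgform, email = "*******", password = "*******")
    pgsession <- submit_form(pgsession, filled_form)

    # Navigate WITHIN the session so the authentication cookies are sent
    url <- "https://www.lefigaro.fr/vox/societe/luc-ferry-une-convention-climat-pluraliste-20191016"
    p <- jump_to(pgsession, url)

    # Same selectors as before, now applied to the authenticated response
    title <- p %>% html_nodes(".fig-headline--premium") %>% html_text(trim = TRUE)
    body  <- toString(p %>% html_nodes(".fig-paragraph") %>% html_text(trim = TRUE))

In rvest 1.0 and later these functions were renamed: the equivalents are `session()`, `html_form_set()`, `session_submit()` and `session_jump_to()`.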

0 Answers