
I am trying to scrape news articles like this one: https://www.lefigaro.fr/vox/societe/luc-ferry-une-convention-climat-pluraliste-20191016 . I have a paid subscription, and I am trying to log in to the page with my own credentials so I can scrape the full content of the article. However, even though I manage to fill in the generic login form, R only retrieves the free paragraphs. When I log in via Chrome, I do see the full text.

library(rvest)

# Address of the generic login page
login <- "https://connect.lefigaro.fr/login?client=horizon_web&redirect_uri=https://www.lefigaro.fr/"

# Create a web session with my credentials
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, email = "*******", password = "*******")
submit_form(pgsession, filled_form) # this seems to work

url <- "https://www.lefigaro.fr/vox/societe/luc-ferry-une-convention-climat-pluraliste-20191016"
p <- read_html(url)
title <- p %>% html_nodes(".fig-headline--premium") %>% html_text(trim = TRUE) # title
time <- p %>% html_nodes("time") %>% html_text(trim = TRUE) # date
time <- time[[1]]
body <- toString(p %>% html_nodes(".fig-paragraph") %>% html_text(trim = TRUE))
body # I do not get the full text that I see in my browser as a subscriber

  • It looks like you created a proper session. Now, instead of `read_html(url)`, try the `jump_to(url)` function. I believe that when you call `read_html()` you are resetting your "browser" connection and losing the login information created in the session. – Dave2e Apr 02 '21 at 13:03
  • Thanks @Dave2e ! When I try p – scarlett rouge Apr 02 '21 at 15:56
  • Sorry, you need to pass the session information. Try this `jump_to(pgsession, url)` and if this does work, then try: `resp – Dave2e Apr 02 '21 at 16:16
  • @Dave2e this works! thanks so much! – scarlett rouge Apr 04 '21 at 11:05
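
Putting the comments together: the fix is to fetch the article through the logged-in session rather than with a fresh `read_html()` call, so the login cookies travel with the request. A minimal sketch under the question's assumptions (the same credentials placeholders and CSS selectors; note that `html_session()`, `set_values()`, `submit_form()` and `jump_to()` are the pre-1.0 rvest names used in the question):

    library(rvest)

    # Log in once and keep the session object
    login <- "https://connect.lefigaro.fr/login?client=horizon_web&redirect_uri=https://www.lefigaro.fr/"
    pgsession <- html_session(login)
    pgform <- html_form(pgsession)[[1]]
    filled_form <- set_values(pgform, email = "*******", password = "*******")
    pgsession <- submit_form(pgsession, filled_form)

    # Navigate WITHIN the session so the authentication cookies are sent
    url <- "https://www.lefigaro.fr/vox/societe/luc-ferry-une-convention-climat-pluraliste-20191016"
    p <- jump_to(pgsession, url)

    # Same selectors as before, now applied to the authenticated response
    title <- p %>% html_nodes(".fig-headline--premium") %>% html_text(trim = TRUE)
    body  <- toString(p %>% html_nodes(".fig-paragraph") %>% html_text(trim = TRUE))

In rvest 1.0 and later these functions were renamed: the equivalents are `session()`, `html_form_set()`, `session_submit()` and `session_jump_to()`.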

0 Answers