
Through other SO questions I've found how to get the headlines, but I don't know where Google's page code stores the links.

I want a two-column data.frame of the headlines and their corresponding links.

library(rvest)
library(tidyverse)


dat <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen") %>%
  html_nodes('.DY5T1d') %>%
  html_text()

dat
  • Google is a bit difficult to scrape. :) All links should be saved in "href" (a quick sketch of pulling the href attributes follows these comments). If you have difficulty, maybe you should use RSelenium; that way you will be able to navigate the web site. – Earl Mascetti Mar 05 '20 at 15:22
  • I found the description reference in the source code, but I still have no idea what the links are stored under. – SCDCE Mar 05 '20 at 15:36
  • Did you try following https://stackoverflow.com/questions/35247033/using-rvest-to-extract-links? – Earl Mascetti Mar 05 '20 at 15:41
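As the first comment suggests, the link URLs live in the anchors' href attributes, which rvest exposes via html_attr(). A minimal sketch of that idea, building on the question's code (the plain "a" selector is just an illustration here, not a class taken from Google's markup):

library(rvest)

# Hedged sketch: list the href attribute of every anchor on the results page
# to see where the article links actually live.
page <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen")

page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  head()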

1 Answer


After a lot of inspecting Google's page code I found what I was looking for. I also came across the descriptions, so I basically re-built the Google News RSS feed.

library(rvest)
library(tidyverse)


news <- function(term) {

  # Read the search results page once and reuse the parsed document below
  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US%3Aen"))

  # The links sit in the href attribute of the '.VDXfz' anchors; they are
  # relative ("./articles/..."), so expand them to absolute URLs
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>% 
                      html_attr('href')) %>% 
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

  # Headlines ('.DY5T1d') and descriptions ('.Rai5ob') come from their own nodes
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>% 
      html_text(),
    Link = dat$Link,
    Description = html_dat %>%
      html_nodes('.Rai5ob') %>% 
      html_text()
  )

  return(news_dat)
}

news("coronavirus")
  • Scraping tip: your code above is calling `read_html(url)` twice inside the function. You should read the webpage once and reuse the result. – Dave2e Mar 05 '20 at 17:49
  • The only downside with this is that the classes could change at any time and crash your program :/ – Justin Dalrymple May 21 '20 at 17:27
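
Building on that last comment, one way to make a selector change fail loudly instead of producing a cryptic error is to check that the scraped pieces still line up before combining them. A hedged sketch, reusing html_dat and the class names from the answer (which may already be out of date):

titles <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()
links  <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
descs  <- html_dat %>% html_nodes('.Rai5ob') %>% html_text()

# If Google renames a class, the lengths drift apart (or drop to zero);
# stop with a clear message rather than letting data.frame() error out.
if (length(titles) == 0 || length(titles) != length(links) || length(titles) != length(descs)) {
  stop("Google News markup appears to have changed; update the CSS selectors.")
}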