
Through other SO questions I've found how to get the headlines, but I don't know where Google's page code stores the links.

I want a two-column data.frame of the headlines and their corresponding links.

library(rvest)
library(tidyverse)


dat <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen") %>%
  html_nodes('.DY5T1d') %>%
  html_text()

dat
  • Google is a bit difficult to scrape. :) All links should be saved in "href" (a quick sketch of pulling the href attributes follows these comments). If you have difficulty, maybe you should use RSelenium; that way you will be able to navigate the web site. – Earl Mascetti Mar 05 '20 at 15:22
  • I found the description reference in the source code, but I still have no idea what the links are stored under. – SCDCE Mar 05 '20 at 15:36
  • Did you try following https://stackoverflow.com/questions/35247033/using-rvest-to-extract-links? – Earl Mascetti Mar 05 '20 at 15:41
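As the first comment suggests, the link URLs live in the anchors' href attributes, which rvest exposes via html_attr(). A minimal sketch of that idea, building on the question's code (the plain "a" selector is just an illustration here, not a class taken from Google's markup):

library(rvest)

# Hedged sketch: list the href attribute of every anchor on the results page
# to see where the article links actually live.
page <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen")

page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  head()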

1 Answer


After a lot of inspecting Google's page code I found what I was looking for. I also came across the descriptions, so I basically re-built the Google News RSS feed.

library(rvest)
library(tidyverse)


news <- function(term) {

  # Read the search results page once and reuse the parsed document below
  html_dat <- read_html(paste0("https://news.google.com/search?q=",term,"&hl=en-US&gl=US&ceid=US%3Aen"))

  # The links sit in the href attribute of the '.VDXfz' anchors; they are
  # relative ("./articles/..."), so expand them to absolute URLs
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>% 
                      html_attr('href')) %>% 
    mutate(Link = gsub("./articles/","https://news.google.com/articles/",Link))

  # Headlines ('.DY5T1d') and descriptions ('.Rai5ob') come from their own nodes
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>% 
      html_text(),
    Link = dat$Link,
    Description = html_dat %>%
      html_nodes('.Rai5ob') %>% 
      html_text()
  )

  return(news_dat)
}

news("coronavirus")
  • Scraping tip: your code above is calling `read_html(url)` twice inside the function. You should read the webpage once and reuse the result. – Dave2e Mar 05 '20 at 17:49
  • The only downside with this is that the classes could change at any time and crash your program :/ – Justin Dalrymple May 21 '20 at 17:27
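
Building on that last comment, one way to make a selector change fail loudly instead of producing a cryptic error is to check that the scraped pieces still line up before combining them. A hedged sketch, reusing html_dat and the class names from the answer (which may already be out of date):

titles <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()
links  <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
descs  <- html_dat %>% html_nodes('.Rai5ob') %>% html_text()

# If Google renames a class, the lengths drift apart (or drop to zero);
# stop with a clear message rather than letting data.frame() error out.
if (length(titles) == 0 || length(titles) != length(links) || length(titles) != length(descs)) {
  stop("Google News markup appears to have changed; update the CSS selectors.")
}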