
I am trying to scrape some artists' lyrics from a website in order to make word clouds by artist later. The URLs were generated so I could scrape every lyric from them using purrr's map function. The code runs, but after a while it returns the lyrics of only one artist. What do I need to do to scrape all the lyrics and store them in an object?

Here is the code:

##=----------------------------------------------INSTALL PACKAGES---------------------------------------

#install.packages("tidyverse")

##=----------------------------------------------LIBRARIES----------------------------------------------

library(rvest)
library(stringr)
library(purrr)

##=----------------------------------------------FUNCTIONS----------------------------------------------

# Scrape one lyrics page and return the cleaned text
hash <- function(x) {
  x <- read_html(x) %>%
    html_nodes("cnt-letra p402_premium, p") %>%
    html_text()
  x <- str_remove_all(x, "[:punct:]")
  x <- tolower(x)
  x <- iconv(x, to = "ASCII//TRANSLIT")
  x <- str_remove_all(x, "'")
  x
}

##=----------------------------------------------MAIN CODE----------------------------------------------

url<-"https://www.letras.com/mais-acessadas/reggaeton/"

## scrape the song titles
song <- read_html(url) %>%
  html_nodes("b") %>%
  html_text()

## scrape the artist names
artist <- read_html(url) %>%
  html_nodes("li a span") %>%
  html_text()

# String cleaning: artist names
artist_clean <- str_remove_all(artist, "[:punct:]")
artist_clean <- tolower(artist_clean)
artist_clean <- iconv(artist_clean, to = "ASCII//TRANSLIT")
artist_clean <- str_remove_all(artist_clean, "'")
artist_clean <- gsub(" ", "-", artist_clean)


# String cleaning: song titles
song_clean <- str_remove_all(song, "[:punct:]")
song_clean <- tolower(song_clean)
song_clean <- iconv(song_clean, to = "ASCII//TRANSLIT")
song_clean <- str_remove_all(song_clean, "'")
song_clean <- gsub(" ", "-", song_clean)

home <- "https://letras.com"

## url generation
generated_urls <- paste(home, "/", artist_clean, "/", song_clean, sep = "")
generated_urls <- generated_urls[1:5]

x <- purrr::map(generated_urls, hash)

Ronak Shah
  • When I run the code, I get the output from all of the requested urls. Could this be a weird rate limiting error from the server? – pgcudahy Dec 05 '19 at 06:54
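
If rate limiting were the cause, one way to check is to slow the requests down, for example with purrr::slowly(). This is only a sketch, assuming the hash() function and generated_urls defined in the question; the two-second delay is illustrative:

library(purrr)

# Wrap hash() so each call pauses two seconds before requesting the next page
slow_hash <- slowly(hash, rate = rate_delay(2))

x <- map(generated_urls, slow_hash)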

1 Answer

I'm not quite sure why it was repeating the same one, but if you pass the urls as names before running map, it produces the expected output:

generated_urls[1:5] %>%
  set_names() %>% 
  map(hash)

You can then access the lyrics by URL or by index, which may be more useful anyway. Another approach that also works is to put the URLs in a tibble column and use tibble(url = generated_urls) %>% mutate(lyrics = map(url, hash)) or similar.
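
A minimal sketch of that tibble approach, assuming the hash() function and generated_urls from the question (the column names are just illustrative):

library(tibble)
library(dplyr)
library(purrr)

# One row per song; the scraped lyrics are stored in a list-column
lyrics_tbl <- tibble(url = generated_urls) %>%
  mutate(lyrics = map(url, hash))

The list-column keeps each song's lines together, so the table can later be unnested or summarised per artist for the word clouds.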

GenesRus