R scraping reviews from multiple pages on TripAdvisor

Question

I'm trying to pull a few pages of reviews from TripAdvisor for an academic project.

Here's my attempt using R:

#Load libraries
library(rvest)
library(RSelenium)

# main url for stadium
urlmainlist=c(
  hampdenpark="http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
)

# Specify how many search pages and counter
morepglist=list(
  hampdenpark=seq(10,360,10)
)
#----------------------------------------------------------------------------------------------------------

# create pickstadium variable
pickstadium="hampdenpark"


# get list of urllinks corresponding to different pages

# url link for first search page
urllinkmain=urlmainlist[pickstadium]
# counter for additional pages
morepg=as.numeric(morepglist[[pickstadium]])

urllinkpre=paste(strsplit(urllinkmain,"Reviews-")[[1]][1],"Reviews",sep="")
urllinkpost=strsplit(urllinkmain,"Reviews-")[[1]][2]

urllink=rep(NA,length(morepg)+1)

urllink[1]=urllinkmain
for(i in 1:length(morepg)){
  urllink[i+1]=paste(urllinkpre,"-or",morepg[i],"-",urllinkpost,sep="")
}
head(urllink)
write.csv(urllink,'urllink.csv')
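
The page URLs built by the loop above could also be generated in one vectorised step. This is only a minimal sketch, assuming the extra pages keep the same "-Reviews-orN-" pattern the loop produces; build_review_urls is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: build the paginated URLs without an explicit loop.
# Assumes each extra page inserts "or<offset>-" right after "Reviews-".
build_review_urls <- function(main_url, offsets) {
  main_url <- unname(main_url)  # drop the name carried over from urlmainlist
  paged <- vapply(
    offsets,
    function(off) sub("Reviews-", paste0("Reviews-or", off, "-"), main_url, fixed = TRUE),
    character(1)
  )
  c(main_url, paged)
}

main_url <- "http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
urllink <- build_review_urls(main_url, seq(10, 360, 10))
head(urllink, 3)
```

Keeping the URLs in memory like this would also make the round trip through urllink.csv optional.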

##########
#SCRAPING#
##########

library(rvest)
library(RSelenium)
#install.packages('RSelenium')

# Read the URLs back; write.csv() stored them in column "x" next to a row-name column
testurl <- read.csv("urllink.csv", stringsAsFactors = FALSE)
# Named "urls" rather than "list" to avoid masking base::list()
urls <- testurl$x

tripadvisor <- NULL

#Scrape
for(i in seq_along(urls)){

  reviews <- urls[i] %>% 
    read_html() %>% 
    html_nodes("#REVIEWS .innerBubble")

  id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")

  rating <- reviews %>%
    html_node(".rating .rating_s_fill") %>%
    html_attr("alt") %>%
    gsub(" of 5 stars", "", .) %>%
    as.integer()

  date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title") %>%
    strptime("%b %d, %Y") %>%
    as.POSIXct()

  review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text()%>%
    as.character()

  rowthing <- data.frame(id, review,rating, date, stringsAsFactors = FALSE)
  # Append each page's rows so the results stay in page order
  tripadvisor <- rbind(tripadvisor, rowthing)
}

However, this results in an empty tripadvisor data frame. Any help on fixing this would be appreciated.
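
One way to narrow down an empty result like this might be to count how many nodes each selector matches before building the data frame. The sketch below assumes the rvest package is installed; count_matches is a hypothetical helper, and the inline HTML is a made-up stand-in for a downloaded page, not TripAdvisor's real markup.

```r
library(rvest)

# Hypothetical helper: number of nodes matched by each CSS selector.
count_matches <- function(page, selectors) {
  vapply(selectors, function(sel) length(html_nodes(page, sel)), integer(1))
}

# Made-up markup standing in for a downloaded page (not TripAdvisor's real HTML).
page <- read_html(paste0(
  '<div id="REVIEWS"><div class="innerBubble">',
  '<div class="quote"><a id="rn1">Great stadium</a></div>',
  '<div class="entry"><p class="partial_entry">Nice tour...</p></div>',
  '</div></div>'
))

# Selectors taken from the scraping loop above
counts <- count_matches(page, c("#REVIEWS .innerBubble", ".quote a", ".rating .ratingDate"))
print(counts)
```

A selector that reports 0 on the real page would suggest that the markup differs from what the code expects, or that the content is filled in by JavaScript after the initial download.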

Additional Question

I'd also like to capture the full reviews, since my code currently captures only the partial entries. For each review, I'd like to automatically click on the 'More' link and then extract the full text.
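
A rough sketch of what I have in mind with RSelenium is below. It is untested against the live site: get_full_reviews is a hypothetical helper, more_selector is a placeholder for whatever CSS selector the 'More' link really uses, and "#REVIEWS .entry" is only my guess, based on the selectors above, at where the expanded text ends up.

```r
library(rvest)
library(RSelenium)

# Sketch: expand truncated reviews in a live browser session, then parse the page.
# `more_selector` is a placeholder; the real selector has to be found by inspecting the page.
get_full_reviews <- function(remDr, url, more_selector, wait = 2) {
  remDr$navigate(url)
  Sys.sleep(wait)  # crude pause so the page can finish loading

  # Click every matching 'More' link; ignore links that cannot be clicked
  more_links <- remDr$findElements(using = "css selector", value = more_selector)
  for (link in more_links) {
    try(link$clickElement(), silent = TRUE)
  }
  Sys.sleep(wait)

  # Hand the rendered HTML to rvest (assumed location of the review text)
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes("#REVIEWS .entry") %>%
    html_text(trim = TRUE)
}

# Usage (needs a working Selenium setup, so it is left commented out):
# driver <- rsDriver(browser = "firefox", verbose = FALSE)
# remDr  <- driver$client
# full   <- get_full_reviews(remDr, urllink[1], more_selector = "<selector of the 'More' link>")
# remDr$close()
# driver$server$stop()
```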

Here too, any help would be greatly appreciated.

It seems likely that TripAdvisor has a TOS that would prohibit this. - IRTFM, Nov 15 '19 at 23:35
What a great robots.txt file!!! I usually go to the TOS by following the link at the bottom of a typical webpage, but this is just gold! - IRTFM, Nov 16 '19 at 03:29
