I'm trying to pull a few pages of reviews from TripAdvisor for an academic project.

Here's my attempt using R:

#Load libraries
library(rvest)
library(RSelenium)

# main url for stadium
urlmainlist=c(
  hampdenpark="http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
)

# Specify how many search pages and counter
morepglist=list(
  hampdenpark=seq(10,360,10)
)
#----------------------------------------------------------------------------------------------------------

# create pickstadium variable
pickstadium="hampdenpark"


# get list of urllinks corresponding to different pages

# url link for first search page
urllinkmain=urlmainlist[pickstadium]
# counter for additional pages
morepg=as.numeric(morepglist[[pickstadium]])

urllinkpre=paste(strsplit(urllinkmain,"Reviews-")[[1]][1],"Reviews",sep="")
urllinkpost=strsplit(urllinkmain,"Reviews-")[[1]][2]

urllink=rep(NA,length(morepg)+1)

urllink[1]=urllinkmain
for(i in 1:length(morepg)){
  urllink[i+1]=paste(urllinkpre,"-or",morepg[i],"-",urllinkpost,sep="")
}
head(urllink)
write.csv(urllink,'urllink.csv')

##########
#SCRAPING#
##########

library(rvest)
library(RSelenium)
#install.packages('RSelenium')

# read the saved URLs back; write.csv() stored the vector in a column named "x"
testurl <- read.csv("urllink.csv", stringsAsFactors = FALSE)
list <- testurl$x

tripadvisor <- NULL

#Scrape
for(i in 1:length(list)){

  reviews <- list[i] %>% 
    read_html() %>% 
    html_nodes("#REVIEWS .innerBubble")

  id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")

  rating <- reviews %>%
    html_node(".rating .rating_s_fill") %>%
    html_attr("alt") %>%
    gsub(" of 5 stars", "", .) %>%
    as.integer()

  date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title") %>%
    strptime("%b %d, %Y") %>%
    as.POSIXct()

  review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text()%>%
    as.character()

  rowthing <- data.frame(id, review,rating, date, stringsAsFactors = FALSE)
  tripadvisor<-rbind(rowthing, tripadvisor)
}

However, this results in an empty tripadvisor data frame. Any help fixing this would be appreciated.
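One way to narrow this down might be to count how many nodes the outer selector matches on a page before extracting anything. The sketch below uses a small made-up HTML string (not TripAdvisor's real markup) and a hypothetical helper, count_matches(), to show that a selector matching nothing leaves every extracted column at length zero, so rbind() never adds a row.

```r
library(rvest)

# Hypothetical markup for illustration only; this is NOT TripAdvisor's real HTML.
page <- read_html('<html><body>
  <div id="REVIEWS">
    <div class="innerBubble"><div class="quote"><a id="rn1">Great</a></div></div>
    <div class="innerBubble"><div class="quote"><a id="rn2">Okay</a></div></div>
  </div>
</body></html>')

# Hypothetical helper: how many nodes does a CSS selector match on a parsed page?
count_matches <- function(doc, css) length(html_nodes(doc, css))

n_hit  <- count_matches(page, "#REVIEWS .innerBubble")   # present in the sample
n_miss <- count_matches(page, "#REVIEWS .reviewBubble")  # absent from the sample

# With no matching nodes, the extracted vector has length zero,
# so a data frame built from it has zero rows.
ids_miss <- page %>%
  html_nodes("#REVIEWS .reviewBubble") %>%
  html_node(".quote a") %>%
  html_attr("id")
```

If the same check on a real page returns zero for "#REVIEWS .innerBubble", the selector (or the HTML that read_html() actually received) would be the first thing to inspect.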

Additional Question

I'd like to capture the full reviews; my code currently captures only the partial entries. For each review, I'd like to automatically click the 'More' link and then extract the full text.

Here too, any help would be greatly appreciated.
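A rough sketch of the kind of thing I have in mind, assuming RSelenium drives a real browser: load the page, click every element that matches a 'More' selector, then hand the expanded page source to rvest. The function name get_expanded_page() and the ".moreLink" selector are hypothetical; the real selector would have to be looked up in the live page.

```r
library(rvest)

# Load `url` in a Selenium-driven browser, click every element matching
# `more_css`, and return the expanded page parsed for rvest.
# `remDr` is an RSelenium remoteDriver. `more_css` is an assumption:
# it must be replaced with whatever selector the live page uses.
get_expanded_page <- function(remDr, url, more_css, wait = 1) {
  remDr$navigate(url)
  Sys.sleep(wait)  # crude pause so scripts on the page can finish loading
  links <- remDr$findElements(using = "css selector", value = more_css)
  for (link in links) {
    # a link may be hidden or stale after an earlier click; skip it if so
    try(link$clickElement(), silent = TRUE)
  }
  Sys.sleep(wait)  # give the expanded text time to appear
  read_html(remDr$getPageSource()[[1]])
}

# Usage sketch (needs a local browser and driver, so it is left commented out):
# library(RSelenium)
# rD    <- rsDriver(browser = "firefox")
# remDr <- rD$client
# page  <- get_expanded_page(remDr, urllink[1], more_css = ".moreLink")  # hypothetical selector
# full  <- page %>% html_nodes("#REVIEWS .entry") %>% html_text()
# remDr$close(); rD$server$stop()
```

The fixed Sys.sleep() pauses are a simplification; polling until the expected elements appear would likely be more robust.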

Varun