
I have written a function that iterates through URLs and scrapes the data I need from each page.

library(xml2)
library(rvest)

The below creates a vector of the relevant URLs:

tripadvisor_urls <- c()

for (n in seq(0, 80, 10)) {
    url <- paste('https://www.tripadvisor.co.uk/Attraction_Review-g186306-d9756771-Reviews-or', n,
                 '-Suckerpunch_St_Albans-St_Albans_Hertfordshire_England.html', sep = "")
    tripadvisor_urls <- c(tripadvisor_urls, url)
}
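
(For comparison, I believe the same vector can be built in a single call, since `paste0()` is vectorised over its arguments; this sketch should produce identical URLs to the loop above:)

# paste0() recycles the fixed URL pieces against the vector of offsets,
# so no loop or repeated appending is needed.
offsets <- seq(0, 80, 10)
tripadvisor_urls <- paste0(
    'https://www.tripadvisor.co.uk/Attraction_Review-g186306-d9756771-Reviews-or',
    offsets,
    '-Suckerpunch_St_Albans-St_Albans_Hertfordshire_England.html'
)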

And this is the function I wrote:

all_pages <- function(x) {
    id_v <- c()
    rating_v <- c()
    headline_quote_v <- c()
    date_v <- c()
    review_v <- c()
    for (url in x) {
        reviews <- url %>%
          read_html() %>%
          html_nodes("#REVIEWS .innerBubble")

        id <- reviews %>%
          html_node(".quote a") %>%
          html_attr("id")
        id_v <- c(id_v, id)

        headline_quote <- reviews %>%
          html_node(".quote span") %>%
          html_text()
        headline_quote_v <- c(headline_quote_v, headline_quote)

        rating_wrong <- url %>%
          read_html() %>%
          html_nodes("#REVIEWS .ui_bubble_rating") %>%
          as.character() %>%
          substr(38,39) %>%
          as.numeric()
        rating <- rating_wrong/10
        rating_v <- c(rating_v, rating)

        date <- reviews %>%
          html_node(".rating .ratingDate") %>%
          html_attr("title") %>%
          as.Date('%d %B %Y')
        date_v <- c(date_v, date)

        review <- reviews %>%
          html_node(".entry .partial_entry") %>%
          html_text()
        review_v <- c(review_v, review)
    }
    tripadvisor <<- data.frame(id_v, headline_quote_v, rating_v, date_v, review_v)
}

all_pages(tripadvisor_urls)

When I look at the generated data frame, I see that there are duplicates:

duplicated(tripadvisor)

What have I done wrong? I would imagine it has something to do with constantly appending new elements to my vectors. What's the best way around this?

NOTE: I have requested permission from TripAdvisor so I am not violating their terms of service.

BadAtCoding
  • Constantly appending is pretty terrible for performance and style, but it shouldn't be creating duplicates. And global assignment is pretty terrible style. Instead of writing a function that takes a vector of URLs as input, write a function `get_one_page` that takes a single URL and returns a `list` or `data.frame` with all the components you want. Then you create the final product with `all_pages = lapply(x, get_one_page)` and you can combine the results (a rough sketch of this appears after these comments). – Gregor Thomas May 10 '18 at 15:45
  • You might want to have a read of The R Inferno, where "growing" objects (appending in loops like you do) is one of the circles of R Hell. I'd also recommend reading my answer at [How to make a list of data frames](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207) for some guidance on using lists. – Gregor Thomas May 10 '18 at 15:46
  • As for the duplicates, maybe they're real or maybe you have a bug. I'd check first if they are real, and only if they're not, try to track down the bug. – Gregor Thomas May 10 '18 at 15:47
  • Thanks @Gregor, I will read. Just for confirmation, you suggest `all_pages = lapply(x, get_one_page)` - in this case, is the `x` the list of URLs to lapply through? – BadAtCoding May 10 '18 at 15:51
  • Right, `x` is the *character vector* of URLs. When you start using `list` objects, you should start being careful about your language: it makes me cringe when I see variables with "list" in their name that turn out to be atomic vectors, not `list`s. – Gregor Thomas May 10 '18 at 15:53
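
Following Gregor's suggestion in the comments, a rough sketch of the `get_one_page` / `lapply()` approach could look like the following. The selectors are copied from the question; combining the per-page data frames with `do.call(rbind, ...)` is one possible choice, and the rating column is left out here (see the answer below for why the `.ui_bubble_rating` selector misbehaves):

# Scrape one page per call and return a data frame; no growing vectors,
# no global assignment.
get_one_page <- function(url) {
    page    <- read_html(url)
    reviews <- html_nodes(page, "#REVIEWS .innerBubble")

    data.frame(
        id             = html_attr(html_node(reviews, ".quote a"), "id"),
        headline_quote = html_text(html_node(reviews, ".quote span")),
        date           = as.Date(html_attr(html_node(reviews, ".rating .ratingDate"), "title"),
                                 "%d %B %Y"),
        review         = html_text(html_node(reviews, ".entry .partial_entry")),
        stringsAsFactors = FALSE
    )
}

# lapply() returns a list of data frames, one per URL; rbind them into one.
pages       <- lapply(tripadvisor_urls, get_one_page)
tripadvisor <- do.call(rbind, pages)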

1 Answer


It seems to be occurring because you read each page a second time with a different selector: `#REVIEWS .ui_bubble_rating` matches twice as many nodes as `#REVIEWS .innerBubble`, so `rating_v` ends up with 164 entries while the other vectors have 82, and `data.frame()` recycles the shorter vectors to match, which duplicates every row. If you remove the code that creates `rating_v` (and subsequent references to it), you get 82 rows in the `tripadvisor` data frame.
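
You can see the recycling rule that produces the duplicated rows with a toy example (nothing TripAdvisor-specific, just how `data.frame()` treats columns of unequal length):

# With one column exactly twice as long as the other, data.frame() silently
# recycles the shorter column, so each of its values appears twice.
data.frame(id = c("a", "b"), rating = c(50, 40, 30, 20))
#   id rating
# 1  a     50
# 2  b     40
# 3  a     30
# 4  b     20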

If you run this on your object you get:

which(tripadvisor$id_v %in% "rn576426120")
#[1]  1 83

If you follow my suggestion, `which()` will return only the 1. You can confirm this theory, and see where the extra length comes from, by inserting this debugging line after the loop, just before the data frame is built:

lapply(list(id_v, headline_quote_v, date_v, rating_v, review_v), function(x) print(length(x)))

The `all_pages()` call now produces:

> all_pages(tripadvisor_urls)
[1] 82
[1] 82
[1] 82
[1] 164
[1] 82
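
As a side note (not part of the original debugging line), base R's `lengths()` reports the same information in one shot; with the question's vectors it should show `rating_v` at 164 and everything else at 82:

# Named lengths of all five collector vectors in one call.
lengths(list(id_v = id_v, headline_quote_v = headline_quote_v, date_v = date_v,
             rating_v = rating_v, review_v = review_v))
# id_v 82, headline_quote_v 82, date_v 82, rating_v 164, review_v 82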
IRTFM