
I have written a function that iterates through URLs and scrapes the data I need from each page.

library(xml2)
library(rvest)

The below creates a vector of the relevant URLs:

tripadvisor_urls <- c()

for (n in seq(0, 80, 10)) {
    url <- paste('https://www.tripadvisor.co.uk/Attraction_Review-g186306-d9756771-Reviews-or', n,
                 '-Suckerpunch_St_Albans-St_Albans_Hertfordshire_England.html', sep = "")
    tripadvisor_urls <- c(tripadvisor_urls, url)
}
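
(For comparison, I believe the same vector can be built in a single call, since `paste0()` is vectorised over its arguments; this sketch should produce identical URLs to the loop above:)

# paste0() recycles the fixed URL pieces against the vector of offsets,
# so no loop or repeated appending is needed.
offsets <- seq(0, 80, 10)
tripadvisor_urls <- paste0(
    'https://www.tripadvisor.co.uk/Attraction_Review-g186306-d9756771-Reviews-or',
    offsets,
    '-Suckerpunch_St_Albans-St_Albans_Hertfordshire_England.html'
)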

And this is the function I wrote:

all_pages <- function(x) {
    id_v <- c()
    rating_v <- c()
    headline_quote_v <- c()
    date_v <- c()
    review_v <- c()
    for (url in x) {
        reviews <- url %>%
          read_html() %>%
          html_nodes("#REVIEWS .innerBubble")

        id <- reviews %>%
          html_node(".quote a") %>%
          html_attr("id")
        id_v <- c(id_v, id)

        headline_quote <- reviews %>%
          html_node(".quote span") %>%
          html_text()
        headline_quote_v <- c(headline_quote_v, headline_quote)

        rating_wrong <- url %>%
          read_html() %>%
          html_nodes("#REVIEWS .ui_bubble_rating") %>%
          as.character() %>%
          substr(38,39) %>%
          as.numeric()
        rating <- rating_wrong/10
        rating_v <- c(rating_v, rating)

        date <- reviews %>%
          html_node(".rating .ratingDate") %>%
          html_attr("title") %>%
          as.Date('%d %B %Y')
        date_v <- c(date_v, date)

        review <- reviews %>%
          html_node(".entry .partial_entry") %>%
          html_text()
        review_v <- c(review_v, review)
    }
    tripadvisor <<- data.frame(id_v, headline_quote_v, rating_v, date_v, review_v)
}

all_pages(tripadvisor_urls)

When I look at the generated data frame, I see that there are duplicates:

duplicated(tripadvisor)

What have I done wrong? I would imagine it has something to do with constantly appending new elements to my vectors. What's the best way around this?

NOTE: I have requested permission from TripAdvisor so I am not violating their terms of service.

BadAtCoding
  • Constantly appending is pretty terrible for performance and style, but it shouldn't be creating duplicates. And global assignment is pretty terrible style. Instead of writing a function that takes a vector of URLs as input, write a function `get_one_page` that takes a single URL and returns a `list` or `data.frame` with all the components you want. Then you create the final product with `all_pages = lapply(x, get_one_page)` and you can combine the results (a rough sketch of this appears after these comments). – Gregor Thomas May 10 '18 at 15:45
  • You might want to have a read of The R Inferno, where "growing" objects (appending in loops like you do) is one of the circles of R Hell. I'd also recommend reading my answer at [How to make a list of data frames](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207) for some guidance on using lists. – Gregor Thomas May 10 '18 at 15:46
  • As for the duplicates, maybe they're real or maybe you have a bug. I'd check first if they are real, and only if they're not, try to track down the bug. – Gregor Thomas May 10 '18 at 15:47
  • Thanks @Gregor, I will read. Just for confirmation, you suggest `all_pages = lapply(x, get_one_page)` - in this case, is the `x` the list of URLs to lapply through? – BadAtCoding May 10 '18 at 15:51
  • Right, `x` is the *character vector* of URLs. When you start using `list` objects, you should start being careful about your language: it makes me cringe when I see variables with "list" in their name that turn out to be atomic vectors, not `list`s. – Gregor Thomas May 10 '18 at 15:53
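
Following Gregor's suggestion in the comments, a rough sketch of the `get_one_page` / `lapply()` approach could look like the following. The selectors are copied from the question; combining the per-page data frames with `do.call(rbind, ...)` is one possible choice, and the rating column is left out here (see the answer below for why the `.ui_bubble_rating` selector misbehaves):

# Scrape one page per call and return a data frame; no growing vectors,
# no global assignment.
get_one_page <- function(url) {
    page    <- read_html(url)
    reviews <- html_nodes(page, "#REVIEWS .innerBubble")

    data.frame(
        id             = html_attr(html_node(reviews, ".quote a"), "id"),
        headline_quote = html_text(html_node(reviews, ".quote span")),
        date           = as.Date(html_attr(html_node(reviews, ".rating .ratingDate"), "title"),
                                 "%d %B %Y"),
        review         = html_text(html_node(reviews, ".entry .partial_entry")),
        stringsAsFactors = FALSE
    )
}

# lapply() returns a list of data frames, one per URL; rbind them into one.
pages       <- lapply(tripadvisor_urls, get_one_page)
tripadvisor <- do.call(rbind, pages)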

1 Answer


It seems to be occurring because you read each page a second time with a different selector: `#REVIEWS .ui_bubble_rating` matches twice as many nodes as `#REVIEWS .innerBubble`, so `rating_v` ends up with 164 entries while the other vectors have 82, and `data.frame()` recycles the shorter vectors to match, which duplicates every row. If you remove the code that creates `rating_v` (and subsequent references to it), you get 82 rows in the `tripadvisor` data frame.
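
You can see the recycling rule that produces the duplicated rows with a toy example (nothing TripAdvisor-specific, just how `data.frame()` treats columns of unequal length):

# With one column exactly twice as long as the other, data.frame() silently
# recycles the shorter column, so each of its values appears twice.
data.frame(id = c("a", "b"), rating = c(50, 40, 30, 20))
#   id rating
# 1  a     50
# 2  b     40
# 3  a     30
# 4  b     20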

If you run this on your object you get:

which(tripadvisor$id_v %in% "rn576426120")
#[1]  1 83

If you follow my suggestion, `which()` will return only the 1. You can confirm this theory, and see where the extra length comes from, by inserting this debugging line after the loop, just before the data frame is built:

lapply(list(id_v, headline_quote_v, date_v, rating_v, review_v), function(x) print(length(x)))

The `all_pages()` call now produces:

> all_pages(tripadvisor_urls)
[1] 82
[1] 82
[1] 82
[1] 164
[1] 82
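
As a side note (not part of the original debugging line), base R's `lengths()` reports the same information in one shot; with the question's vectors it should show `rating_v` at 164 and everything else at 82:

# Named lengths of all five collector vectors in one call.
lengths(list(id_v = id_v, headline_quote_v = headline_quote_v, date_v = date_v,
             rating_v = rating_v, review_v = review_v))
# id_v 82, headline_quote_v 82, date_v 82, rating_v 164, review_v 82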
IRTFM