
I have a somewhat complex task that requires looking up a series of URLs which are contained in a dataframe, scraping some data from each URL, and then adding this data back into the original dataframe. Somehow I seem to have solved the most difficult part of this (the scraping part), but I'm having problems with how to automate the task (which I suspect is probably quite straightforward).

Here's the situation: I have a data.frame comprising 12 variables and 44,000 rows. One of these variables, Programme_Synopsis_url, contains the URL to a programme on the BBC iPlayer.

I need to go to that URL, extract one piece of data (details of the channel), and then add it to a new column called Channel.

Here is some sample data (I apologise for the size / complexity of this sample, but I think it's necessary to share this in order to get the right solution):

df <- structure(list(Title = structure(c(3L, 7L, 5L, 2L, 6L, 6L, 1L, 
4L, 9L, 8L), .Label = c("Asian Provocateur", "Cuckoo", "Dragons' Den", 
"In The Flesh", "Keeping Faith", "Lost Boys? What's Going Wrong For Asian Men", 
"One Hot Summer", "Travels in Trumpland with Ed Balls", "Two Pints of Lager and a Packet of Crisps"
), class = "factor"), Series = structure(c(1L, 1L, 1L, 3L, 1L, 
1L, 2L, 2L, 1L, 1L), .Label = c("", "Series 1-2", "Series 4"), class = "factor"), 
    Programme_Synopsis = structure(c(2L, 5L, 4L, 6L, 1L, 1L, 
    8L, 7L, 9L, 3L), .Label = c("", "1. The Dragons are back - with big money on the table.", 
    "1/3 Proud. Meeting rednecks", "1/8 Faith questions everything when her husband goes missing", 
    "4/6 What Happens in Ibiza... Is Megan really a party animal?", 
    "Box Set. Dale plans to propose – but what does Ken think?", 
    "Box Set. For the undead... life begins again", "Box Set. Romesh... and mum", 
    "Series 1-9. Box Set"), class = "factor"), Programme_Synopsis_url = structure(c(6L, 
    9L, 4L, 8L, 1L, 1L, 3L, 7L, 2L, 5L), .Label = c("", "https://www.bbc.co.uk/iplayer/episode/b00747zt/two-pints-of-lager-and-a-packet-of-crisps-series-1-1-fags-shags-and-kebabs", 
    "https://www.bbc.co.uk/iplayer/episode/b06fq3x4/asian-provocateur-series-1-1-uncle-thiru", 
    "https://www.bbc.co.uk/iplayer/episode/b09rjsq5/keeping-faith-series-1-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/b0bdpvhf/travels-in-trumpland-with-ed-balls-series-1-1-proud", 
    "https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/p00szzcp/in-the-flesh-series-1-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/p06f52g1/cuckoo-series-4-1-lawyer-of-the-year", 
    "https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza"
    ), class = "factor"), Programme_Duration = structure(c(6L, 
    4L, 6L, 1L, 6L, 6L, 2L, 5L, 3L, 6L), .Label = c("25 mins", 
    "28 mins", "29 mins", "40 mins", "56 mins", "59 mins"), class = "factor"), 
    Programme_Availability = structure(c(4L, 2L, 1L, 6L, 4L, 
    4L, 5L, 6L, 5L, 3L), .Label = c("Available for 1 month", 
    "Available for 11 months", "Available for 17 days", "Available for 28 days", 
    "Available for 3 months", "Available for 5 months"), class = "factor"), 
    Programme_Category = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 
    1L, 1L, 1L, 1L), .Label = c("Box Sets", "Featured", "Most Popular"
    ), class = "factor"), Programme_Genre = structure(c(4L, 2L, 
    3L, 5L, 2L, 2L, 1L, 3L, 1L, 2L), .Label = c("Comedy", "Documentary", 
    "Drama", "Entertainment", "New SeriesComedy"), class = "factor"), 
    date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = "13/08/2018", class = "factor"), rank = c(1L, 
    2L, 3L, 4L, 5L, 12L, 1L, 2L, 3L, 4L), row = c(1L, 1L, 1L, 
    1L, 1L, 3L, 4L, 4L, 4L, 4L), Box_Set = structure(c(1L, 1L, 
    1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

Just to make things even more complicated(!) there are two different types of URL. Some point to the episode page for a programme and some point to the main programme page (the URL syntax doesn't distinguish between the two). The reason this matters is that the data I want to scrape (the name of the channel) sits in a different place depending on whether it is an episode page or the main page of a programme. I have written a script that gets this data for each of these types of page:

### Get Channel for programme page ###
### Load rvest, which provides read_html(), html_nodes(), html_text() and the pipe ###
library(rvest)

### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1'
### Then, locate details of Channel via xpath ###
channel <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>% html_text()

### Confirm Channel details ###
print(channel)


### Get Channel for episode page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza'
### Then, locate details of Channel via xpath ###
channel <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>% html_text()

### Confirm Channel details ###
print(channel)

The question is, how do I automate this, and loop through every URL (some 44,000), extract this data, and then add it to a new column called Channel?

A few final concerns / caveats / questions:

  1. Will looking up and scraping data from 44,000 URLs cause any technical problems? I don't want to overload the BBC's servers or get my IP blocked for doing this! I have checked the terms and conditions of their website(s) and there is no mention of scraping that I can find (a quick robots.txt check is sketched after this list).
  2. It might help to point out that although there are around 44,000 rows (of URLs) to check, many of them are duplicates. As such, I wonder if it would be better to start by creating a new dataframe with the duplicates removed (e.g. based on the Programme_Synopsis_url or Title columns). That would mean scraping a much smaller number of URLs, after which I could merge the data back into the original dataframe, i.e. if the Title matches, add the value from the Channel column of the streamlined dataframe to a Channel column in the original dataframe.
  3. I imagine I'm going to have to use some kind of loop with an if/else statement, i.e. IF the page contains a certain xpath then copy that data into the Channel column for that row, ELSE copy the data from the other xpath into the Channel column for that row. IF a page contains neither xpath (which could be possible) then do nothing.
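
For concern 1 above, here is a minimal, untested sketch of the kind of robots.txt check I have in mind (assuming the relevant rules sit at the domain root):

### Fetch the site's robots.txt and list any rules that mention iPlayer paths ###
robots <- readLines("https://www.bbc.co.uk/robots.txt")
grep("iplayer", robots, ignore.case = TRUE, value = TRUE)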

Hope that's all clear. Happy to elaborate if necessary.

EDIT: Updated one of the URLs in the code above which was incorrect.

Japes

1 Answer


You can achieve this with the following approach:

  1. Create a function for your scraping part.
  2. Within this function, try the first XPath; if the result is empty, try the second XPath.
  3. Use any form of loop to repeat this task for all URLs. (I used purrr::map_chr, but any loop would do.)
library(rvest)

get_channel <- function(url) {
  ## some elements do not contain any url
  if (!nchar(url)) return(NA_character_)
  page <- url %>%
    read_html()
  ## try to read channel
  channel <- page %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
    html_text()
  ## if it's empty we are most likely on an episode page -> try the other xpath
  if (!length(channel)) {
    channel <- page %>%
      html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
      html_text()
  }
  ifelse(length(channel), channel, NA_character_)
}

## loop through all urls in the df

purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)
# [1] "BBC Two"   "BBC Three" "BBC Three" "BBC Three" NA          NA          "BBC Three" "BBC Three" "BBC Three" "BBC Two" 
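
To attach the result as the new Channel column (untested but straightforward), you can assign the vector straight back into the data frame:

df$Channel <- purrr::map_chr(as.character(df$Programme_Synopsis_url), get_channel)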

To your other questions:

  1. It could be that the BBC tries to prevent you from scraping their pages. There are some tricks to get around this, like adding delays between consecutive requests. Sometimes websites look at the user agent, and you need to change it every n requests so that the website does not block you. There are several ways in which websites try to protect themselves from web scraping, and what you need to do depends on the case (a minimal sketch of a polite, throttled request follows the code below). Having said that, I do not believe that 44k requests come even close to killing their service, but I am not an expert here.
  2. It definitely makes sense to avoid requesting the duplicated URLs, and this can easily be achieved by the following [untested]:

    new_df <- df[!duplicated(df$Programme_Synopsis_url), ]
    new_df$channel <- purrr::map_chr(as.character(new_df$Programme_Synopsis_url), 
                                     get_channel)
    dplyr::left_join(df, 
                     new_df[, c("Programme_Synopsis_url", "channel")], 
                     by = "Programme_Synopsis_url")
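
For completeness, here is a minimal, untested sketch of what "being polite" could look like: pausing between requests and sending an explicit user agent. The one-second delay and the user-agent string are arbitrary assumptions, not anything the BBC requires:

    library(httr)  # GET(), user_agent(), http_error(), content()

    polite_read_html <- function(url) {
      Sys.sleep(1)                                         # throttle consecutive requests
      resp <- GET(url, user_agent("my-research-scraper"))  # identify your scraper
      if (http_error(resp)) return(NULL)                   # e.g. HTTP 404
      xml2::read_html(content(resp, as = "text", encoding = "UTF-8"))
    }

You could then swap read_html() for polite_read_html() inside get_channel() and return NA_character_ whenever it gives back NULL.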
    
thothal
  • Thanks for the suggestions - almost working perfectly. When I remove duplicates there are only 1500 unique URLs that need to be scraped, so this should make the task much easier. However, when I run the function, I'm getting the following error: `Error in open.connection(x, "rb") : HTTP error 404. ` Any thoughts? – Japes Sep 04 '19 at 12:39
  • Looks like `tryCatch` might be the solution? https://stackoverflow.com/questions/38114066/using-trycatch-and-rvest-to-deal-with-404-and-other-crawling-errors – Japes Sep 04 '19 at 12:45
  • Error 404 means that the URL could not be found [cf. Wiki](https://en.wikipedia.org/wiki/HTTP_404). You could add a `tryCatch` around your `read_html` to safeguard against this error – thothal Sep 04 '19 at 12:46
  • Thanks, as I suspected, but can't quite figure out where to put this within the function? I'm fairly new to R in case it wasn't obvious!... – Japes Sep 04 '19 at 12:52
  • Easiest is to use `purrr::possibly`. So try `purrr::map_chr(as.character(df$Programme_Synopsis_url), purrr::possibly(get_channel, NA_character_))` – thothal Sep 04 '19 at 12:57
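
Following up on the comments: a minimal, untested sketch of how the `tryCatch` could be wired into the function, so that a 404 (or any other request error) simply yields `NA` instead of stopping the loop:

    get_channel_safe <- function(url) {
      if (!nchar(url)) return(NA_character_)
      ## return NA instead of failing when the URL cannot be read (e.g. HTTP 404)
      page <- tryCatch(read_html(url), error = function(e) NULL)
      if (is.null(page)) return(NA_character_)
      channel <- page %>%
        html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
        html_text()
      if (!length(channel)) {
        channel <- page %>%
          html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
          html_text()
      }
      ifelse(length(channel), channel, NA_character_)
    }

`purrr::possibly(get_channel, NA_character_)`, as suggested in the last comment, achieves the same effect with less code.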