I'm looking to extract only the video ID string from a column of YouTube links.

The stringr function I'm currently using is this:

str_extract(data$link, "\\b[^=]+$")

This works for standard YouTube links where the ID sits at the end of the URL, right after an = sign, e.g.

youtube.com/watch?v=kFF0v0FQzEI

However, not all links follow this pattern. Examples:

youtube.com/v/kFF0v0FQzEI
youtube.com/vi/kFF0v0FQzEI
youtu.be/kFF0v0FQzEI
www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1
www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1
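
To make the failure concrete, the current pattern can be run against a few of these forms. This is only a quick check, assuming stringr is installed; the variable names are made up for illustration. The pattern anchors on the last `=` in the string, so it returns the whole URL when there is no `=` at all, and the last query value when extra parameters follow the ID.

```r
library(stringr)

# The pattern from the question: everything after the last "=" in the string
pattern <- "\\b[^=]+$"

# Standard form: the ID is the only thing after an "=", so this works
standard <- str_extract("youtube.com/watch?v=kFF0v0FQzEI", pattern)
# "kFF0v0FQzEI"

# No "=" anywhere, so the match runs from the start of the string
no_equals <- str_extract("youtu.be/kFF0v0FQzEI", pattern)
# "youtu.be/kFF0v0FQzEI"

# Extra parameters after the ID: the match is the last query value instead
trailing_params <- str_extract(
  "www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1",
  pattern
)
# "1"
```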

Could anyone help me with an R regex pattern that extracts the ID (kFF0v0FQzEI in this case) from all of the examples above?

I've seen regex patterns that do this in other languages, but I'm unsure how to convert them to R syntax.
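
For what it's worth, converting a pattern from another language to R is mostly a matter of escaping: an R string literal consumes one level of backslashes, so `\w` in a JavaScript regex literal becomes `"\\w"` in R, and the surrounding `/.../` delimiters are dropped. A minimal sketch in base R follows; the JavaScript-style pattern is hypothetical, written for illustration and not taken from any particular answer.

```r
# Hypothetical JavaScript-style pattern, for illustration only:
#   /[?&]v=([\w-]+)/
# In R the / delimiters are dropped and each backslash is doubled,
# because the string literal consumes one level of escaping.
pattern <- "[?&]v=([\\w-]+)"

url <- "www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1"

# regexec() + regmatches() return the full match followed by each capture group.
# perl = TRUE selects PCRE, whose syntax is closest to JavaScript's.
m <- regmatches(url, regexec(pattern, url, perl = TRUE))[[1]]
id <- m[2]  # first capture group
# "kFF0v0FQzEI"
```

This only covers the `v=` query form; the other URL shapes would need their own alternatives in the pattern.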

Thanks!

Paul Campbell
  • Possible duplicate of [JavaScript REGEX: How do I get the YouTube video id from a URL?](https://stackoverflow.com/questions/3452546/javascript-regex-how-do-i-get-the-youtube-video-id-from-a-url) – Tim Biegeleisen Aug 01 '17 at 15:32
  • I think you can poke around and find a regex on Stack Overflow for this. If you get stuck with the R part of that, then come back with a more focused question. – Tim Biegeleisen Aug 01 '17 at 15:32

1 Answer

You could use something like the following, but note that it's pretty heavily hard-coded to the examples you provided.

links = c("youtube.com/v/kFF0v0FQzEI", 
          "youtube.com/vi/kFF0v0FQzEI", 
          "youtu.be/kFF0v0FQzEI", 
          "www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1", 
          "www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1", 
          "youtube.com/watch?v=kFF0v0FQzEI", 
          "http://www.youtube.com/watch?argv=xyz&v=kFF0v0FQzEI")

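get_id = function(link) {
  if (stringr::str_detect(link, '/watch\\?')) {
    # watch URLs: the ID follows "?v=" or "&v="
    # (the hyphen is included because IDs can contain "-")
    rgx = '(?<=\\?v=|&v=)[\\w-]+'
  } else {
    # path-style URLs: the ID is the last path segment; the lookahead
    # keeps a trailing "/" or "?" out of the match
    rgx = '(?<=/)[\\w-]+(?=/?(?:$|\\?))'
  }
  stringr::str_extract(link, rgx)
}

# get_id() handles one link at a time, so apply it to each element
ids = unname(sapply(links, get_id))
# ids is now "kFF0v0FQzEI" for all seven links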
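
Because `get_id()` branches with `if`, it has to be applied one link at a time. One possible alternative, sketched here as an assumption and not as part of the answer as originally written, is a single pattern with a capture group, which `stringr::str_match()` can apply to the whole vector at once. The name `id_pattern` is made up for illustration, and the pattern is tuned to the same seven examples.

```r
library(stringr)

links <- c("youtube.com/v/kFF0v0FQzEI",
           "youtube.com/vi/kFF0v0FQzEI",
           "youtu.be/kFF0v0FQzEI",
           "www.youtube.com/v/kFF0v0FQzEI?feature=autoshare&version=3&autohide=1&autoplay=1",
           "www.youtube.com/watch?v=kFF0v0FQzEI&list=PLuV2ACKGzAMsG-pem75yNYhBvXZcl-mj_&index=1",
           "youtube.com/watch?v=kFF0v0FQzEI",
           "http://www.youtube.com/watch?argv=xyz&v=kFF0v0FQzEI")

# Hypothetical single pattern: the ID is the run of word characters or hyphens
# that follows "?v=", "&v=", "/v/", "/vi/", or "youtu.be/".
id_pattern <- "(?:[?&]v=|/vi?/|youtu\\.be/)([\\w-]+)"

# str_match() is vectorised; column 1 holds the full match,
# column 2 holds the first capture group (the ID).
ids <- str_match(links, id_pattern)[, 2]
```

For the data frame in the question, the same call would be `data$id <- str_match(data$link, id_pattern)[, 2]`; links that match none of the forms come back as `NA`.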
brittenb