
The task:

I wanted to scrape all the YouTube comments from a given video.

I successfully adapted the R code from a previous question (Scraping Youtube comments in R).

Here is the code:

library(RCurl)
library(XML)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
html = getURL(x)
doc  = htmlParse(html, asText=TRUE) 
txt  = xpathSApply(doc,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue)

To use it, simply replace the video ID (i.e. "4H9pTgQY_mo") with the ID you require.

The problem:

The problem is that it doesn't return all the comments. In fact, it always returns a vector with 283 elements, regardless of how many comments are in the video.

Can anyone please shed light on what is going wrong here? It is incredibly frustrating. Thank you.

timothyjgraham
  • @hrbrmstr please don't link to deprecated APIs. That link is really not helpful to the OP. – JAL Apr 17 '15 at 14:17
  • @JAL Thanks. This is precisely the problem in regards to the API approach. – timothyjgraham Apr 17 '15 at 14:32
  • I cannot see what link was posted above, but how about this one? https://developers.google.com/youtube/v3/docs/commentThreads/list – tonytonov Apr 20 '15 at 08:43
  • This is not an answer but it sounds like you're bumping into a "page size" limit and you need to crawl through the many "pages" of results yourself. Here is a [related question](http://stackoverflow.com/questions/16227540/need-help-to-get-more-than-100-results-using-youtube-search-api) that might get you started. – jennybryan Apr 24 '15 at 03:49
  • @jennybryan Thanks for this suggestion, however that question relates to querying videos rather than comments. – timothyjgraham Apr 25 '15 at 00:16
  • The part that I believe to be relevant to you is in one of the answers: "The correct way to page through a feed is to make the first request for the feed without a start-index, and then check to see whether there's a […]" – jennybryan Apr 30 '15 at 23:43

4 Answers


I was (for the most part) able to accomplish this by using the latest version of the YouTube Data API and the R package httr. The basic approach I took was to send multiple GET requests to the appropriate URL and grab the data in batches of 100 (the maximum the API allows), i.e.

base_url <- "https://www.googleapis.com/youtube/v3/commentThreads/"
api_opts <- list(
  part = "snippet",
  maxResults = 100,
  textFormat = "plainText",
  videoId = "4H9pTgQY_mo",  
  key = "my_google_developer_api_key",
  fields = "items,nextPageToken",
  orderBy = "published")

where key is your actual Google Developer key, of course.

The initial batch is retrieved like this:

init_results <- httr::content(httr::GET(base_url, query = api_opts))
##
R> names(init_results)
#[1] "nextPageToken" "items"
R> init_results$nextPageToken
#[1] "Cg0Q-YjT3bmSxQIgACgBEhQIABDI3ZWQkbzEAhjVneqH75u4AhgCIGQ="       
R> class(init_results)
#[1] "list"

The second element - items - is the actual result set from the first batch: it's a list of length 100, since we specified maxResults = 100 in the GET request. The first element - nextPageToken - is what we use to make sure each request returns the appropriate sequence of results. For example, we can get the next 100 results like this:

api_opts$pageToken <- gsub("\\=","",init_results$nextPageToken)
next_results <- httr::content(
    httr::GET(base_url, query = api_opts))
##
R> next_results$nextPageToken
#[1] "ChYQ-YjT3bmSxQIYyN2VkJG8xAIgACgCEhQIABDI3ZWQkbzEAhiSsMv-ivu0AhgCIMgB"

where the pageToken we send in the current request is the nextPageToken returned by the previous request, and we are given a new nextPageToken for obtaining our next batch of results.


This is pretty straightforward, but it would obviously be very tedious to keep changing the value of nextPageToken by hand after each request we send. Instead, I thought this would be a good use case for a simple reference class (via setRefClass):

yt_scraper <- setRefClass(
  "yt_scraper",
  fields = list(
    base_url = "character",
    api_opts = "list",
    nextPageToken = "character",
    data = "list",
    unique_count = "numeric",
    done = "logical",
    core_df = "data.frame"),

  methods = list(
    # Fetch one batch of up to 100 comment threads and append it to `data`
    scrape = function() {
      opts <- api_opts
      if (nextPageToken != "") {
        opts$pageToken <- nextPageToken
      }

      res <- httr::content(
        httr::GET(base_url, query = opts))

      # The final page may come back without a nextPageToken
      nextPageToken <<- ifelse(is.null(res$nextPageToken), "",
                               gsub("\\=", "", res$nextPageToken))
      data <<- c(data, res$items)
      unique_count <<- length(unique(data))
    },

    scrape_all = function() {
      while (TRUE) {
        old_count <- unique_count
        scrape()
        if (unique_count == old_count) {
          done <<- TRUE
          nextPageToken <<- ""
          data <<- unique(data)
          break
        }
      }
    },

    initialize = function() {
      base_url <<- "https://www.googleapis.com/youtube/v3/commentThreads/"
      api_opts <<- list(
        part = "snippet",
        maxResults = 100,
        textFormat = "plainText",
        videoId = "4H9pTgQY_mo",  
        key = "my_google_developer_api_key",
        fields = "items,nextPageToken",
        orderBy = "published")
      nextPageToken <<- ""
      data <<- list()
      unique_count <<- 0
      done <<- FALSE
      core_df <<- data.frame()
    },

    reset = function() {
      data <<- list()
      nextPageToken <<- ""
      unique_count <<- 0
      done <<- FALSE
      core_df <<- data.frame()
    },

    cache_core_data = function() {
      if (nrow(core_df) < unique_count) {
        sub_data <- lapply(data, function(x) {
          data.frame(
            Comment = x$snippet$topLevelComment$snippet$textDisplay,
            User = x$snippet$topLevelComment$snippet$authorDisplayName,
            ReplyCount = x$snippet$totalReplyCount,
            LikeCount = x$snippet$topLevelComment$snippet$likeCount,
            PublishTime = x$snippet$topLevelComment$snippet$publishedAt,
            CommentId = x$snippet$topLevelComment$id,
            stringsAsFactors=FALSE)
        })
        core_df <<- do.call("rbind", sub_data)
      } else {
        message("\n`core_df` is already up to date.\n")
      } 
    }
  )
)

which can be used like this:

rObj <- yt_scraper()
##
R> rObj$data
#list()
R> rObj$unique_count
#[1] 0
##
rObj$scrape_all()
##
R> rObj$unique_count
#[1] 1673
R> length(rObj$data)
#[1] 1673
R> ##
R> rObj$cache_core_data()
R> head(rObj$core_df)
                                                           Comment              User ReplyCount LikeCount              PublishTime
1                    That Andorra player was really Ruud..<U+feff>         Cistrolat          0         6 2015-03-22T14:07:31.213Z
2                          This just in; Karma is a bitch.<U+feff> Swagdalf The Obey          0         1 2015-03-21T20:00:26.044Z
3                                          Legend! Haha B)<U+feff>  martyn baltussen          0         1 2015-01-26T15:33:00.311Z
4 When did Van der sar ran up? He must have run real fast!<U+feff> Witsakorn Poomjan          0         0 2015-01-04T03:33:36.157Z
5                           <U+003c>b<U+003e>LOL<U+003c>/b<U+003e>           F Hanif          5        19 2014-12-30T13:46:44.028Z
6                                          Fucking Legend.<U+feff>        Heisenberg          0        12 2014-12-27T11:59:39.845Z
                            CommentId
1   z123ybioxyqojdgka231tn5zbl20tdcvn
2   z13hilaiftvus1cc1233trvrwzfjg1enm
3 z13fidjhbsvih5hok04cfrkrnla2htjpxfk
4   z12js3zpvm2hipgtf23oytbxqkyhcro12
5 z12egtfq5ojifdapz04ceffqfrregdnrrbk
6 z12fth0gemnwdtlnj22zg3vymlrogthwd04

As I alluded to earlier, this gets you almost everything - 1673 out of about 1790 total comments. For some reason, it does not seem to catch users' nested replies, and I'm not quite sure how to specify this within the API framework.
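
I haven't tried it for this video, but the v3 API also exposes a comments endpoint that accepts a parentId filter, so something along these lines might pull the nested replies for the threads already collected (get_replies is just an illustrative helper, reusing the same API key and the CommentId column of core_df):

replies_url <- "https://www.googleapis.com/youtube/v3/comments"

get_replies <- function(parent_id, api_key) {
  res <- httr::content(
    httr::GET(replies_url, query = list(
      part = "snippet",
      parentId = parent_id,      # id of the top-level comment (CommentId above)
      maxResults = 100,
      textFormat = "plainText",
      key = api_key)))
  # each item is a comment resource; keep just the display text
  vapply(res$items, function(x) x$snippet$textDisplay, character(1))
}

# e.g. only for threads that report at least one reply:
# with_replies <- rObj$core_df$CommentId[rObj$core_df$ReplyCount > 0]
# replies <- lapply(with_replies, get_replies, api_key = "my_google_developer_api_key")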


I had set up a Google Developer account a while back in order to use the Google Analytics API, but if you haven't done that yet, it should be pretty straightforward. Here's an overview - you shouldn't need to set up OAuth or anything like that; just create a project and generate a new Public API access key.

nrussell
  • Thanks, @nrussell! I can confirm this solution works well for the non-nested comments (i.e. most comments on any given video). This is certainly enough to get started. Cheers. – timothyjgraham Apr 26 '15 at 06:20

An alternative to the XML package is the rvest package. Using the URL that you've provided, scraping comments would look like this:

library(rvest)
x <- "https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?orderby=published"
x %>% 
  html %>% 
  html_nodes("content") %>% 
  html_text

Which returns a character vector of the comments:

[1] "That Andorra player was really Ruud.."                                                                  
[2] "This just in; Karma is a bitch."                                                                        
[3] "Legend! Haha B)"                                                                                        
[4] "When did Van der sar ran up? He must have run real fast!"                                               
[5] "What a beast Ruud was!"
...

More information on rvest can be found in the package documentation.

francojc
  • Hi @francojc, thank you for your suggestion, but this does not address the task I wanted to achieve as stated in my question. The code you provided returns 25 comments, but there are approximately 1800 comments on the video in the above example. I would like to return all the comments, not a small subset. Thanks. – timothyjgraham Apr 24 '15 at 00:12

Your issue lies with the maximum number of results returned per request.

Solution Algorithm

First, call the URL https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo?v=2. The response contains the video's comment count; extract that number and use it to drive the iteration:

<gd:comments><gd:feedLink ..... countHint='1797'/></gd:comments>

After that, iterate through the comments feed using these two parameters: https://gdata.youtube.com/feeds/api/videos/4H9pTgQY_mo/comments?max-results=50&start-index=1
As you iterate, increase start-index through 1, 51, 101, 151, and so on (a sketch of this loop follows below). I did test max-results; it is limited to 50.
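
A sketch of that loop, reusing the RCurl/XML approach from the question (untested here, and the gdata v2 feed it relies on has since been retired; the countHint regex assumes the single-quoted attribute shown above):

library(RCurl)
library(XML)

video_id <- "4H9pTgQY_mo"

# 1. Read the total comment count (countHint) from the video's metadata feed
meta_xml   <- getURL(sprintf(
  "https://gdata.youtube.com/feeds/api/videos/%s?v=2", video_id))
n_comments <- as.integer(regmatches(
  meta_xml, regexpr("(?<=countHint=')[0-9]+", meta_xml, perl = TRUE)))

# 2. Walk the comments feed 50 at a time: start-index = 1, 51, 101, ...
comments <- character(0)
for (start in seq(1, n_comments, by = 50)) {
  page <- htmlParse(getURL(sprintf(
    "https://gdata.youtube.com/feeds/api/videos/%s/comments?max-results=50&start-index=%d",
    video_id, start)), asText = TRUE)
  comments <- c(comments, xpathSApply(page, "//content", xmlValue))
}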

movingSone

I tried the "tuber" package in R on different videos, and here are my results. If an author has only posted replies (no top-level comment on the video), what gets scraped depends on how many replies they have: with no more than 5 replies, none of them are scraped, but with more than 5, some of them are. And if an author has posted both top-level comments and replies, more of their comments are scraped than in the previous case.
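
For reference, the kind of tuber call I'm describing looks something like the sketch below ("my_app_id" and "my_app_secret" are placeholders for OAuth credentials from the Google developer console, and the max_results comment reflects my reading of the package docs rather than anything guaranteed):

library(tuber)

yt_oauth("my_app_id", "my_app_secret")   # opens a browser to authenticate

# Top-level comment threads for the video from the question
threads <- get_comment_threads(
  filter      = c(video_id = "4H9pTgQY_mo"),
  max_results = 101,           # per the docs, values above 100 page through all threads
  text_format = "plainText")

head(threads)   # with the default simplify = TRUE this is a data frame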

mr t