
I'm extracting user comments from a range of websites (like reddit.com), and YouTube is another rich source of information for me. My existing scraper is written in R:

library(RCurl)
library(XML)

# x is the URL of the page to scrape
html = getURL(x)
doc  = htmlParse(html, asText = TRUE)
txt  = xpathSApply(doc,
   "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
   xmlValue)

This doesn't work on YouTube data; if you look at the source of a YouTube video (like this one, for example), you'll find that the comments do not appear in the source.
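A rough way to check this from R is to fetch the raw, static HTML and search it for a marker of rendered comments. This is only a sketch: the example video URL is arbitrary, and the `comment-text-content` class name is an assumption about YouTube's markup (it could also show up inside inline scripts or templates).

library(RCurl)

# Fetch the static page source (no JavaScript executed) and look for a
# class name that rendered comments would carry; FALSE suggests the
# comments are injected after page load.
html <- getURL("https://www.youtube.com/watch?v=qRC4Vk6kisY")
grepl("comment-text-content", html, fixed = TRUE)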

Does anyone have any suggestions on how to extract data in such circumstances?

Many thanks!

– IVR
  • They are probably being downloaded via JavaScript after the page loads. You can use the Chrome Developer Tools to look for requests that fetch the comments from a different URL, or use a package like `RSelenium`, which can interact with a browser to execute the JavaScript on a page. – MrFlick Aug 10 '14 at 01:45
  • You should be using YouTube's API for this: it will give you much more consistent results, and it will warn you when it is going to be changed. You can read about it at https://developers.google.com/youtube/articles/changes_to_comments or http://stackoverflow.com/questions/19965856/how-to-get-all-comments-on-a-youtube-video – waternova Aug 10 '14 at 03:51
  • Thanks a lot, guys! Following waternova's links, I've found that the following URL (where VID is the video ID) gives me what I want: `https://gdata.youtube.com/feeds/api/videos/VID/comments?orderby=published` Cheers! – IVR Aug 10 '14 at 08:22
  • @de1pher, Feel free to answer your own question (and accept it) so that it doesn't remain in the unanswered queue. – A5C1D2H2I1M1N2O1R2T1 Dec 17 '14 at 03:59
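For reference, a minimal sketch of the feed approach from IVR's comment above. The gdata v2 endpoint has since been deprecated, and the Atom node names and namespace below are assumptions about the feed layout of that era, not something confirmed in this thread.

library(RCurl)
library(XML)

vid  <- "qRC4Vk6kisY"  # example video ID used in the answer below
feed <- getURL(sprintf(
  "https://gdata.youtube.com/feeds/api/videos/%s/comments?orderby=published", vid))
feed_doc <- xmlParse(feed, asText = TRUE)

# The feed is Atom: one <entry> per comment, with <author><name> and <content>
ns       <- c(a = "http://www.w3.org/2005/Atom")
authors  <- xpathSApply(feed_doc, "//a:entry/a:author/a:name", xmlValue, namespaces = ns)
comments <- xpathSApply(feed_doc, "//a:entry/a:content", xmlValue, namespaces = ns)

head(data.frame(author = authors, text = comments))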

2 Answers


Following this answer: R: rvest: scraping a dynamic ecommerce page

You can do the following:

devtools::install_github("ropensci/RSelenium") # install the GitHub version

library(RSelenium)
library(rvest)
pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # pjs_cmd needed as I am using Windows
Sys.sleep(5) # give the binary a moment to start
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]] # [1] "YouTube"

# scroll down
for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

# Get page source and parse it via rvest
page_source <- remDr$getPageSource()
author <- html(page_source[[1]]) %>% html_nodes(".user-name") %>% html_text()
text <- html(page_source[[1]]) %>% html_nodes(".comment-text-content") %>% html_text()

# combine the data into a data.frame
dat <- data.frame(author = author, text = text)

Result:
> head(dat)
              author                                                                                       text
1 Kikyo bunny simpie Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2   Tatjana Celinska                                                                                     Ciao 0
3      Yvette Austin                                                                     GET OUT OF MY HEAD!!!!
4           Susan II                                                                             Watch narhwals
5        Greg Ginger               who in the entire fandom never watched this, should be ashamed,\n\nPFFFTT!!!
6        Arnav Sinha                                                                 LOL what the hell is this?

Comment 1: You do need the GitHub version; see rselenium | get youtube page source

Comment 2: This code gives you the initial 44 comments. Some comments have a "show all answers" link that you would have to click. Also, to see even more comments, you have to click the "Show more" button at the bottom of the page. Clicking is explained in this excellent RSelenium tutorial: http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
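For completeness, a hedged sketch of that clicking step, reusing the `remDr` session from the answer above. The `.load-more-button` CSS selector is an assumption about YouTube's markup at the time and may need to be adjusted via the browser's developer tools.

# Click the "Show more" button a few times to load additional comment batches.
load_more <- function(remDr, times = 3) {
  for (i in seq_len(times)) {
    # the selector below is assumed, not taken from the answer
    btn <- try(remDr$findElement(using = "css selector", ".load-more-button"),
               silent = TRUE)
    if (inherits(btn, "try-error")) break  # no (more) button found
    btn$clickElement()
    Sys.sleep(3)                           # give the next batch time to load
  }
}

load_more(remDr)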

– Rentrop

You mentioned that the YouTube comments do not appear in the HTML source code of a YouTube page. However, I used the developer tools built into Chrome, and I was able to see the HTML markup that makes up the comments. I also tried loading the page with scripting blocked, and the comments were still there.

Assuming your parser can see the comments, the following XPath should allow you to select the content of the comments.

"//body//div[@class='comment-text-content']/text()"

Alternatively, if you want to select all the information inside the comment block, such as the user's name, you can use this expression:

"//body/div[@class='comments']//div[@class='comment-item']"
– Dirk7589
  • Thanks for your response @Dirk7589. I'm afraid that approach doesn't seem to work; I get NULL in response. Upon closer inspection, I've found something weird going on: `xpathSApply(doc,'//*/span[@class="yt-spinner-message"]')` returns the following result: ` [[1]] Loading... `, so much of the structure isn't available -- it's still loading? – IVR Apr 18 '15 at 07:48