3

Why is the page source of youtube.com not scrapeable?

I tried the following, using PhantomJS as well as Chrome with a Selenium server:

library(RSelenium)
pJS <- phantom(pjs_cmd = ...)
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/")
remDr$getTitle()[[1]] # [1] "YouTube"
remDr$getPageSource()

Returns:

Error in fromJSON(content, handler, default.size, depth, allowComments,  : 
  invalid JSON input
Rentrop

2 Answers

3

It's an encoding issue. Use the dev version for now, until the next version is released to CRAN:

devtools::install_github("ropensci/RSelenium")
jdharrison
0

I would agree that the problem is most probably with encoding.

For instance, the same problem seems to appear on the nasa.gov website only on topic pages related to American-Russian space collaboration, which suggests that it is caused by Cyrillic characters in the page content.

I solved the problem by falling back to the deprecated Relenium package where RSelenium fails. To make Relenium run smoothly on Ubuntu 16.04, I had to install Firefox 25.0 and configure it to prevent any updates. The other issue during setup was installing rJava properly, which can fail when the environment variables holding the paths to the Java libraries are missing.
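As a rough diagnostic for the rJava setup problem, the sketch below reports which Java-related environment variables are unset in the current R session. The helper name `java_env_report` is made up for illustration, and the choice of `JAVA_HOME` and `LD_LIBRARY_PATH` is an assumption about which variables matter on a typical Linux setup; your system may need others.

```r
# Hypothetical helper (not part of rJava): report the values of Java-related
# environment variables and flag the ones that are unset.
# The default variable names are an assumption, not taken from the rJava docs.
java_env_report <- function(vars = c("JAVA_HOME", "LD_LIBRARY_PATH")) {
  # unset = NA makes missing variables show up as NA instead of ""
  values <- Sys.getenv(vars, unset = NA, names = TRUE)
  missing <- names(values)[is.na(values)]
  if (length(missing) > 0) {
    message("Unset: ", paste(missing, collapse = ", "))
  }
  values
}

java_env_report()
```

If variables turn out to be missing, running `R CMD javareconf` from a shell (possibly with elevated privileges) is the usual way to let R rediscover the Java installation before reinstalling rJava.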

System configuration is as follows:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

relenium_0.3.0; seleniumJars_2.41.0; rJava_0.9-8; RSelenium_1.3.5 

Below is an example of a page that can be scraped with Relenium but not with the release version of RSelenium:

link = "http://www.nasa.gov/mission_pages/station/expeditions/expedition14/index.html"

The RSelenium solution fails (with either Firefox 34.0.5 or 25.0):

library(RSelenium)
startServer() # start the bundled Selenium server
remDr <- remoteDriver()
remDr$open()
remDr$navigate(link)
doc = unlist(remDr$getPageSource()) # fails here

Result: "Error in fromJSON(content, handler, default.size, depth, allowComments, : invalid JSON input"

Relenium, by contrast, handles it fine:

library(relenium)
library(xml2) # provides read_html()
relenium_browser <- firefoxClass$new()
relenium_browser$get(link)
doc = unlist(relenium_browser$getPageSource())
doc = read_html(doc) # parse the raw source into an XML document
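
The "use Relenium where RSelenium fails" approach could be wrapped in a small fallback helper. This is only a sketch: `first_successful_source` is a hypothetical function, not part of either package, and it assumes each getter is a zero-argument function returning the page source.

```r
# Hypothetical helper: try each page-source getter in turn and return the
# first result that does not raise an error.
# `getters` is a list of zero-argument functions.
first_successful_source <- function(getters) {
  for (get_source in getters) {
    result <- tryCatch(
      unlist(get_source()),
      error = function(e) {
        # Report the failure (e.g. "invalid JSON input") and move on
        message("page source retrieval failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
  }
  stop("all page source getters failed")
}

# Usage with the sessions opened above (left commented out because it needs
# live browser sessions):
# doc <- first_successful_source(list(
#   function() remDr$getPageSource(),
#   function() relenium_browser$getPageSource()
# ))
```

Passing the getters as closures keeps the helper independent of both packages, so either driver can be swapped out without changing it.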
  • There is an issue with Firefox 47 and Selenium. Firefox 47.0.1 has been released, but it does not seem to be available for Ubuntu. You can install Firefox 48, which should work with the current version of RSelenium. See https://www.mozilla.org/en-US/firefox/47.0.1/releasenotes/ – jdharrison Aug 02 '16 at 15:08