I agree that the problem is most probably related to encoding.
For instance, on the nasa.gov website the problem seems to appear only on topic pages about American-Russian space collaboration, which suggests that it is caused by Cyrillic characters in the page content.
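One way to check the encoding hypothesis on text you already have is to test it for non-ASCII characters. The following is only a minimal sketch in base R: the sample string and the helper names `has_non_ascii` and `strip_non_ascii` are made up for illustration, and it does not fix RSelenium itself.

```r
# Hypothetical sample: Latin text followed by three Cyrillic letters,
# written with Unicode escapes so the script itself stays ASCII.
page_text <- "Expedition 14: \u041c\u0438\u0440"

# TRUE if the string contains any code point outside the ASCII range.
has_non_ascii <- function(x) any(utf8ToInt(enc2utf8(x)) > 127L)

# Drop every byte that cannot be represented in ASCII.
strip_non_ascii <- function(x) {
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = "")
}

validUTF8(page_text)        # is the text well-formed UTF-8? (R >= 3.3.0)
has_non_ascii(page_text)    # does it contain non-ASCII characters?
strip_non_ascii(page_text)  # ASCII-only version of the text
```

If the pages that fail are the only ones for which `has_non_ascii()` returns TRUE, that would support the encoding explanation.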
I solved the problem by using the deprecated Relenium package where RSelenium fails. To make Relenium run smoothly on Ubuntu 16.04, I had to install Firefox 25.0 and configure it to prevent any updates. The other issue during setup was installing rJava properly, which can fail when the environment variables that point to the Java libraries are not set.
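Before installing rJava it can help to check from R whether the Java-related environment is visible at all. This is a rough sketch, not the exact procedure I followed: it assumes `JAVA_HOME` is the variable in question, and the hint about `R CMD javareconf` is a commonly suggested step rather than something taken from my setup.

```r
# Read JAVA_HOME; NA means the variable is not set in this R session.
java_home <- Sys.getenv("JAVA_HOME", unset = NA)

if (is.na(java_home) || !nzchar(java_home)) {
  # Assumption: reconfiguring R's Java settings is the usual remedy.
  message("JAVA_HOME is not set; consider running `sudo R CMD javareconf` in a shell.")
} else {
  message("JAVA_HOME = ", java_home)
}

# TRUE if rJava is already installed, FALSE otherwise.
rjava_available <- requireNamespace("rJava", quietly = TRUE)
```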
System configuration is as follows:
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
relenium_0.3.0; seleniumJars_2.41.0; rJava_0.9-8; RSelenium_1.3.5
Below is an example of a page that can be scraped with Relenium but not with the release version of RSelenium:
link = "http://www.nasa.gov/mission_pages/station/expeditions/expedition14/index.html"
The RSelenium solution fails (with either Firefox 34.0.5 or 25.0, it makes no difference):
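library(RSelenium)
startServer()                         # start the Selenium server
remDr <- remoteDriver()               # client that talks to the server
remDr$open()                          # open a browser session
remDr$navigate(link)
doc <- unlist(remDr$getPageSource())  # page source as a character string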
Result: "Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input"
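If the call sits inside a longer script, the error can be caught instead of aborting the run. The sketch below uses a hypothetical helper, `safe_page_source`, and two stub functions that stand in for a real browser call, so it runs without a Selenium server.

```r
# Hypothetical helper: run `fetch` (a zero-argument function that returns
# the page source) and return NA instead of stopping when it fails.
safe_page_source <- function(fetch) {
  tryCatch(
    unlist(fetch()),
    error = function(e) {
      message("Fetching the page source failed: ", conditionMessage(e))
      NA_character_
    }
  )
}

# Stubs for illustration only; they imitate a failing and a working call.
failing_fetch <- function() stop("invalid JSON input")
working_fetch <- function() list("<html></html>")

safe_page_source(failing_fetch)
safe_page_source(working_fetch)
```

With a live session the helper would be called as `safe_page_source(function() remDr$getPageSource())`, and an NA result marks the pages that need the Relenium fallback.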
Relenium, by contrast, handles the page without problems:
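library(relenium)
library(xml2)                                    # provides read_html(); rvest re-exports it
relenium_browser <- firefoxClass$new()
relenium_browser$get(link)
doc <- unlist(relenium_browser$getPageSource())  # page source as a character string
doc <- read_html(doc)                            # parse it into an XML document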