
I want to scrape the match time and date from this URL:

http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary

Using the Chrome dev tools, I can see that the value appears in the following element:

<td colspan="3" id="utime" class="mstat-date">01:20 AM, October 29, 2014</td>

But this is not in the source HTML.

I think this is because it's Java (correct me if I'm wrong). How can I scrape this information using R?

hrbrmstr
Liam Flynn
  • I haven't looked at the source, but often the answer is [RSelenium](http://cran.r-project.org/web/packages/RSelenium/index.html). – jbaums Oct 29 '14 at 13:24

2 Answers


RSelenium is no longer the only answer. If you can install the PhantomJS binary (available from http://phantomjs.org/), you can use it to render the HTML and then scrape the result with rvest. This is similar to the RSelenium approach but doesn't require Java:

library(rvest)

# render HTML from the site with phantomjs

url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"

writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
    console.log(page.content); //page source
    phantom.exit();
});", url), con="scrape.js")

system("phantomjs scrape.js > scrape.html", intern = TRUE)

# extract the content you need
pg <- read_html("scrape.html")  # read_html() replaces the deprecated html()
pg %>% html_nodes("#utime") %>% html_text()

## [1] "10:20 AM, October 28, 2014"
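
The scraped value is still plain text. As a minimal sketch (not part of the original answer), it could be converted to a date-time with base R. The time zone here is an assumption, since the page does not state one, and the `LC_TIME` setting is only there so that `%p` and `%B` match the English AM/PM and month names:

```r
# Text as returned by html_text() for the #utime node (value from the output above)
raw <- "10:20 AM, October 28, 2014"

# %p (AM/PM) and %B (month name) are locale-dependent, so switch the
# time locale to "C" (English names). Note that this changes session state.
Sys.setlocale("LC_TIME", "C")

# %I = hour on a 12-hour clock, %M = minutes, %d = day, %Y = 4-digit year.
# tz = "UTC" is a placeholder assumption, not something the page provides.
match_time <- as.POSIXct(strptime(raw, format = "%I:%M %p, %B %d, %Y", tz = "UTC"))

format(match_time, "%Y-%m-%d %H:%M")
## [1] "2014-10-28 10:20"
```

If the site renders times in the visitor's local zone, `tz` would need to be set to the zone of the machine that ran PhantomJS.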
hrbrmstr
  • I would note that RSelenium can also drive PhantomJS, which likewise doesn't require Java: http://rpubs.com/johndharrison/RSelenium-headless. – jdharrison Jan 01 '15 at 00:35
  • `system("phantomjs scrape.js > scrape.html")` seems to return the (`HTML`?) script in the `console` and `pg – niko Mar 09 '18 at 11:36
  • @nate.edwinton I was having the same issue. Try `sink("scrape.txt"); system("phantomjs scrape.js > scrape.html"); sink();` – JdeMello Mar 11 '18 at 20:52
  • I had the same issue as you and it worked for me. However, I still did not get the javascript-rendered content from the website so I wonder whether there must be any extra piece of code in the js-function body... – JdeMello Mar 11 '18 at 20:55
  • PhantomJS is now deprecated ... http://phantomjs.org/ What can we do??? :( – ℕʘʘḆḽḘ Sep 22 '18 at 00:38

You could also run the Selenium server in a Docker container (instead of installing Selenium locally).

You will still need to install PhantomJS, as well as Docker. Then run:

library(RSelenium)
library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"

system('docker run -d -p 4445:4444 selenium/standalone-chrome') 
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate(url)

writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
    console.log(page.content); //page source
    phantom.exit();
});", url), con="scrape.js")

system("phantomjs scrape.js > scrape.html", intern = TRUE)

# extract the content you need
pg <- read_html("scrape.html")
pg %>% html_nodes("#utime") %>% html_text()

# [1] "10:20 AM, October 28, 2014"
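
Note that the code above navigates with `remDr` but still takes its HTML from the PhantomJS output. As a hedged sketch of an alternative, the rendered source could be read from the Selenium session itself, which would make PhantomJS unnecessary. The helper name `scrape_utime`, the 2-second wait, and the default port are illustrative assumptions, and a Selenium server (for example the Docker container started above) must already be listening:

```r
# Hypothetical helper: fetch the rendered page through a running Selenium
# server and return the text of the #utime node.
# Assumes the RSelenium and rvest packages are installed.
scrape_utime <- function(url, port = 4445L) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,
    browserName = "chrome"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)  # close the browser session on exit

  remDr$navigate(url)
  Sys.sleep(2)  # crude wait for JavaScript to finish; the duration is a guess

  # getPageSource() returns a list whose first element is the HTML string
  page_source <- remDr$getPageSource()[[1]]

  pg <- rvest::read_html(page_source)
  rvest::html_text(rvest::html_nodes(pg, "#utime"))
}
```

It would be called as `scrape_utime(url)` once the container has finished starting up; a freshly launched container may need a few seconds before it accepts connections.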
stevec