I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#

I can do the following:

library(rvest)
doc <- html("http://www.soccer24.com/kosovo/superliga/results/")

but am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is

html_text(doc)

but that gives a long blob of weird text (which does include the data, but interspersed with odd code), and it's not at all clear how I would parse that.

What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.

Can anyone provide some hints as to how to extract that data from this site?

Peter Verbeet
  • Is this a one-time thing or do you need to do this on a regular basis? Seems like copy/paste into a text editor and doing some minor modifications could get you rolling much faster than digging into the code. – Jason V Apr 03 '15 at 12:29
  • I'd go the RSelenium route for this one. It will render the page and you can then access the DOM elements directly vs try to decode that data blob in `
    `
    – hrbrmstr Apr 03 '15 at 12:46
  • @JasonV This will be on a regular basis. There are multiple competitions I want to scrape and will need to stay up-to-date over time. So I definitely prefer a non-manual solution. – Peter Verbeet Apr 03 '15 at 12:50

1 Answer

Using Selenium with phantomjs

library(RSelenium)
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)

If you want to press the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):

webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while(webElem$isElementDisplayed()[[1]]){
  webElem$clickElement()
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
library(XML)  # provides htmlParse(), removeNodes() and readHTMLTable()
doc <- htmlParse(remDr$getPageSource()[[1]])

Remove the unwanted round rows and use XML::readHTMLTable for simplicity:

# Remove the unwanted round rows. Sometimes there are extra end-of-season games;
# these are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()
> head(appData)
  blank         Date           hteam            ateam score
1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
2       01.04. 18:00          Istogu         Hajvalia 2 : 1
3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4       01.04. 18:00       Prishtina          Drenica 3 : 0
5       31.03. 18:00       Besa Peje            Drita 1 : 0
6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0

> tail(appData)
    blank         Date            hteam     ateam score
115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
119       16.08. 22:00  Kosova Vushtrri     Drita 0 : 1
120       16.08. 22:00        Prishtina    Istogu 2 : 1

Carry out further formatting as needed.
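
As one possible example of such formatting, the sketch below splits the Date column into a day/month part and a time part, and the score column into integer goal counts. It is only an illustration: the sample rows are copied from the head(appData) output above, the new column names (day_month, time, hgoals, agoals) are made up here, and it assumes every score has the form "h : a" (postponed or annulled matches would need extra handling). The site shows no year, so none is added.

```r
# Sample rows copied from the head(appData) output above;
# in practice appData would come from the scraping code.
appData <- data.frame(
  blank = c("", "", ""),
  Date  = c("01.04. 18:00", "01.04. 18:00", "31.03. 18:00"),
  hteam = c("Ferronikeli", "Istogu", "Besa Peje"),
  ateam = c("Ferizaj", "Hajvalia", "Drita"),
  score = c("4 : 0", "2 : 1", "1 : 0"),
  stringsAsFactors = FALSE
)

# Drop the empty first column.
appData$blank <- NULL

# Split "dd.mm. HH:MM" on the space into a date part and a time part.
# (day_month and time are hypothetical column names.)
date_time <- strsplit(appData$Date, " ", fixed = TRUE)
appData$day_month <- vapply(date_time, `[`, character(1), 1)
appData$time      <- vapply(date_time, `[`, character(1), 2)

# Split "h : a" on the colon and keep only the digits of each half.
# Assumption: every score looks like "h : a".
goals <- strsplit(appData$score, ":", fixed = TRUE)
home  <- vapply(goals, `[`, character(1), 1)
away  <- vapply(goals, `[`, character(1), 2)
appData$hgoals <- as.integer(gsub("[^0-9]", "", home))
appData$agoals <- as.integer(gsub("[^0-9]", "", away))

# Keep the tidy columns in a convenient order.
appData <- appData[, c("day_month", "time", "hteam", "ateam", "hgoals", "agoals")]
print(appData)
```

Stripping everything except digits (rather than only trimming spaces) is a deliberate choice, since scraped cells sometimes contain non-breaking spaces that ordinary whitespace trimming would leave behind.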

jdharrison
  • thanks! The "remDr$getPageSource()" part of the code gives an error ("Error in fromJSON(content, handler, default.size, depth, allowComments, : invalid JSON input"). Any idea what's going on? – Peter Verbeet Apr 03 '15 at 15:04
  • Try installing the dev version of `RSelenium`. `devtools::install_github("ropensci/RSelenium")`. – jdharrison Apr 03 '15 at 15:34
  • Hi John, that works superbly! I am totally stunned by how you've solved this so quickly and elegantly. One quick additional question: how could I get RSelenium to activate the "Show more matches" button (below the data table), so that all results are included? (It now starts at round 3, but "Show more matches" also shows rounds 1 and 2.) – Peter Verbeet Apr 03 '15 at 16:04
  • @PeterVerbeet you can press the "show more matches" link until it is not visible. You may need to press it more than once, so I have added a while statement that conditions on the element being visible. – jdharrison Apr 03 '15 at 16:27
  • that's absolutely perfect, @jdharrison. Thank you for being so helpful. – Peter Verbeet Apr 03 '15 at 19:30
  • It turns out there is still a snag. I used your code to get the data from http://www.soccer24.com/netherlands/eerste-divisie-2013-2014/results/. Even though the "see more results" button is clicked until it no longer shows (which I can tell from remDr$screenshot(display = TRUE)), remDr$getPageSource()[[1]] only includes the matches from the start page, rather than ALL matches. Any idea what is going on? – Peter Verbeet Apr 10 '15 at 11:19
  • In this case there are extra end-of-season results in a separate table. You can `rbind` the two tables together when this happens. – jdharrison Apr 10 '15 at 12:29
  • Hi John, yes, I saw that there are two tables with results, but together they still only include the matches shown on the start page, rather than all the matches (that show after clicking the "show more results" element). – Peter Verbeet Apr 10 '15 at 12:43