
I am trying to scrape the report log table from the website "https://www.heritageunits.com/Locomotive/Detail/NS8098" with the code below (note that `GET()` comes from the httr package, not RCurl). It pulls in elements from the page, but when I scroll through the 10 items in the list stored under "page", none of them includes the table.

library("RCurl")
# Read page
page <- GET(
  url="https://heritageunits.com/Locomotive/Detail/NS8098",
  config(cainfo = cafile), ssl.verifyhost = FALSE
)

I would also like to scrape the tables shown when you toggle to the reports from previous days, but I am not sure how to select the previous report pages in R. Any help would be appreciated. Thanks.

mah271
  • Leaving a comment for future readers: this is an excellent example of where RSelenium is absolutely needed. See my comment on JackStat's answer as to why. Great, practical example to have here in SO. – hrbrmstr Feb 15 '16 at 12:32
  • Having said that, the site has an API (it's in their Terms of Service docs). You should reach out to them to see if you could just hit the API with `httr` instead of doing all this scraping; scraping is much more fragile than API calls. – hrbrmstr Feb 15 '16 at 12:33
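If the site did grant API access, the call itself would be short. A minimal sketch with httr and jsonlite, where the endpoint (taken from hrbrmstr's comment on JackStat's answer below) and the parameter names are assumptions, not a documented API:

library(httr)
library(jsonlite)

# Hypothetical request -- the endpoint comes from a comment below, and the
# body parameters ("unit", "page") are guesses; consult the site's API docs
resp <- POST(
  "https://heritageunits.com/Locomotive/DetailHistory",
  body = list(unit = "NS8098", page = 1),
  encode = "form"
)
reports <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))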

3 Answers


Occasionally I am able to find a JSON file in the source that you can hit directly, but I couldn't find one here. I went with RSelenium and had it click the Next button to cycle through the pages. This method is fragile, so you have to pay attention when you run it: if the data table has not fully loaded, it will duplicate the last page. I used a short Sys.sleep() to make sure each page has time to load, and I would recommend checking for duplicate rows at the end to catch any that slip through (see the sketch after the code). Again, it is fragile, but it works.

library(RSelenium)
library(XML)
library(foreach)


# Start Selenium server
checkForServer()
startServer()

remDr <- 
  remoteDriver(
    remoteServerAddr = "localhost" 
    , port = 4444
    , browserName = "chrome"
)

remDr$open()

# Navigate to page
remDr$navigate("https://www.heritageunits.com/Locomotive/Detail/NS8098")

# Snag the html
outhtml <- remDr$findElement(using = 'xpath', "//*")
out <- outhtml$getElementAttribute("outerHTML")[[1]]

# Parse with XML (htmlParse comes from the XML package)
doc <- htmlParse(out, encoding = "UTF-8")

# Get the last page number so we can cycle through
PageNodes <- getNodeSet(doc, '//*[(@id = "history_paginate")]')
Pages <- sapply(X = PageNodes, FUN = xmlValue)
LastPage <- as.numeric(gsub('Previous12345\\…(.*)Next', '\\1', Pages))


# loop through one click at a time
Locomotive <- foreach(i = 1:(LastPage-1), .combine = 'rbind', .verbose = TRUE) %do% {

  if(i == 1){

    readHTMLTable(doc)$history

  } else {

    nextpage <- remDr$findElement("css selector", '#history_next')
    nextpage$sendKeysToElement(list(key ="enter"))

    # Take it slow so it gets each page
    Sys.sleep(.50)

    outhtml <- remDr$findElement(using = 'xpath', "//*")
    out <- outhtml$getElementAttribute("outerHTML")[[1]]

    # Parse with XML
    doc <- htmlParse(out, encoding = "UTF-8")
    readHTMLTable(doc)$history
  }


}
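As noted above, a pass for duplicate rows after the loop catches any page that was captured twice before the table redrew; a minimal sketch:

# Drop rows duplicated when a page was scraped before the table had redrawn
Locomotive <- Locomotive[!duplicated(Locomotive), ]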
JackStat
  • There is a JSON source: `https://heritageunits.com/Locomotive/DetailHistory` BUT the site also uses javascript to populate extra form fields that are dynamically calculated. RSelenium is the only way to go with this one unless one wanted to figure out a way to extract and pass the js execution stuff into V8, but even then it seems to rely on DOM presence, so that probably wouldn't work. – hrbrmstr Feb 15 '16 at 12:30
  • Thanks @JackStat. Much appreciated. I ran this code and get an error: Error: Summary: UnknownError Detail: An unknown server-side error occurred while processing the command. class: java.lang.IllegalStateException at the remDr$open() line in the code. I thought it may have something to do with my older laptop with Windows 7 enterprise, tried it on Windows 10 desktop and at the same code location got: Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)). Any thoughts? – mah271 Feb 16 '16 at 03:18
  • try an empty remote driver: `remDr <- remoteDriver()` – JackStat Feb 16 '16 at 03:50
  • @JackStat: Thanks. Still doesn't work. Am investigating Windows drivers, Java interface/version and R Selenium. Seems like the problem lies in the connection between the 3. By chance, did you do this on a Mac? – mah271 Feb 22 '16 at 03:32
  • @JackStat I have been working with this over the past several months and did finally get it to work on a Windows PC. – mah271 Oct 24 '16 at 16:28
  • To get this to work on a PC, complete the following: – mah271 Oct 24 '16 at 16:30
  • 1. Ensure that the selenium-server-standalone.jar file and the Google Chrome driver are in the same folder as the Windows command directory setting. 2. Open the Windows command prompt. 3. Type "java -jar selenium-server-standalone.jar" and hit Enter. – mah271 Oct 24 '16 at 16:30
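The server can also be launched from within R instead of a separate command window; a sketch, assuming selenium-server-standalone.jar sits in the current working directory:

# Start the standalone Selenium server as a background process from R
system("java -jar selenium-server-standalone.jar", wait = FALSE)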

Missed by a few minutes. I took the RSelenium snippet found on another question and altered it to suit. I think this one's a little shorter, though. I didn't hit any issues with the page not loading. (See the sketch after the code for collecting the remaining pages.)

## required packages
library(RSelenium)
library(rvest)
library(magrittr)
library(dplyr)


## start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

## send Selenium to the page
remDr$navigate("https://www.heritageunits.com/Locomotive/Detail/NS8098")

## get the page html
page_source <- remDr$getPageSource()

## parse it and extract the table, convert to data.frame
read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table() %>% extract2(1)
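This grabs only the first page. To collect the rest you would still need to drive the paginator, continuing the same Selenium session; a sketch that assumes the Next button keeps the `#history_next` id used in the other answers:

# Click "Next", give the table time to redraw, then re-parse (sketch)
next_btn <- remDr$findElement("css selector", "#history_next")
next_btn$clickElement()
Sys.sleep(0.5)
page2 <- read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes("table") %>%
  html_table() %>%
  extract2(1)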
Jonathan Carroll
  • Nice! I wrote the loop to get the other pages. You should also add `read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table() %>% .[[1]]` – JackStat Feb 15 '16 at 04:06
  • Cool. I just meant the bits that `rvest` takes care of. I'm not sure that `rvest` could do the next page bits, so your solution is better in that respect. – Jonathan Carroll Feb 15 '16 at 04:14
  • Thanks @Jonathan Carroll. Much appreciated as well. Similar to my comment above, I get an error at the remDr$open() code line and am not sure what may be causing it to stop at this point. – mah271 Feb 16 '16 at 03:29
  • Start a new question or work some Google-fu. Good luck. – Jonathan Carroll Feb 16 '16 at 03:30
  • @JonathanCarroll: thanks - see comment above, still doesn't work, but am trying other things. By chance did you do this on a Mac? – mah271 Feb 22 '16 at 03:33
  • No, sorry. Linux mainly, occasionally Windows. – Jonathan Carroll Feb 22 '16 at 03:59

Building off of what JackStat outlined above, I modified the page-determination scheme to handle units with fewer than 5 pages of reports (JackStat's algorithm throws an error on those). I also added an import that reads in the units of interest to be tracked. Comments in the code cover the steps needed to get this running on a Windows PC, and a single-regex alternative to the page detection is sketched after the code.

library(RSelenium)
library(XML)
library(foreach)

### Ensure that the selenium-server-standalone.jar file and the Google Chrome driver are in the same folder
### as the Windows command directory setting
### Open the Windows command prompt
### Type "java -jar selenium-server-standalone.jar" and hit Enter

setwd("H:/heritage_units")
hu <- read.table("hu_tracked_101316.csv", sep = ",", header = TRUE, colClasses = "character")
hu.c <- hu[, 1]

# Start Selenium server
checkForServer()
startServer()

remDr <- 
    remoteDriver(
            remoteServerAddr = "localhost" 
            , port = 4444
            , browserName = "chrome"
    )

remDr$open()

# check.names = FALSE keeps the spaces in the column names so that rbind()
# matches the headers returned by readHTMLTable()
master <- data.frame('Spotted On' = factor(), 'Location' = factor(), 'Direction' = factor(),
                     'Train No' = factor(), 'Leading' = factor(), 'Spotter Reputation' = factor(),
                     'Heritage Unit' = character(), check.names = FALSE)

for (u in seq_along(hu.c)) {
    url <- paste("https://www.heritageunits.com/Locomotive/Detail/", hu.c[u], sep="")
    print(hu.c[u])

    # Navigate to page
    remDr$navigate(url)

    # Snag the html
    outhtml <- remDr$findElement(using = 'xpath', "//*")
    out <- outhtml$getElementAttribute("outerHTML")[[1]]

    # Parse with XML
    doc <- htmlParse(out, encoding = "UTF-8")

    # Get the last page number so we can cycle through
    PageNodes <- getNodeSet(doc, '//*[(@id = "history_paginate")]')
    Pages <- sapply(X = PageNodes, FUN = xmlValue)
    # Locate the horizontal ellipsis in the paginator text: it is the only
    # character that is neither a letter nor a digit. sc stays 0 when there
    # are five or fewer pages and no ellipsis is shown.
    sc <- 0
    for (j in 1:nchar(Pages)) {
            if (!(grepl("[[:alpha:]]", substr(Pages, j, j)) | grepl("[[:digit:]]", substr(Pages, j, j)))) {
                    sc <- j
            }
    }
    # The last page number sits immediately before the "N" of "Next":
    # right after the ellipsis, or, with no ellipsis, as the digit before "N"
    posN <- gregexpr(pattern = 'N', Pages)
    if (sc == 0) {
            LastPage <- substr(Pages, posN[[1]] - 1, posN[[1]] - 1)
    } else {
            LastPage <- substr(Pages, sc + 1, posN[[1]] - 1)
    }

    temp1 <- readHTMLTable(doc)$history
    temp1$'Heritage Unit' <- hu.c[u]
    # Convert and use seq_len()[-1] so a single-page unit skips the loop cleanly
    LastPage <- as.numeric(LastPage)
    for (i in seq_len(LastPage)[-1]) {
            nextpage <- remDr$findElement("css selector", '#history_next')
            nextpage$sendKeysToElement(list(key ="enter"))

            # Take it slow so it gets each page
            Sys.sleep(.50)

            outhtml <- remDr$findElement(using = 'xpath', "//*")
            out <- outhtml$getElementAttribute("outerHTML")[[1]]

            # Parse with XML
            doc <- htmlParse(out, encoding = "UTF-8")
            temp2 <- readHTMLTable(doc)$history
            temp2$'Heritage Unit' <- hu.c[u]
            temp1 <- rbind(temp1, temp2)
    }
    master <- rbind(master, temp1)
}

write.csv(master, "hu_sel_date.csv")
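For what it's worth, the ellipsis hunt above could likely be collapsed into one regular expression that grabs the final run of digits before "Next"; a minimal, untested sketch:

# Extract the last page number directly, with or without an ellipsis present
LastPage <- as.numeric(sub(".*?(\\d+)Next$", "\\1", Pages, perl = TRUE))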
mah271