
I have a script that parses Yahoo Finance's historical pricing data for a vector of ticker symbols. It uses date codes in the URL to cover the timeframe from 1/1/2014 to yesterday. The script runs without issues, but I'm only getting the first 100 rows. The problem appears to be that Yahoo Finance (even with a large date range selected) only shows the first 100 results until you scroll down. Is there a workaround?

You can see the issue going here...

# Example to test...
library(XML)  # provides htmlTreeParse(), getNodeSet(), readHTMLTable()

Ticker <- c("AMZN", "F")
maxDate <- 1548918000  # period2 date code

for (s in Ticker) {
  # Build the history URL for this ticker
  url <- paste('https://finance.yahoo.com/quote/', s, '/history?period1=1388559600&period2=', maxDate, '&interval=1d&filter=history&frequency=1d', sep = "")
  webpage <- readLines(url, warn = FALSE)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # Read the first table on the page and tag it with its symbol
  df <- readHTMLTable(tableNodes[[1]],
                      header = c("Date", "Open", "High", "Low", "Close", "Adj. Close", "Volume"))
  df['Symbol'] <- s
  assign(s, df)
}

tickerDataList <- cbind(mget(Ticker))
tickerData <- do.call(rbind, tickerDataList)

The expected result would be the same, but with a date range going back to 1/1/14. That would mean a couple thousand rows rather than two hundred.

Julius Vainora
Lindon

1 Answer


We may utilize what this answer proposes. For instance,

library(RSelenium)
library(rvest)
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("https://finance.yahoo.com/quote/AMZN/history?period1=1388559600&period2=1548918000&interval=1d&filter=history&frequency=1d")

for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,", i * 10000,");"))
  Sys.sleep(3)    
}

page_source <- remDr$getPageSource()
out <- read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table()
nrow(out[[1]])
# [1] 801

801 rows is still not everything you need, but scrolling more than 5 times (and perhaps increasing the 10000 scroll step) should eventually load the full table.
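Rather than guessing the number of scrolls, one option is to keep scrolling until the row count stops growing. The following is only a sketch: `scroll_until_stable`, `scroll_once`, `count_rows`, `max_scrolls` and `patience` are hypothetical names, not part of RSelenium, and the commented-out wiring to `remDr` assumes the table rows match the CSS selector `table tbody tr`, which I have not verified against the live page.

```r
# Keep scrolling until the row count stops growing.
# `scroll_once` and `count_rows` are hypothetical callbacks supplied by the caller.
scroll_until_stable <- function(scroll_once, count_rows,
                                max_scrolls = 100, patience = 2) {
  last <- count_rows()
  stalls <- 0
  for (i in seq_len(max_scrolls)) {
    scroll_once(i)
    n <- count_rows()
    if (n > last) {
      last <- n      # new rows appeared, reset the stall counter
      stalls <- 0
    } else {
      stalls <- stalls + 1
      if (stalls >= patience) break  # nothing new after `patience` scrolls
    }
  }
  last
}

# Possible wiring to the session above (assumption; needs a live browser):
# scroll_until_stable(
#   scroll_once = function(i) {
#     remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
#     Sys.sleep(3)
#   },
#   count_rows = function() {
#     length(remDr$findElements(using = "css selector", value = "table tbody tr"))
#   }
# )
```

Once the loop returns, `remDr$getPageSource()` and `html_table()` can be used as above. The `patience` argument is there because a slow page may need more than one scroll before new rows show up.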

Julius Vainora
  • Oh, yes. I certainly should have clarified. I get an error when using the getSymbols command - due to our firewall or something preventing us from connecting to an API or something. Either way, we had to parse the data. – Lindon Feb 01 '19 at 14:33
  • My thinking is that the only way to get it to work will be to loop it through 100 days at a time? – Lindon Feb 01 '19 at 14:34
  • This is very helpful. It unfortunately doesn't work for the same reason I can't connect to the quantmod for getSymbols...issue with the API timing out. I have a dataset where I could lo – Lindon Feb 01 '19 at 18:45
  • @Lindon, I'm no expert about this, but somehow I suspect that it's more about user rights rather than firewall. – Julius Vainora Feb 01 '19 at 19:13
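On the idea from the comments of looping through 100 days at a time: a rough sketch of how the date range could be split into windows and turned into URLs. The helper names (`to_epoch`, `date_windows`, `history_url`) are hypothetical. The UTC-7 offset is an assumption chosen because it reproduces the question's `period1` code (1388559600 for 1/1/2014), and using the day after each window's end as `period2` is a guess about how Yahoo treats the upper bound.

```r
# Hypothetical helpers (not from the question): split a date range into
# windows short enough that each page should stay under 100 rows.

# Midnight on date `d` as Unix seconds. utc_offset_hours = -7 is an assumption
# that reproduces the question's period1 (1388559600 for 2014-01-01).
to_epoch <- function(d, utc_offset_hours = -7) {
  as.numeric(as.Date(d)) * 86400 - utc_offset_hours * 3600
}

# Consecutive windows of at most `days` calendar days covering [start, end].
date_windows <- function(start, end, days = 100) {
  start <- as.Date(start)
  end <- as.Date(end)
  starts <- seq(start, end, by = paste(days, "days"))
  data.frame(start = starts, end = pmin(starts + days - 1, end))
}

# Same URL pattern as the question, with both period codes filled in.
# sprintf("%.0f") avoids scientific notation such as "1.4e+09".
history_url <- function(symbol, start, end) {
  paste0("https://finance.yahoo.com/quote/", symbol,
         "/history?period1=", sprintf("%.0f", to_epoch(start)),
         "&period2=", sprintf("%.0f", to_epoch(as.Date(end) + 1)),
         "&interval=1d&filter=history&frequency=1d")
}

windows <- date_windows("2014-01-01", "2019-01-31")
urls <- history_url("AMZN", windows$start, windows$end)
```

Each element of `urls` could then go through the same `readLines` / `readHTMLTable` steps as in the question, with the pieces combined using `rbind`. Rows that fall on a window boundary may appear twice and would need to be dropped. Whether every window is served in full without scrolling is not confirmed; it only follows from the observation that the first 100 rows are present in the initial page.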