I'm trying to scrape data, in the form of a table, and the hrefs from within that table, from a town assessor's website using the R package rvest. Despite having had luck scraping tables from other websites (e.g. Wikipedia), I'm unable to get anything from the town assessor's site.

I am using RStudio v1.1.442 and R v3.5.0.

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_0.3.3 xml2_1.2.0  V8_2.2     

loaded via a namespace (and not attached):
 [1] httr_1.4.0     compiler_3.5.0 selectr_0.4-1  magrittr_1.5   R6_2.4.0       tools_3.5.0    yaml_2.2.0    
 [8] curl_3.3       Rcpp_1.0.1     stringi_1.4.3  stringr_1.4.0  jsonlite_1.6   

I have tried to follow a few examples. First, the Wikipedia state-population example, which works fine:

url <- "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population"
population <- url %>%
  read_html() %>%
  html_nodes("#mw-content-text > div > table:nth-child(11)") %>%
  html_table()
population <- population[[1]]

I've also been able to scrape data from Yelp without issue. This, for example, gives me the names of the restaurants:

url <- "https://www.yelp.com/search?find_loc=New+York,+NY,+USA"
heading <- url %>%
  read_html() %>%
  html_nodes(".alternate__373c0__1uacp .link-size--inherit__373c0__2JXk5") %>%
  html_text()

The website I'm having trouble with is this one, which shows the output of a search for properties on a specific street:

url <- "https://imo.ulstercountyny.gov/viewlist.aspx?sort=printkey&swis=all&streetname=Lake+Shore+Dr"
helpme <- url %>%
  read_html() %>%
  html_nodes("#tblList > tbody") %>%
  html_table()

I would also like to be able to pull out the hrefs using something like this:

helpme <- url %>%
  read_html() %>%
  html_nodes("#tblList a") %>%  # the hrefs live on the <a> tags inside the table
  html_attr("href")             # returns a character vector; no html_text() needed

Unfortunately, both attempts come back empty.

Is there something strange about this website? I've used the Chrome browser inspector and SelectorGadget to help find the right CSS selectors. I've also tried it with the XPath equivalents; the result is the same either way.
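
One check I can run, on the assumption that the server is sending back something other than the search results:

page <- read_html(url)
# If the request is redirected to a login/disclaimer page, the title and
# the number of tables should give it away
page %>% html_node("title") %>% html_text()
length(html_nodes(page, "table"))   # 0 tables would explain the empty results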

  • Looks like it redirects to a login page. Do you have a user account or are you wanting to access via what the site calls "Public Access"? – Stuart Allen May 16 '19 at 03:14
  • See if [this answer](https://stackoverflow.com/questions/52855989/scrape-aspx-page-with-r) helps you. Or [this one](https://stackoverflow.com/questions/8357298/web-scrape-asp-net-web-site-with-r) also. – R. Schifini May 16 '19 at 03:27
  • I'm using the Public Access – James Gregory May 16 '19 at 03:27
  • Public access appears to require session cookies that are created when you click "Click for public access". You need a way to manage cookies and headers across multiple GET and POST requests, which is too much for rvest. You might be able to get it working with httr (see the sketch after these comments), but I think RSelenium would be easiest. I was able to get the data by using [this solution](https://stackoverflow.com/questions/56118999/issue-scraping-page-with-load-more-button-with-rvest/56125902#56125902) pretty much unchanged, except for the URL and the CSS. – gersht May 16 '19 at 10:42
  • Thanks. I'll give it a shot. So far I haven't been able to get Selenium to work properly. Chrome seems to close the window, maybe because it's a pop-up or something, and I can't connect to Firefox. I'll keep working on this and see if I can make any progress. – James Gregory May 16 '19 at 12:31
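
Update: following gersht's comment, I'm trying the httr route. Below is a minimal sketch. It assumes the "Click for public access" step is a plain GET to an entry page that sets session cookies; it may instead be a POST with hidden ASP.NET form fields, and the /index.aspx path is a guess on my part.

library(httr)
library(rvest)

# Reusing one handle keeps cookies across requests to the same host
h <- handle("https://imo.ulstercountyny.gov")

# Step 1: hit the entry page so the server issues session cookies
# ("/index.aspx" and plain GET are assumptions, not confirmed)
GET(handle = h, path = "/index.aspx")

# Step 2: request the street search with the same handle (same cookies)
res <- GET(handle = h, path = "/viewlist.aspx",
           query = list(sort = "printkey", swis = "all",
                        streetname = "Lake Shore Dr"))

page  <- content(res)   # parsed as an xml_document when the response is HTML
tbl   <- html_table(html_node(page, "#tblList"), fill = TRUE)
hrefs <- html_attr(html_nodes(page, "#tblList a"), "href")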
