I'm new to R and have been trying to crawl this website: http://rera.rajasthan.gov.in/ProjectSearch

I'm trying to get the list of all projects in the table including the url to the "View" button but have been failing miserably.

The table appears once you've clicked the Search button below the form.

So far I've tried rvest without success, because I can't find a URL parameter or pagination variable that would let me crawl the table on the site.

Is there a way to crawl all 788 items in the table?

Should I be using some other tool, such as RSelenium?

Megh

1 Answer

You can combine RSelenium and rvest. Here is a code snippet that gets the links on the first page.

1) Start Selenium. A good walkthrough is this StackOverflow answer: can't execute rsDriver (connection refused).

In short, install Docker, pull a headless browser image, and start the container from a terminal with docker run -d -p 4445:4444 selenium/standalone-chrome

2) Then, in RStudio, use these lines to start RSelenium, navigate to the page, click the Search button, and harvest the links:

library(RSelenium)
library(rvest)
library(tidyverse)


remDr <- remoteDriver(remoteServerAddr = "localhost", 
                      port = 4445L, 
                      browserName = "chrome")
remDr$open()

remDr$navigate("http://rera.rajasthan.gov.in/ProjectSearch")

# find and click the Search button
search <- remDr$findElement(using = "id", value = "btn_SearchProjectSubmit")
search$sendKeysToElement(list("\uE007")) # press Enter; search$clickElement() also works

# get the html code of the results page
html <- remDr$getPageSource()[[1]] %>% read_html()

# get the links
links <- html %>% html_nodes("#OuterProjectGrid td a") %>% html_attr("href")

Then you still need to handle the pagination, e.g. with a loop or with map from purrr, clicking through the pages and harvesting the links from each one.
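A minimal sketch of that pagination loop, continuing from the session above. Note the "Next" link text and the assumption that the pager is rendered as ordinary links are guesses about the page's HTML; inspect the actual pager element and adjust the selector accordingly:

```r
library(RSelenium)
library(rvest)
library(purrr)

# helper: harvest the links from the currently loaded page
get_links <- function(remDr) {
  remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes("#OuterProjectGrid td a") %>%
    html_attr("href")
}

all_links <- list(get_links(remDr)) # first page, already loaded

repeat {
  # look for a pager link labelled "Next" (assumed label; check the real page)
  nxt <- tryCatch(
    remDr$findElement(using = "link text", value = "Next"),
    error = function(e) NULL
  )
  if (is.null(nxt)) break          # no more pages
  nxt$clickElement()
  Sys.sleep(2)                     # crude wait for the table to reload
  all_links <- c(all_links, list(get_links(remDr)))
}

all_links <- flatten_chr(all_links)
```

A fixed Sys.sleep is the simplest wait; for a more robust version you could poll until the table's contents change.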

kabr