
I am trying to get all decrees of the Federal Supreme Court of Switzerland, available at: https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=&to_date=&x=12&y=12 Unfortunately, no API is provided. The CSS selector for the data I want to retrieve is .para.

I am aware of http://relevancy.bger.ch/robots.txt:

User-agent: *
Disallow: /javascript
Disallow: /css
Disallow: /hashtables
Disallow: /stylesheets
Disallow: /img
Disallow: /php/jurivoc
Disallow: /php/taf
Disallow: /php/azabvger
Sitemap: http://relevancy.bger.ch/sitemaps/sitemapindex.xml
Crawl-delay: 2

To me it looks like the URL I am looking at is allowed to be crawled. Is that correct? In any case, the Federal Court explains that these rules are targeted at big search engines and that individual crawling is tolerated.
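For what it's worth, I can also check this programmatically with the robotstxt package (just a sketch, and it assumes www.bger.ch serves the same rules as relevancy.bger.ch):

# install.packages("robotstxt")
library(robotstxt)

# TRUE means no Disallow rule matches this path for generic bots ("*")
paths_allowed(
  paths = "/ext/eurospider/live/de/php/aza/http/index.php",
  domain = "www.bger.ch",
  bot = "*"
)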

I can retrieve the data for a single decree (using https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/)

library(rvest)

url <- 'https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=&to_date=&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&rank=1&azaclir=aza&highlight_docid=aza%3A%2F%2F18-12-2017-6B_790-2017&number_of_ranks=113971'

webpage <- read_html(url)

decree_html <- html_nodes(webpage, '.para')

decree1_data <- html_text(decree_html)

However, since rvest only extracts data from one specific page and my data is spread across multiple pages, I tried Rcrawler (https://github.com/salimk/Rcrawler), but I do not know how to crawl the site structure on www.bger.ch to get all URLs with the decrees.

I checked out the following posts, but could still not find a solution:

R web scraping across multiple pages

Rvest: Scrape multiple URLs

captcoma

1 Answer


I don't do error handling below since that's beyond the scope of this question.

Let's start with the usual suspects:

library(rvest)
library(httr)
library(tidyverse)

We'll define a function that will get us a page of search results by page number. I've hard-coded the search parameters since you provided the URL.

In this function, we:

  • get the page HTML
  • get the links to the documents we want to scrape
  • get document metadata
  • make a data frame
  • add attributes to the data frame for page number grabbed and whether there are more pages to grab

It's pretty straightforward:

get_page <- function(page_num=1) {

  GET(
    url = "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php",
    query = list(
      type="simple_query",
      lang="de",
      top_subcollection_aza="all",
      query_words="",
      from_date="",
      to_date="",
      x="12",
      y="12",
      page=page_num
    )
  ) -> res

  warn_for_status(res) # should be stop_for_status() and you should do error handling

  pg <- content(res)

  links <- html_nodes(pg, "div.ranklist_content ol li")

  data_frame(
    link = html_attr(html_nodes(links, "a"), "href"),
    title = html_text(html_nodes(links, "a"), trim=TRUE),
    court = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'court')]"), trim=TRUE), # these are "dangerous" if they aren't there, but you can wrap error handling around this (see the per-item sketch after this function)
    subject = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'subject')]"), trim=TRUE),
    object = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'object')]"), trim=TRUE)
  ) -> xdf

  # this looks for the text at the bottom paginator. if there's no link then we're done

  attr(xdf, "page") <- page_num
  attr(xdf, "has_next") <- html_node(pg, xpath="boolean(.//a[contains(., 'Vorwärts')])")

  xdf

}
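As the comment on the court/subject/object lines notes, those nodes can be missing for some hits, which makes the column lengths disagree. One common workaround (a sketch, not part of the original function) is to extract the fields per <li> with the singular html_node(), since it returns NA for a missing match:

# Per-item extraction: html_node() (singular) yields NA when a node is absent,
# so every hit keeps all five fields even if one of them is missing.
parse_hit <- function(li) {
  data_frame(
    link    = html_attr(html_node(li, "a"), "href"),
    title   = html_text(html_node(li, "a"), trim=TRUE),
    court   = html_text(html_node(li, xpath=".//a/../../div/div[contains(@class, 'court')]"), trim=TRUE),
    subject = html_text(html_node(li, xpath=".//a/../../div/div[contains(@class, 'subject')]"), trim=TRUE),
    object  = html_text(html_node(li, xpath=".//a/../../div/div[contains(@class, 'object')]"), trim=TRUE)
  )
}

# Inside get_page() you would then build xdf with map_df(links, parse_hit) instead of the single data_frame() call.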

Make a helper function since I can't stand typing attr(...) and it reads better in use:

has_next <- function(x) { attr(x, "has_next") } 

Now, make the scraping loop. I stop at 6 pages just because; you should remove that logic if you want to scrape everything. Consider doing this in batches since internet connections are unstable things:

pg_num <- 0
all_links <- list()

repeat {
  cat(".") # poor dude's progress bar
  pg_num <- pg_num + 1
  pg_df <- get_page(pg_num)
  all_links <- append(all_links, list(pg_df)) # keep this page's results before deciding whether to stop
  if (!has_next(pg_df)) break
  if (pg_num == 6) break # this is here for me since I don't need ~11,000 documents
  Sys.sleep(2) # robots.txt crawl delay
}
cat("\n")

Turn the list of data frames into one big one. NOTE: You should do validity tests before this since web scraping is fraught with peril. You should also save off this data frame to an RDS file so you don't have to do it again.

lots_of_links <- bind_rows(all_links)
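A minimal sketch of that RDS caching (the file name is just an example):

saveRDS(lots_of_links, "bger_links.rds")     # cache so the pagination crawl isn't repeated

# lots_of_links <- readRDS("bger_links.rds") # later sessions can start from the cache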

glimpse(lots_of_links)
## Observations: 60
## Variables: 5
## $ link    <chr> "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&...
## $ title   <chr> "18.12.2017 6B 790/2017", "14.12.2017 6G 2/2017", "13.12.2017 5A 975/2017", "13.12.2017 5D 257/2017", "...
## $ court   <chr> "Strafrechtliche Abteilung", "Cour de droit pénal", "II. zivilrechtliche Abteilung", "II. zivilrechtlic...
## $ subject <chr> "Straf- und Massnahmenvollzug", "Procédure pénale", "Familienrecht", "Schuldbetreibungs- und Konkursrec...
## $ object  <chr> "Bedingte Entlassung aus der Verwahrung, Beschleunigungsgebot", "Demande d'interprétation et de rectifi...

With all the links in hand, we'll get the documents.

Define a helper function. NOTE we aren't parsing here. Do that separately. We'll store the inner content <div> HTML text so you can parse it later.

get_documents <- function(urls) {
  map_chr(urls, ~{
    cat(".") # poor dude's progress ber
    Sys.sleep(2) # robots.txt crawl delay 
    read_html(.x) %>% 
      xml_node("div.content") %>% 
      as.character() # we do this b/c we aren't parsing it yet but xml2 objects don't serialize at all
  })
}

Here's how to use it. Again, remove head() but also consider doing it in batches.

head(lots_of_links) %>% # I'm not waiting for 60 documents
  mutate(content = get_documents(link)) -> links_and_docs
cat("\n")

glimpse(links_and_docs)
## Observations: 6
## Variables: 6
## $ link    <chr> "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&...
## $ title   <chr> "18.12.2017 6B 790/2017", "14.12.2017 6G 2/2017", "13.12.2017 5A 975/2017", "13.12.2017 5D 257/2017", "...
## $ court   <chr> "Strafrechtliche Abteilung", "Cour de droit pénal", "II. zivilrechtliche Abteilung", "II. zivilrechtlic...
## $ subject <chr> "Straf- und Massnahmenvollzug", "Procédure pénale", "Familienrecht", "Schuldbetreibungs- und Konkursrec...
## $ object  <chr> "Bedingte Entlassung aus der Verwahrung, Beschleunigungsgebot", "Demande d'interprétation et de rectifi...
## $ content <chr> "<div class=\"content\">\n      \n<div class=\"para\"> </div>\n<div class=\"para\">Bundesgericht </div>...
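Since content holds raw HTML strings, the .para paragraphs from the question can be pulled out of it later. A sketch, assuming the links_and_docs data frame from above:

# Re-parse each stored <div class="content"> and extract its .para paragraphs as text
decree_texts <- map(links_and_docs$content, ~{
  read_html(.x) %>%
    html_nodes(".para") %>%
    html_text(trim=TRUE)
})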

You still need error & validity checking in various places and may need to re-scrape pages if there are server errors or parsing issues. But this is how to build a site-specific crawler of this nature.
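For the batching mentioned above, here's a rough sketch (chunk size and file names are arbitrary, and it still has no error handling):

# Split the link table into chunks of 50 rows and cache each chunk as it finishes,
# so a dropped connection only costs the current batch.
chunks <- split(lots_of_links, (seq_len(nrow(lots_of_links)) - 1) %/% 50)

for (i in seq_along(chunks)) {
  batch <- chunks[[i]]
  batch$content <- get_documents(batch$link)
  saveRDS(batch, sprintf("bger_batch_%03d.rds", i))
}

# bind_rows(lapply(list.files(pattern="^bger_batch_.*\\.rds$"), readRDS)) reassembles everything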

hrbrmstr
  • Dear hrbrmstr, many thanks for your code!! I get an error when I run the scraping loop: `Error in has_next(pg_df) : object 'xdf' not found` – captcoma Dec 22 '17 at 22:40
  • Apologies. I changed the function definition line for `has_next`. – hrbrmstr Dec 22 '17 at 23:19
  • `if (!has_next(pg_df)) break` still stops the loop (despite there being a link in the bottom paginator); without the `!` it works. However, since one search is limited to 2000 hits (a warning pops up when you click on "Vorwärts" [here](https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&page=200&from_date=&to_date=&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=)), I will use `if (pg_num == 200) break` to stop and adjust the `from_date=` to do the whole process in batches, as you recommended. Many thanks again, your code is very inspiring! – captcoma Dec 23 '17 at 00:28
  • You mentioned wrapping error handling around the `html_nodes` calls. I tried to implement your nice [wrap error handling function](https://stackoverflow.com/questions/30721519/rvest-package-is-it-possible-for-html-text-to-store-an-na-value-if-it-does-n) as `object = html_text_na(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'object')]"), trim=TRUE)`, but I still get the same error: `Error: Column object must be length 1 or 10, not 9`, which looks to me like the object node is missing in 1 of 10 cases. – captcoma Dec 23 '17 at 13:11
  • There are scads of answers on SO that show the alternate way of handling this (where there's a missing element in an XPath selector). – hrbrmstr Dec 23 '17 at 14:14