I want to fill in a web form, submit my query and download the resulting data. Some of the fields offer either a drop-down menu or a free-text search, and any field can be left blank (if all fields are left blank, the entire database is downloaded). Hitting the "Search and Download" button should trigger the download of a file.

Here is what I have tried (selecting all records for species "Salmo salar") based on this question. I used my browser (Opera) "Developer Tools" to inspect page elements and identify the names of all the possible fields:

library(httr)

url <- "https://nzffdms.niwa.co.nz/search"

fd <- list(
  search_catchment_no_name = "",
  search_river_lake = "",
  search_sampling_locality = "",
  search_fishing_method = "",
  search_start_year = "",
  search_end_year = "",
  search_species  = "Salmo salar", # species of interest
  search_download_format = 1,      # select csv file format
  submit = "Search and Download"
)

POST(url, body = fd, encode = "form")

I had hoped this would download a csv file (all records for species "Salmo salar"), but no file downloads. Instead it outputs this (a list of 10, just showing the first bit):

Response [https://nzffdms.niwa.co.nz/search]
Date: 2019-10-02 23:35
Status: 200
Content-Type: text/html; charset=utf-8
Size: 19.1 kB
<!DOCTYPE html>  
  <html>  
  <head>  
  <meta http-equiv="Content-Type" content="text/html; c...
    <meta name="title" content="NZ Freshwater Fish Database...
<meta name="description" content="NIWA NZ Freshwater Fish...
<meta name="keywords" content="NIWA, NZ, Freshwater Fish" />
<meta name="language" content="en" />
<meta name="robots" content="index, follow />

...

Edit

I think the issue is with how I am calling the "Search and Download" button. When inspecting the web page, most fields look like this:

# end year field
<input maxlength="4" class="form-control" type="text" name="search[end_year]" id="search_end_year">

But the "Search and Download" button element doesn't have a name or id attribute:

<input type="submit" value="Search and Download" class="btn btn-primary btn-md">

Also, I have just noticed there is a hidden field. Maybe I need to define this?

<input type="hidden" name="search[_csrf_token]" value="d1530f09c1ce8110b5163bd100cb0d67" id="search__csrf_token">

Any advice on how I can get the file downloading would be much appreciated.

flee

1 Answer

First, check robots.txt on the website. It is commented out as of Oct 3, 2019.
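As a rough sketch of that check, the snippet below drops blank lines and `#` comments from the lines of a robots.txt and returns whatever directives remain. The helper name `active_robots_rules` and the sample text are made up for illustration. For a live site you would pass in the lines read from its robots.txt instead.

```r
# Hypothetical helper: return the robots.txt lines that are actual
# directives, i.e. neither blank nor commented out with '#'.
active_robots_rules <- function(lines) {
  lines <- trimws(lines)
  lines[nzchar(lines) & !startsWith(lines, "#")]
}

# Made-up sample in which every rule is commented out
sample_txt <- c("# User-agent: *", "# Disallow: /", "")
active_rules <- active_robots_rules(sample_txt)
length(active_rules)  # zero here, because no directive is active
```

If the result is not empty, the remaining `User-agent` and `Disallow` lines still need to be read to see what is permitted.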

Then read the terms and conditions on https://nzffdms.niwa.co.nz/terms and https://www.niwa.co.nz/freshwater-and-estuaries/nzffd/user-guide/tips and make sure you comply with them.

It is also important to throttle the requests below.

After checking all the terms and conditions, you can use the code below to query for your data:

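library(httr)
library(xml2)

# Fetch the search page so the option lists and the CSRF token can be read from it
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(content(gr, "text"))     # doc <- read_html(gr) works as well

# Return the labels and values of the <option>s inside the <select> named "search<x>"
getTbl <- function(x) {
    do.call(rbind, lapply(xml_find_all(doc, paste0(".//select[@name='search",x,"']/option")),
        function(n) data.frame(NAME=xml_text(n), VALUE=xml_attr(n, "value"))))
}
fishing_method <- getTbl("[fishing_method]")
species <- getTbl("[species][]")

# The hidden CSRF token from the page is sent back with the form
csrf_token <- xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

# Field names match the name attributes of the form inputs (e.g. "search[end_year]"),
# not their ids (e.g. "search_end_year")
fd <- list(
    "search[catchment_no_name]"="",
    "search[river_lake]"="",
    "search[sampling_locality]"="",
    "search[fishing_method]"="",
    "search[species][]"="",
    "search[species][]"=68,            # option value of the species of interest; see `species`
    "search[start_year]"="",
    "search[end_year]"="",
    "search[download_format]"="1",     # csv
    "search[_csrf_token]"=csrf_token
)

# Submit the form to doSearch and parse the returned csv text
r <- POST("https://nzffdms.niwa.co.nz/doSearch", body=fd, encode="form")
read.csv(text=content(r, "text", encoding="UTF-8"))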

output:

   card m    y catchname  catch        locality time  org map    east   north altitude penet fishmeth effort pass spcode abund number minl maxl  nzreach
1  3964 1 1981   Waiau R 797.49       Lake Gunn   NA niwa d41 2122400 5581200      477   225      ang     NA   NA salsal    NA     NA   NA   NA 15006671
2  3965 1 1981   Waiau R 797.49     Lake Fergus   NA niwa d41 2123700 5584400      483   229      ang     NA   NA salsal    NA     NA   NA   NA 15006092
3 15975 1 2003   Waiau R 797.40 Excelsior Creek 1330 niwa d44 2095800 5495800      190    94      efp     80    1 salsal    NA      2  102  105 15030686
4 50772 1 1940   Waiau R 797.49 Upukerora River   NA  unk d43 2098500 5519900      210   146      unk     NA   NA salsal    NA     NA   NA   NA 15020897
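
The `68` in the code above is the `value` attribute of a species `<option>`, so it can be looked up by label instead of being hard-coded. Here is a minimal sketch of that lookup, run against a made-up HTML fragment rather than the live page. The helper `option_value`, the option labels and the value `70` are assumptions for illustration, and the real labels on the site may differ. The last line also lists the `name` attributes, which are what a form submits (not the `id`s).

```r
library(xml2)

# Made-up fragment standing in for the real search page
html <- '
<form>
  <select name="search[species][]">
    <option value="">All species</option>
    <option value="68">Salmo salar</option>
    <option value="70">Another species</option>
  </select>
  <input type="text" name="search[end_year]" id="search_end_year">
  <input type="hidden" name="search[_csrf_token]" value="abc123">
</form>'
doc <- read_html(html)

# Hypothetical helper: value attribute of the option(s) whose label contains `label`
option_value <- function(doc, select_name, label) {
  opts <- xml_find_all(doc, sprintf(".//select[@name='%s']/option", select_name))
  values <- xml_attr(opts, "value")
  values[grepl(label, xml_text(opts), fixed = TRUE)]
}

species_id <- option_value(doc, "search[species][]", "Salmo salar")

# Names of all form fields, in document order
field_names <- xml_attr(xml_find_all(doc, ".//select | .//input"), "name")
```

With the real page, `doc` would be the parsed search page from the answer's code, and `species_id` would replace the literal `68` in `fd`.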
chinsoon12
  • Thank you. Obviously a bit more to it than I originally thought. A couple questions, 1) if robots.txt was not commented out (i.e. it was active), then I would not be able to get the data via R? 2) Re checking the T&C's do some databases have rules not allowing downloading via another program? 3) Where in your code does the "throttling" occur? Or is that only if I want to send many repeated requests? Thank you very much. – flee Oct 03 '19 at 04:01
  • 1) if it was not commented out, you would need to read it to see whether you are allowed to query the site programmatically 2) it depends on the terms and conditions; some websites have strict restrictions against using machines to query them (usually an API is provided for querying data properly) 3) you can insert some `Sys.sleep()` between your multiple calls so that the server is not overloaded. – chinsoon12 Oct 03 '19 at 06:05
  • Thanks a lot for the great answer, I learned a lot. If it's OK, I have a follow-up question, but it would be too long for a comment here, so I posted a new question: https://stackoverflow.com/questions/58219503/difference-between-read-htmlurl-and-read-htmlcontentgeturl-text. I would be very grateful if you could take a look in case you find the time! – Tlatwork Oct 03 '19 at 12:54
  • May I ask if you have a hint for this question: https://stackoverflow.com/questions/58805740/using-r-to-download-data-automatically/58924120#58924120. I tried to solve it with a session-specific request body, cookies and request headers, but the request still does not go through. (Maybe the site owner does not want automated requests; that would also be a good answer.) Here is my best attempt: https://stackoverflow.com/a/58924120/3502164. Maybe you have a small hint? – Tonio Liebrand Nov 18 '19 at 22:58
  • @BigDataScientist Yeah, you are on the right track but missing something in your POST queries. You need to parse out the __VIEWSTATE after the first GET, as it's an aspx page, and pass it in the next POST query. If I am not wrong, you need to send at least 3 requests. – chinsoon12 Nov 18 '19 at 23:07
  • You can also find some of my old answers that do this kind of web scraping by searching for `__VIEWSTATE is:answer user:myID` – chinsoon12 Nov 18 '19 at 23:09
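
The `Sys.sleep()` suggestion in the comments could look roughly like this when several queries are sent in a row. This is only a sketch: `throttled_map` and `fake_fetch` are hypothetical names, the second species value is made up, and `fake_fetch` stands in for a function that would build the form body and call `POST()`.

```r
# Hypothetical helper: call `fetch` once per query, pausing `delay`
# seconds between consecutive calls (no pause after the last one).
throttled_map <- function(queries, fetch, delay = 1) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    results[[i]] <- fetch(queries[[i]])
    if (i < length(queries)) Sys.sleep(delay)
  }
  results
}

# Stand-in for a function that would build `fd` and call POST()
fake_fetch <- function(species_id) paste("requested species", species_id)

out <- throttled_map(c(68, 70), fake_fetch, delay = 0.1)
```

A suitable `delay` depends on the site's terms. For a single request no pause is needed.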