7

I'm trying to scrape table data on a webpage using R (package rvest). To do that, the data needs to be in the HTML source file (that's apparently where rvest looks for it), but in this case it isn't.

However, data elements are shown in the Inspect panel's Elements view:

[Screenshot: the table's elements are shown in the Elements view of the Inspect panel]

Source file shows an empty table:

[Screenshot: view-source shows an empty table]

Why is the data shown in Inspect Element but not in the source file? How can I access the table data in HTML format? If I can't access it through the HTML, how should I change my web scraping strategy?

The web page is https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST

Source file: view-source:https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST


EDIT: A solution using R is appreciated.

Rachel Gallen
David Jorquera
  • https://www.codementor.io/codementorteam/how-to-scrape-an-ajax-website-using-python-qw8fuitvi How to Scrape an AJAX Website using Python – Progs Dec 13 '18 at 20:54
  • Thanks, but I'm looking for an R tool – David Jorquera Dec 13 '18 at 20:57
  • The page URLs you posted do not work: `La funcionalidad Excel dinámico será descontinuada a partir del 31 de Octubre de 2018`. Translation: "The dynamic Excel function will be discontinued October 31, 2018." – Old Pro Dec 14 '18 at 20:53
  • @OldPro I don't know why it turns you away... though you can enter through: https://si3.bcentral.cl/siete/secure/cuadros/arboles.aspx and there select, on the left menu, "Información histórica" -> "Variación mensual". That's the table I want to get. – David Jorquera Dec 15 '18 at 12:57

6 Answers

4

The data is more than likely loaded dynamically from a data source or API. You can scrape the filled table by sending a GET request to the web page and scraping the page after the data has been loaded!

Brady Ward
  • The `GET` request would just load the original page source, _as text_. If the table is being built dynamically it won't be there. You have to also execute the JavaScript on the page to load that data _and_ you have to build the resulting DOM. – Stephen P Dec 13 '18 at 21:10
  • @StephenP I think you are the closest to target, do you know any technique to do that in R? – David Jorquera Dec 13 '18 at 22:29
  • @DavidJorquera - no, I don't know R at all... and I don't know of packages in any language that I _do_ know. You essentially have to build a web browser, minus the rendering and user-controls; you need the HTML parser, DOM builder, and full browser-compliant Javascript environment. – Stephen P Dec 14 '18 at 00:20
  • @DavidJorquera https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/ – Brady Ward Dec 14 '18 at 13:46
  • @DavidJorquera note that the link Brady Ward gave you is about using PhantomJS. Although it was the best option in 2016, PhantomJS support was [abandoned](https://github.com/ariya/phantomjs/issues/15344) in early 2018 (the last supported release is 2.1.1 from January 2016) because by then [headless chrome](https://developers.google.com/web/updates/2017/04/headless-chrome) was a better option with ongoing development and support. So don't use PhantomJS for new projects. – Old Pro Dec 17 '18 at 23:36
3

I really wish 'experts' would stop with the "you need Selenium/Headless Chrome" since it's almost never true and introduces a needless, heavyweight third-party dependency into data science workflows.

The site is an ASP.NET site, so it makes heavy use of sessions, and the programmers behind this particular one force that session to start at the home page. ("Hello, 2000 called and would like their session-state-preserving model back.")

Anyway, we need to start there and progress to your page. Here's what that looks like to your browser:

[Screenshot: browser developer tools showing the sequence of requests the page makes]

We can also see from this that the site returns lovely JSON, so we'll eventually grab that. Let's start modeling an R httr workflow on the session above:

library(xml2)
library(httr)
library(rvest)

Start at the, um, start!

httr::GET(
  url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
  httr::verbose()
) -> res

Now we need to get the HTML from that page, as there are a number of hidden values we need to supply to the POST that follows; that's part of how the brain-dead ASP.NET workflow works (again, follow the requests in the image above):

pg <- httr::content(res)

hinput <- html_nodes(pg, "input")
hinput <- as.list(setNames(html_attr(hinput, "value"), html_attr(hinput, "name")))
hinput$`header$txtBoxBuscador` <- ""
hinput$`__EVENTARGUMENT` <- ""
hinput$`__EVENTTARGET` <- "lnkBut01"

httr::POST(
  url = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx",
  httr::add_headers(
    `Referer` = "https://si3.bcentral.cl/Siete/secure/cuadros/home.aspx"
  ),
  encode = "form",
  body = hinput
) -> res

Now we've done what we need to con the website into thinking we've got a proper session so let's make the request for the JSON content:

httr::GET(
  url = "https://si3.bcentral.cl/siete/secure/cuadros/actions.aspx",
  httr::add_headers(
    `X-Requested-With` = "XMLHttpRequest"
  ),
  query = list(
    Opcion = "1",
    idMenu = "IPC_VAR_MEN1_HIST",
    codCuadro = "IPC_VAR_MEN1_HIST",
    DrDwnAnioDesde = "",
    DrDwnAnioHasta = "",
    DrDwnAnioDiario = "",
    DropDownListFrequency = "",
    DrDwnCalculo = "NONE"
  )
) -> res

And, boom:

str(
  httr::content(res), 1
)

## List of 32
##  $ CodigoCuadro       : chr "IPC_VAR_MEN1_HIST"
##  $ Language           : chr "es-CL"
##  $ DescripcionCuadro  : chr "IPC, IPCX, IPCX1 e IPC SAE, variación mensual, información histórica"
##  $ AnioDesde          : int 1928
##  $ AnioHasta          : int 2018
##  $ FechaInicio        : chr "01-01-2010"
##  $ FechaFin           : chr "01-11-2018"
##  $ ListaFrecuencia    :List of 1
##  $ FrecuenciaDefecto  : NULL
##  $ DrDwnAnioDesde     :List of 3
##  $ DrDwnAnioHasta     :List of 3
##  $ DrDwnAnioDiario    :List of 3
##  $ hsDecimales        :List of 1
##  $ ListaCalculo       :List of 1
##  $ Metadatos          : chr " <img runat=\"server\" ID=\"imgButMetaDatos\" alt=\"Ver metadatos\" src=\"../../Images/lens.gif\" OnClick=\"jav"| __truncated__
##  $ NotasPrincipales   : chr ""
##  $ StatusTextBox      : chr ""
##  $ Grid               :List of 4
##  $ GridColumnNames    :List of 113
##  $ Paginador          : int 15
##  $ allowEmptyColumns  : logi FALSE
##  $ FechaInicioSelected: chr "2010"
##  $ FechaFinSelected   : chr "2018"
##  $ FrecuenciaSelected : chr "MONTHLY"
##  $ CalculoSelected    : chr "NONE"
##  $ AnioDiarioSelected : chr "2010"
##  $ UrlFechaBase       : chr "Indizar_fechaBase.aspx?codCuadro=IPC_VAR_MEN1_HIST"
##  $ FechaBaseCuadro    : chr "Ene 2010"
##  $ IsBoletin          : logi FALSE
##  $ CheckSelected      :List of 4
##  $ lnkButFechaBase    : logi FALSE
##  $ ShowFechaBase      : logi FALSE

Dig around in the JSON for the data you need. I think it's in the Grid… elements.
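
For illustration, here's a hedged sketch of how you might start pulling the table out of that JSON. The inner structure of Grid isn't shown in the str() output above, so the commented-out extraction is an assumption you'll need to adapt after inspecting it:

dat <- httr::content(res)

# The exact shape of these elements isn't shown above, so inspect them first
# and adjust the extraction to whatever str() reveals.
str(dat$Grid, 2)
str(dat$GridColumnNames, 1)

# Sketch only: if Grid turns out to hold a list of row records, something
# along these lines flattens it into a data frame (the element name "Rows"
# is an assumption, not a verified field).
# rows <- dat$Grid$Rows
# do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))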

hrbrmstr
  • While this may work today, this kind of scripting is very fragile, meaning easy to break and likely to break when the site changes. And when it does break, it will be difficult to understand why and difficult to fix unless you are an expert at HTML and AJAX. Scripting a headless browser is much more robust (likely not to break when the site changes) and when it fails, much easier to understand why it fails and fix, because it closely follows the user experience of using the site. – Old Pro Dec 17 '18 at 22:11
  • And Selenium isn't? Seriously? Nice downvote tho. And ASP.NET has worked the same way for ages. Detailed web scraping is fragile. Period. And, dependencies are terrible things. – hrbrmstr Dec 17 '18 at 22:13
  • Navigating a website by having Selenium follow named links in Chrome is about as robust as screen scraping gets, both because sites do not like to confuse users by changing link names and because that's how most automated QA testing is done. And once you've done it once, it is easy to extend to do all the other tables on the site and to any other site you need client-side rendering for. I would much sooner just manually save the page (or the JSON response of the AJAX call) straight from my browser and parse that than use your solution. – Old Pro Dec 17 '18 at 23:21
  • We have a winner! With all downsides taken into account, this just works. – David Jorquera Dec 19 '18 at 22:55
2

Your target is a complex, dynamic website, which is why you cannot easily scrape it. To get to the page I think you are asking about, I have to first go to the home page, then click on "Cuentas Nacionales" on the left menu. That click causes a POST request sending form data apparently indicating the next view to present, which is apparently stored on the server side in a session. This is why you cannot directly access the target URL; it is the same URL for several different displays.

In order to scrape the page, you are going to need to script a browser to go through the steps to get to the page and then save the rendered page to an HTML file, at which point you should be able to use rvest to extract the data from the file. (@hrbrmstr points out that you do not absolutely need to script a browser to get the data, since you do not need to get the data by scraping a rendered page. More on that later.)

At this point in time (December 2018), PhantomJS has been deprecated and the best recommendation is to use headless chrome. To script it sufficiently to navigate through a multi-page site, you use Selenium WebDriver with ChromeDriver to control headless chrome. See this answer for a fully worked out explanation of how to get this working with a Python script. The Selenium documentation includes information for how to use other programming languages, including Java, C#, Ruby, Perl, PHP, and JavaScript, so use whichever language you are comfortable with.

The general outline of the script (with Python snippets) would be

  • Start chrome in headless mode
  • Fetch the home page
  • Wait for the page to fully load. I'm not sure the best way to do that in this case, but probably you can poll the page looking for the table data to be filled in and wait until you find it. See Selenium explicit and implicit waits.
  • Find the link by link text: `link = driver.find_element_by_link_text("Cuentas Nacionales")`
  • Click the link: `link.click()`
  • Again, wait for the page to load
  • Get the HTML using `driver.page_source` and save it to a file.
  • Feed that file into rvest

It looks like it may be possible to do all this from within R using seleniumPipes. See its documentation for how to accomplish the above steps. Use `findElement("link text", "Cuentas Nacionales") %>% elementClick` to find and click the link. Then use `getPageSource()` to get the page source and feed that into rvest or XML or something to find and parse the table.
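
To make that concrete, here is a minimal sketch of those steps in R with seleniumPipes. It assumes a Selenium server with ChromeDriver is already running on localhost:4444 and that the link text still matches the site's menu; treat it as a starting point rather than a tested script:

library(seleniumPipes)
library(rvest)

# Assumption: a Selenium server with ChromeDriver is listening on
# localhost:4444 (started from the selenium-server jar or a Docker image).
remDr <- remoteDr(browserName = "chrome", port = 4444L)

remDr %>% go("https://si3.bcentral.cl/siete/secure/cuadros/home.aspx")
Sys.sleep(5)  # crude wait; an explicit wait on a known element is more robust

# Find the menu link by its text and click it.
remDr %>%
  findElement("link text", "Cuentas Nacionales") %>%
  elementClick()
Sys.sleep(5)

# Grab the rendered DOM; if your version returns a raw string instead of
# parsed HTML, wrap the result in xml2::read_html().
page <- remDr %>% getPageSource()
tables <- page %>% html_nodes("table") %>% html_table(fill = TRUE)

remDr %>% deleteSession()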

Side note: @hrbrmstr points out that instead of scripting a browser to scrape the page, you could manually go through all the steps in the browser and extract the relevant request and response data using the browser's developer tools, so that you can eventually script a set of HTTPS requests and response parsers that will generate a request returning the data you want. Since hrbrmstr has done that for you already, it will be easier for you in this exact instance to cut and paste their answer, but in general I do not recommend that approach, as it is difficult to set up, very likely to break in the future, and difficult to fix when it does break. And for people who don't care about long-term maintainability, since this table only changes monthly, you could even more easily just manually navigate to the page, use the browser to save it to an HTML file, and then load that file into the R script.

Old Pro
  • Thank you for this. Actually from the home, you have to go to "Precios" and then "Información Histórica" and finally "Variación mensual". But surely your explanation applies to that too. I'll try it. – David Jorquera Dec 16 '18 at 00:21
  • You do not need to script a browser to get the data. That is a very inaccurate assertion. – hrbrmstr Dec 17 '18 at 13:54
  • @hrbrmstr OK, "need" is too strong a word because it is absolute, but I meant it in the normal conversational sense of "only practical alternative" and I stand by that wrt scraping data off a web page generated in the browser by JavaScript and AJAX. Furthermore, SO is intended to provide information helpful to a wide audience beyond the original poster, so I want to provide an answer that is widely applicable. If the OP uses seleniumPipes to solve this problem, the OP will find it trivial to extend the solution to other tables on that site or to other sites. That cannot be said of your answer. – Old Pro Dec 17 '18 at 23:06
1

The data is most likely loaded through a JavaScript framework, so the original source is changed by JavaScript.

You would need a tool that can execute the JavaScript and then scrape the result for the data. Or you may be able to call the data API directly and get the results in JSON.

EDIT: I have had some success using Microsoft Power BI to scrape web tables; here is a link to an example in case it works for you: https://www.poweredsolutions.co/2018/05/14/new-web-scraping-experience-in-power-bi-power-query-using-css-selectors/

Richard Hubley
1

As others have stated, the table data is probably loaded dynamically by JavaScript.

  • You can either search through the Network tab in the developer tools and maybe find the request that returns the data you need. Then, instead of analyzing the HTML of the main document, you will probably get some JSON from another URL with parameters (XML/HTML and other formats are possible too). If authorization is needed, you will probably have to recreate all the HTTP request headers as well. (A minimal sketch of this approach follows below.)
  • Or try integrating something like Selenium into your script, which will use a real browser that executes the JS. It's mainly used for testing but should also work for scraping data. There is probably also the option of using a headless browser along the way if opening a new window isn't welcome :) Apparently there is already a library that integrates Selenium with R - good luck :) Scraping with Selenium - R-bloggers
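
As a rough illustration of the first approach, here is a hedged sketch: the URL, query parameter, and header below are placeholders standing in for whatever you copy from the XHR entry in the Network tab, not the real endpoint for this site (hrbrmstr's answer above works out the real requests for this particular page):

library(httr)
library(jsonlite)

# Placeholder request: substitute the URL, query string, and headers you see
# on the XHR entry in the browser's Network tab ("Copy as cURL" helps here).
res <- GET(
  "https://example.com/api/table-data",
  add_headers(`X-Requested-With` = "XMLHttpRequest"),
  query = list(id = "SOME_TABLE_ID")
)

dat <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(dat, 1)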
zworek
1

This is possible to do with rvest because the final iframe uses a standard form. In order to use just rvest you have to leverage a session, a user agent string, and the information you already have collected regarding the direct links to the iframe.

library(rvest)
library(httr)

# Change the user-agent string to trick the website into believing this is a legitimate browser
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

# Load the initial session so you don't get a timeout error
session <- html_session("https://si3.bcentral.cl/siete/secure/cuadros/home.aspx", user_agent(uastring))

session$url

# Go to the page that has the information we want
session <- session %>%
  jump_to("https://si3.bcentral.cl/Siete/secure/cuadros/arboles.aspx")

session$url

# Load only the iframe with the information we want
session <- session %>%
  jump_to("https://si3.bcentral.cl/siete/secure/cuadros/cuadro_dinamico.aspx?idMenu=IPC_VAR_MEN1_HIST&codCuadro=IPC_VAR_MEN1_HIST")

session$url
page_html <- read_html(session)

# Next step would be to change the form using html_form(), set_values(), and submit_form() if needed.
# Then the table is available and ready to scrape.
settings_form <- session %>% 
  html_form() %>%
  .[[1]] 

# Form on home page has no submit button,
# so inject a fake submit button or else rvest cannot submit it.
# When I do this, rvest gives a warning "Submitting with '___'", where "___" is
# often an irrelevant field item.
# This warning might be an rvest (version 0.3.2) bug, but the code works.
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
settings_form[["fields"]][["submit"]] <- fake_submit_button

settings_form <- settings_form %>%
  set_values(DrDwnAnioDesde = "2017",
             DrDwnAnioDiario = "2017")

session2 <- session %>%
  submit_form(settings_form)
Adam Sampson
  • Thinking about this, I realized you might only be able to submit one drop-down option at a time. In that case you would set one option and submit, then set another option and submit. But I'm not sure whether that is required here. – Adam Sampson Dec 17 '18 at 15:54
  • This won't render the table b/c said rendering is done with JavaScript. – hrbrmstr Dec 17 '18 at 22:14