
I'm trying to webscrape this page: https://projects.worldbank.org/en/projects-operations/project-detail/P171821

In particular, I want to extract the Associated Projects id (in this case it is P145196).

I'm using the rvest package.

I've tried the following code, but it returns an empty vector.

library(rvest)

simple <- read_html("https://projects.worldbank.org/en/projects-operations/project-detail/P171821") %>%
      html_nodes(".ng-tns-c2-0") %>%
      html_text()

> simple
character(0)

What am I doing incorrectly?

Ken Lee
  • Click "View source" in your browser. Do you see ng-tns-c2-0 there? I don't. – Sirius May 04 '21 at 17:56
  • No, I don't either. I used Developer Tools (Ctrl+shift+I) to inspect the item. This is how I found the node. – Ken Lee May 04 '21 at 18:00
  • Weird; I've tried searching the view source, but can't find most of the contents of the page. How is that possible? – Ken Lee May 04 '21 at 18:05
  • `rvest` can only read the page source. What you see in your browser was probably generated with JavaScript, and `rvest` cannot run JavaScript. If you need to interact with a page that uses JavaScript, you'll need something like RSelenium. – MrFlick May 04 '21 at 18:05
  • The site runs JS to fetch the data that populates the page. You need to either make R run JS on that page (hard), or figure out what is going on and reverse engineer the process (likely easier). Tip: watch the Network tab in Developer Tools. – Sirius May 04 '21 at 18:05

1 Answer


It turns out the data comes from this JSON URL:


library(jsonlite)
json_data <- fromJSON("https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=P171821&apilang=en")
json_data$projects[["P171821"]]$parentprojid

Output:


> json_data$projects[["P171821"]]$parentprojid
[1] "P145196"

You can easily change the id parameter in that URL to query other project IDs.
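As a rough sketch of how that could look for several project IDs: the helper names `project_url` and `get_parent_id`, and the `fetch` argument, are my own and only for illustration. I'm also assuming the response always has the same shape as the one above, with `parentprojid` stored under `projects[[id]]`.

```r
# Hypothetical helper: build the API URL for one project id.
project_url <- function(id) {
  paste0(
    "https://search.worldbank.org/api/v2/projects?format=json&fl=*&id=",
    id,
    "&apilang=en"
  )
}

# Hypothetical helper: return the parent project id, or NA if there is none.
# `fetch` defaults to jsonlite::fromJSON; it is an argument only so that the
# download step can be swapped out (e.g. for testing without network access).
get_parent_id <- function(id, fetch = jsonlite::fromJSON) {
  json_data <- fetch(project_url(id))
  parent <- json_data$projects[[id]]$parentprojid
  if (is.null(parent)) NA_character_ else as.character(parent)[1]
}

# Example usage (makes one HTTP request per id):
# ids <- c("P171821")
# parents <- vapply(ids, get_parent_id, character(1))
```

Projects without a `parentprojid` field come back as `NA` rather than raising an error, which should make the result easier to store in a column.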

Sirius
  • That's brilliant, thanks. I'm a bit new to R and am wondering how to automate this. Suppose I have a data frame where column A has project IDs and column B has the URL for the corresponding project. I would like to create a column C containing the scraped project ID of the parent project. How could I do this? I will include a sample data set in my original post. Could you have a look? – Ken Lee May 04 '21 at 18:18
  • You should start by asking this in a new question. – Sirius May 04 '21 at 18:20
  • I've asked a new question (https://stackoverflow.com/questions/67391080/how-to-automate-webscraping). Could you please have a look? – Ken Lee May 04 '21 at 19:57
  • I saw it, prepared the answer, but it got deleted – Sirius May 04 '21 at 21:14
  • My apologies... I undeleted the question. Could you still post the answer? I thought that it received no attention... – Ken Lee May 04 '21 at 21:17