Webscraping with loop

Question

I'm trying to scrape some text from a website using a for loop, but the loop doesn't move on to the next item in my vector. I'd appreciate any helpful advice. Thanks.

library(rvest)
library(xml2)


ID <- c(1:2)
Land <- c('Afghanistan','Ägypten')
url <- c('afghanistan', 'aegypten') 
Text <- (NA)

data <- data.frame(ID, Land, Text)

for(i in url) {
  nam <- paste("https://www.reporter-ohne-grenzen.de", i, sep = "/")
  assign(nam, i)

  webpage <- read_html(paste0(nam, i))
  data$Text <- i <- webpage %>% html_nodes('div.text') %>% .[[1]] %>% html_text() 
}

Hmm, not sure if I made my problem clear. Here's an example of my desired data output.

library(rvest)
library(xml2)

ID <- c(1:2)
Land <- c('Afghanistan','Ägypten')
url <- c('afghanistan', 'aegypten') 
Text <- (NA)

data <- data.frame(ID, Land, Text)


afghanistan <- 'https://www.reporter-ohne-grenzen.de/afghanistan'
afghanistan <- read_html(afghanistan)
afghanistan <- html_nodes(afghanistan,'div.text')
afghanistan <- html_text(afghanistan)[[1]]

aegypten <- 'https://www.reporter-ohne-grenzen.de/aegypten'
aegypten <- read_html(aegypten)
aegypten <- html_nodes(aegypten,'div.text')
aegypten <- html_text(aegypten)[[1]]

# desired data output
data$Text <- c(afghanistan, aegypten)

I don't want to repeat these lines for 180 countries.

aegypten <- 'https://www.reporter-ohne-grenzen.de/aegypten'
aegypten <- read_html(aegypten)
aegypten <- html_nodes(aegypten,'div.text')
aegypten <- html_text(aegypten)[[1]]

Here's the solution:

library(rvest)
library(xml2)

ID <- c(1:4) 
Land <- c('Afghanistan','Ägypten','Deutschland','Italien')
Url <- c('afghanistan', 'aegypten','deutschland','italien') 
Text <- NA

data <- data.frame(ID, Land, Text)
website <- 'https://www.reporter-ohne-grenzen.de'

for (i in ID) {
  country <- Url[i]

  html_url <- paste(website,country,sep='/')
  output <- read_html(html_url)
  output <- html_nodes(output,'div.text')
  output <- html_text(output)[[1]]

  data$Text[i] <- output
}
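
For anyone comparing this with the loop in the question, the sketch below isolates the two differences that seem to matter: the original pasted the country name onto the address twice, and it assigned to the whole `Text` column on every pass instead of to one row. It runs without network access because `fake_scrape` is a made-up stand-in for the `read_html` / `html_nodes` / `html_text` step; it is not part of rvest.

```r
# Stand-in for the scraping step so the sketch runs offline
# (fake_scrape is a hypothetical name, not an rvest function)
fake_scrape <- function(country) paste("text for", country)

url <- c("afghanistan", "aegypten")

# Whole-column assignment, as in the question: every pass overwrites all rows
overwritten <- data.frame(ID = 1:2, Text = NA_character_, stringsAsFactors = FALSE)
for (i in url) {
  overwritten$Text <- fake_scrape(i)
}

# Indexed assignment, as in the solution: each pass fills only its own row
indexed <- data.frame(ID = 1:2, Text = NA_character_, stringsAsFactors = FALSE)
for (i in seq_along(url)) {
  indexed$Text[i] <- fake_scrape(url[i])
}

# The address built by paste0(nam, i) in the question has the country name doubled
nam <- paste("https://www.reporter-ohne-grenzen.de", url[1], sep = "/")
doubled <- paste0(nam, url[1])
```

After the first loop both rows of `overwritten$Text` hold the result for the last country, while `indexed$Text` keeps one result per row.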

Why are you using `assign`? The variable `nam` should be sufficient for `read_html`, so why add the `i` again? I would suggest using `sapply` or `lapply`, because then you get the results in a list and can just unlist it and create a data.frame from that. - hannes101, Nov 18 '19 at 10:28
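
A minimal sketch of what that `lapply` suggestion could look like, assuming the same site and the `div.text` selector from the question. The names `scrape_text`, `scrape_countries` and the `reader` argument are made up for this illustration; `reader` defaults to `read_html` and is only there so the function can be tried without hitting the website.

```r
library(rvest)

website <- "https://www.reporter-ohne-grenzen.de"

# Hypothetical helper: fetch one country page and return the text of its
# first div.text node. The address is built once, without assign().
scrape_text <- function(country, reader = read_html) {
  page <- reader(paste(website, country, sep = "/"))
  html_text(html_nodes(page, "div.text")[[1]])
}

# Hypothetical wrapper: collect one result per country in a list,
# then unlist it into the Text column of a data.frame
scrape_countries <- function(land, url, reader = read_html) {
  texts <- lapply(url, scrape_text, reader = reader)
  data.frame(ID = seq_along(url), Land = land, Text = unlist(texts),
             stringsAsFactors = FALSE)
}
```

With the vectors from the question this would be called as `data <- scrape_countries(Land, url)`. Like the original code, it stops with an error if a page has no `div.text` node.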

FilipW · Answer 1 · 2019-11-18T12:50:27.083

Even though for loops can be really handy, in R you usually handle iteration by writing a function and then applying it to each element.

For this example we can put the body of your for loop in a function and then use purrr's map(), or in this case its variant map_chr(), inside dplyr's mutate() to store the text result in a column.

library(rvest)
#> Loading required package: xml2
library(xml2)
library(tidyverse)

ID <- c(1:2)
Land <- c('Afghanistan','Ägypten')
url <- c('afghanistan', 'aegypten') 
Text <- (NA)

data <- data.frame(ID, Land, url, Text)

read_country <- function(country_url) {
  # Build the full address for one country page
  nam <- paste0("https://www.reporter-ohne-grenzen.de/", country_url)
  webpage <- read_html(nam)

  # Return the text of the first div.text node
  webpage %>% html_nodes('div.text') %>% .[[1]] %>% html_text()
}

data <- data %>% 
    mutate(Text = map_chr(url, read_country))

Created on 2019-11-18 by the reprex package (v0.3.0)

Ah, I didn't know that. I just thought it made it easier to read what the function returns. Actually it does work (I use the `reprex` package), since the country names can be used in the URL, which then redirects to the correct address; for instance `reporter-ohne-grenzen.de/Ägypten` becomes `reporter-ohne-grenzen.de/aegypten`. But I have changed it so that it is more explicit. - FilipW, Nov 18 '19 at 12:32

score 0 · Answer 2 · answered Nov 19 '19 at 01:44

Using purrr functions along with rvest, we can do:

library(purrr)
library(rvest)

data$Text <- map(paste0("https://www.reporter-ohne-grenzen.de/", url),
             ~.x %>% 
                read_html %>% 
                html_nodes('div.text') %>%
                html_text %>% .[[1]]) %>% flatten_chr()

Data:

ID <- c(1:2)
Land <- c('Afghanistan','Ägypten')
url <- c('afghanistan', 'aegypten') 
Text <- (NA)
data <- data.frame(ID, Land, Text)
