I need to collect the links from 3 pages, 150 links in total (50 per page), using R with the rvest library. I used a for loop to crawl through the pages. I know it's a very basic question that has been answered elsewhere ("R web scraping across multiple pages", "Scrape and Loop with Rvest"). I tried different versions of the following code. Most of them ran, but returned only 50 links instead of 150.

library(rvest)

baseurl <- "https://www.ebay.co.uk/sch/i.html?_from=R40&_nkw=chain+and+sprocket&_sacat=0&_pgn="
n <- 1:3
nextpages <- paste0(baseurl, n)

for(i in nextpages){
  html <- read_html(nextpages)
  links <- html %>% html_nodes("a.vip") %>% html_attr("href")
}

The code is expected to return all 150 links, not just 50.

Sneaky

2 Answers

You're overwriting the links variable on every iteration, so you would end up with only the last 50 links.

Also, you're looping over i, but read_html() is called with nextpages, which is a vector of 3 URLs rather than a single URL. You should be getting an error.

Try this:

links <- c()
for(i in nextpages){
  html <- read_html(i)
  links <- c(links, html %>% html_nodes("a.vip") %>% html_attr("href"))
}
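
If growing a vector with c() inside the loop feels awkward, a variant is to keep one list slot per page and combine at the end. The following is only a minimal sketch: the `pages` vector holds hypothetical inline HTML strings standing in for the real result pages (so it runs without network access), and the example.com URLs are made up for illustration.

```r
library(rvest)

# Hypothetical stand-ins for the three result pages (inline HTML, not real eBay output)
pages <- c(
  '<html><body><a class="vip" href="https://example.com/item/1">a</a><a class="vip" href="https://example.com/item/2">b</a></body></html>',
  '<html><body><a class="vip" href="https://example.com/item/3">c</a><a class="vip" href="https://example.com/item/4">d</a></body></html>',
  '<html><body><a class="vip" href="https://example.com/item/5">e</a><a class="vip" href="https://example.com/item/6">f</a></body></html>'
)

# One list slot per page; filling by index means no page overwrites another
links_by_page <- vector("list", length(pages))
for (k in seq_along(pages)) {
  html <- read_html(pages[k])  # a single page per call, never the whole vector
  links_by_page[[k]] <- html %>% html_nodes("a.vip") %>% html_attr("href")
}

# Combine the per-page character vectors into one character vector
links <- unlist(links_by_page)
length(links)
```

With the real URLs you would loop over nextpages in the same way; the per-page list also makes it easy to check how many links each page returned.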
user2474226

We can use map instead of a for loop.

library(rvest)
library(purrr)

map(nextpages, . %>% read_html %>%
               html_nodes("a.vip") %>% 
               html_attr("href")) %>% flatten_chr()

#[1] "https://www.ebay.co.uk/itm/Genuine-Honda-Chain-and-sprocket-set-Honda-Cub-C50-C70-C90-Heavy-Duty/254287014069?hash=item3b34afe8b5:g:wjEAAOSwqaBdH69W"         
#[2] "https://www.ebay.co.uk/itm/DID-Heavy-Duty-Drive-Chain-And-JT-Sprocket-Kit-For-Honda-MSX125-Grom-2013-2019/223130604262?hash=item33f39ed2e6:g:QmwAAOSwdrpcAQ4c"
#.....
#...
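
To see what flatten_chr() contributes here, it may help to look at the shape of the intermediate result. This is a small offline sketch: `per_page` is a hypothetical list built with made-up example.com URLs, shaped like what map() returns for three pages. Note that recent purrr releases mark the flatten family as superseded, so treat this as an illustration of the behavior rather than a recommendation.

```r
library(purrr)

# Hypothetical per-page results: map() returns a list with one character vector per page
per_page <- map(1:3, function(p) paste0("https://example.com/page", p, "/item", 1:2))

str(per_page)                  # a list of 3 character vectors, 2 URLs each

# flatten_chr() removes one level of nesting and returns a plain character vector
flat <- flatten_chr(per_page)
length(flat)

# Base R's unlist() gives the same result for this simple, unnamed case
identical(flat, unlist(per_page))
```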
Ronak Shah
  • Nice. + So, looking up flatten_chr in https://purrr.tidyverse.org/reference/flatten.html, it seems that a list 3 levels deep ends up 2 levels deep, one 2 levels deep ends up at 1, and so on? How were you using it here? – QHarr Oct 08 '19 at 20:19
  • @QHarr `map` is like `lapply`: it returns a list, and `flatten_chr` turns that list into a character vector. Compare the output of `lapply(1:10, sqrt)` with that of `lapply(1:10, sqrt) %>% flatten_dbl()` – Ronak Shah Oct 09 '19 at 00:42