
Good evening everyone,

I am currently trying to scrape the Zalando website to get the name of every product that appears on the first two pages of the following URL: (https://www.zalando.nl/damesschoenen-sneakers/)

Here is my code:

library(rvest)
library(dplyr)

# Parse the page and extract the brand names via their CSS class
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
output <- html_nodes(x = url, css = selector_name) %>% html_text()

The result is a list of 24 items, while there are 86 products on the page. Has anyone encountered this issue before? Any idea how to solve it? Thank you for your help.

Thomas

  • Looks like the page loads 24 items when you go to it, and more load as you scroll. – Gregor Thomas Jan 29 '18 at 20:00
  • First, accessing webpages through different user agents will often yield different layouts. Second, as Gregor states, it looks very much like this is a JavaScript-based layout, which will not load easily through rvest. Check what you got by writing your url object to disk and loading it into a browser, i.e. write_html(url, file = "test_url.html") – Nicolás Velásquez Jan 29 '18 at 21:14
  • @Gregor Thank you for the comment. Any idea how to get around this issue? – Thomas AMET Jan 30 '18 at 10:50
  • @NicolásVelásquez, I just replied to your comment. – Thomas AMET Jan 30 '18 at 10:56

2 Answers


I just tried what Nicolás Velásquez suggested:

# Save the fetched page to disk, then re-read it and apply the same selector
url <- read_html('https://www.zalando.nl/damesschoenen-sneakers/')
write_html(url, file = "test_url.html")
selector_name <- '.z-nvg-cognac_brandName-2XZRz'
test_file <- read_html("test_url.html")
output <- html_nodes(x = test_file, css = selector_name) %>% html_text()

The results are the same: still only 24 items show up. If anyone has a solution, it would be very much appreciated.

  • Perhaps I did not make myself clear. What I suggested was for you to check in a browser the number of shoes present in the saved HTML ("test_url.html"). If that number is just 24, it means you will need more than plain rvest to do the web scraping you are looking into. Look into this thread for more insights: https://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page – Nicolás Velásquez Jan 30 '18 at 14:41
  • @NicolásVelásquez, I also checked the number of items in the HTML file and they all appear. I also checked the page with the 'inspect' function of my browser and all the products are embedded under the same CSS class, so I really don't know what the issue is. I thought it might be that read_html doesn't let the webpage load entirely.. if that is the case, how can I control the loading time? – Thomas AMET Jan 30 '18 at 16:19
  • Looking into the "test_url.html"'s code I found a JavaScript just after the 24th node defined by css = '.z-nvg-cognac_brandName-2XZRz'. Look at lines 284 (Superga) and 691 (…) – Nicolás Velásquez Jan 30 '18 at 20:30
  • To execute those JavaScripts you will need either to learn how to navigate through Selenium or a WebDriver, or to load the whole page before saving it (i.e. iMacros or headless Chrome). Again, I'd bet that this SO thread is a good place for you to start: https://stackoverflow.com/questions/29861117/r-rvest-scraping-a-dynamic-ecommerce-page – Nicolás Velásquez Jan 30 '18 at 20:33
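For reference, the Selenium route mentioned in the comments could be sketched with RSelenium roughly as follows. This is only a sketch, not a tested solution: it assumes a Selenium server is already running on localhost:4445 (e.g. via Docker), and the scroll count and sleep time are guesses that may need tuning.

```r
library(RSelenium)
library(rvest)

# Connect to an already-running Selenium server (assumed on localhost:4445)
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("https://www.zalando.nl/damesschoenen-sneakers/")

# Scroll down a few times so the lazily loaded products are added to the DOM
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)  # give the page time to load the next batch of products
}

# Hand the fully rendered HTML over to rvest and reuse the same selector
page <- read_html(remDr$getPageSource()[[1]])
brands <- html_nodes(page, ".z-nvg-cognac_brandName-2XZRz") %>% html_text()

remDr$close()
```

The key point is that read_html only fetches the static HTML the server sends, whereas a WebDriver actually runs the page's JavaScript, so the scrolled-in products exist in the DOM before rvest sees it.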

Thank you for your kind answer. I will dive into that direction. I also found a way to get the name of the brand without RSelenium; here is my code:

library('httr')
library('magrittr')
library('rvest')

################# FUNCTION #################
# Extract the substring between each pair of positions, replace
# non-word characters with spaces, then collapse the whitespace
extract_data <- function(firstPosition, lastPosition){
  mapply(function(first, last){
    substr(pageContent, first, last) %>%
      gsub("\\W", " ", .) %>%
      gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)
  },
  firstPosition, lastPosition)
}
############################################

url <- 'https://www.zalando.nl/damesschoenen-sneakers/'
page <- GET(url)
pageContent <- content(page, as='text')

# Get the brand name of the products: each name sits between the
# 'brand_name' and 'is_premium' keys in the JSON embedded in the page
firstPosition <- unlist(gregexpr('brand_name', pageContent)) + nchar('brand_name') + 1
lastPosition <- unlist(gregexpr('is_premium', pageContent)) - 2

extract_data(firstPosition, lastPosition)

Unfortunately it starts getting difficult when you want something other than the brand name, so maybe the best solution is to do it with RSelenium.
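The position arithmetic above can be sanity-checked offline on a fabricated snippet that mimics the embedded JSON (the snippet below is made up for illustration, not Zalando's actual payload):

```r
library(magrittr)

# Fabricated stand-in for the page source (NOT the real Zalando payload)
pageContent <- '{"brand_name":"Nike","is_premium":false},{"brand_name":"Adidas Originals","is_premium":true}'

# Start just after each 'brand_name' key, stop just before each 'is_premium' key
firstPosition <- unlist(gregexpr('brand_name', pageContent)) + nchar('brand_name') + 1
lastPosition <- unlist(gregexpr('is_premium', pageContent)) - 2

# Same cleanup as extract_data(): strip non-word characters, squeeze spaces
brands <- mapply(function(first, last){
  substr(pageContent, first, last) %>%
    gsub("\\W", " ", .) %>%
    gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)
}, firstPosition, lastPosition)

brands
# returns c("Nike", "Adidas Originals")
```

This makes the fragility visible: the extraction only works while the page keeps 'brand_name' and 'is_premium' adjacent in its embedded JSON, which is why parsing the JSON properly (or using RSelenium) is more robust for other fields.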