
I am scraping Amazon customer reviews using R and have run into a bug that I hope someone can shed some light on.

R fails to scrape the specified node (found using SelectorGadget) from every review. Each run returns a different number of reviews, but never all of them. For example, if a product has 200 reviews, one run might return 150 and another 75, but never the full 200. This is frustrating because the goal is to compile the reviews into CSV files that I can later manipulate in R. The problem seems to appear after I have scraped repeatedly.

I have also gotten a few timeout errors, specifically "Error in open.connection(x, "rb") : Timeout was reached".

How do I get around this so I can continue scraping? I am a beginner, so any help or insight is greatly appreciated!

library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_show_all?ie=UTF8&reviewerType=all_reviews&pageNumber="

N_pages <- 204
A <- NULL
for (j in 1:N_pages) {
  pant <- read_html(paste0(url, j))  # fetch review page j
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())  # one row per review
  A <- rbind(A, B)
}
tail(A)

print(j)  # last page number the loop reached
PugFanatic

1 Answer


Does this not work for you? I set the URL as follows:

library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

url <- "https://www.amazon.com/Match-Mens-Wild-Cargo-Pants/product-reviews/B009HLOZ9U/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="

N_pages <- 204
A <- NULL
for (j in 1:N_pages) {
  pant <- read_html(paste0(url, j))  # fetch review page j
  B <- cbind(pant %>% html_nodes(".review-text") %>% html_text())  # one row per review
  A <- rbind(A, B)
}
tail(A)
        [,1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
[1938,] "This is really a good item to get. Trendy, probably you can choose a different color, it fits good but I wouldn't say perfect."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
[1939,] "I don't write reviews for most products, but I felt the need to do so for these pants for a couple reasons.  First, they are great pants!  Solid material, well-made, and they fit great.  Second, I want to echo those who say you need to go up in size when you order.  I wear anywhere from 32-34, depending on the brand.  I ordered these in a 36 and they fit like a 33 or 34.  I really love the look and feel of these, and will be ordering more!"                                                                                                                                                            
[1940,] "I bought the green one before, it is good quality and looks nice, than I purchased the similar one, but the  khaki color, but received absolutely different product, different material. really disappointed."                                                                                                                                                                                                                                                                                                                                                                                                          
[1941,] "These pants are great!  I have been looking to update my wardrobe with a more edgy style; these cargo pants deliver on that.  Paired with some casual sneakers or a decent nubuck leather boot completes the look from the waist down.  The lazy-casual look is great when traveling, as are the many pockets.  I wore these pants on a recent day trip to NYC and traveled comfortably with essential items contained in the 8 pockets.  I placed a second order shortly after my first pair arrived because I like them so much.  Shipping and delivery is also fairly fast, considering these pants ship from China!"
[1942,] "Pants are awesome, just like the picture. The size runs small, so if you order them I would order them bigger than normal. I usually wear a 34inch waist because i dont like my pants snug, these pants fit more like a 32 inch waist.Other than that i love them!"                                                                                                                                                                                                                                                                                                                                                     
[1943,] "the good:Pants are made from the durable cotton that has a nice feel; have a lot of useful features and roomy well placed pockets; durable stitching.the bad:Pants will shrink and drier/hot water is not recommended. Would have been better if the cotton was pretreated to prevent shrinking. I would gladly gave up the belt if I wouldn't have to wary about how to wash these pants.the ugly:faux pocket with a zipper. useless feature. on my pair came with a bright gold zipper, unlike a silver in a picture." 
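
As a rough sanity check on N_pages, the page count can be derived from a review total instead of being hard-coded. This is only a sketch: the 10 reviews per page figure is an assumption (it is consistent with 204 pages for this product, but I have not confirmed it), and `pages_needed` is a hypothetical helper name.

```r
# Assumption: each review page lists up to 10 reviews.
reviews_per_page <- 10

# Hypothetical helper: number of pages needed to hold `total_reviews`.
pages_needed <- function(total_reviews, per_page = reviews_per_page) {
  ceiling(total_reviews / per_page)
}

pages_needed(1943)  # 195 pages for the 1943 rows shown above
pages_needed(2038)  # 204 pages for a product with 2038 reviews
```

If that assumption holds, a loop that requests more pages than `pages_needed()` returns would simply get pages with no ".review-text" nodes.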
ZLevine
  • Thank you so much for your input, I seriously appreciate it!! However, this product appears to have 2038 total reviews, whereas your code yields 1943 reviews (even if it was just from the second page on there seems to be a deficit of ~100 reviews)? – PugFanatic Mar 07 '17 at 19:17
  • Ah, I only scraped verified reviews! If you want all reviews, change the reviewer type in the URL, e.g. "reviewerType=all_reviews" instead of "reviewerType=avp_only_reviews". – ZLevine Mar 07 '17 at 19:25
  • Oh ok! Amazing that the error is as simple as that! Sometimes I had been scraping from just the second page on and was still getting this incomplete scraping error, because I do remember learning to try using the second pages (but was never sure of why). Also, I tried playing around with the number of pages scraped, and it seems like if I increase that number to beyond the number of pages that contain reviews, it seems to work? Again, thank you so much for taking the time to help me!! I have been struggling with this! – PugFanatic Mar 07 '17 at 19:31
  • It's hard to assess the problem without being able to reproduce it. It doesn't make sense to me that arbitrarily increasing N_pages would be a useful thing though. My best idea is to scrape the value of N_pages instead of resetting it every time. That way you'd be positive your process is the same every time – ZLevine Mar 07 '17 at 20:01
  • Do you think it might be amazon actively trying to prevent scraping? – PugFanatic Mar 07 '17 at 20:59
  • I'm not sure, have you looked here http://stackoverflow.com/questions/36043172/package-rvest-for-web-scraping-https-site-with-proxy/38463559#38463559 or here http://stackoverflow.com/questions/33295686/rvest-error-in-open-connectionx-rb-timeout-was-reached? – ZLevine Mar 07 '17 at 21:49
  • I will look into those answers- @ZLevine thank you so much for all your help again!!! – PugFanatic Mar 09 '17 at 15:42
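
For the "Timeout was reached" errors and the varying review counts discussed above, one possible approach is to retry failed requests and to record which pages came back empty, so they can be fetched again later. The following is a minimal sketch, not a confirmed fix: `read_with_retry` and `scrape_reviews` are hypothetical helpers, the retry count and pause lengths are guesses, and whether Amazon throttles repeated requests is not established in this thread.

```r
# Hypothetical helper: try fetch(page_url) up to max_tries times, waiting
# longer after each failure. Returns NULL if every attempt fails.
read_with_retry <- function(page_url, fetch = rvest::read_html,
                            max_tries = 3, pause = 2) {
  for (attempt in seq_len(max_tries)) {
    page <- tryCatch(fetch(page_url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    if (attempt < max_tries) Sys.sleep(pause * attempt)  # linear backoff
  }
  NULL
}

# Hypothetical helper: scrape pages 1..n_pages and record the pages that
# failed or returned no ".review-text" nodes, so they can be retried.
scrape_reviews <- function(base_url, n_pages, ...) {
  reviews <- character(0)
  missed <- integer(0)
  for (j in seq_len(n_pages)) {
    page <- read_with_retry(paste0(base_url, j), ...)
    text <- if (is.null(page)) character(0) else
      rvest::html_text(rvest::html_nodes(page, ".review-text"))
    if (length(text) == 0) missed <- c(missed, j)
    reviews <- c(reviews, text)
    Sys.sleep(1)  # pause between pages; the right delay is a guess
  }
  list(reviews = reviews, missed = missed)
}
```

With the `url` and `N_pages` from the question, `out <- scrape_reviews(url, N_pages)` would give the review text in `out$reviews` and the page numbers worth re-requesting in `out$missed`. Comparing `length(out$reviews)` across runs should then show whether the shortfall comes from failed pages or from the URL parameters.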