
I am trying to scrape data from Yelp. One step is to extract the link for each restaurant. For example, when I search for restaurants in NYC and get some results, I want to extract the links of all 10 restaurants Yelp recommends on page 1. Here is what I have tried:

library(rvest)     
page=read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name span") %>% html_attr('href')

But the code always returns 'NA'. Can anyone help me with that? Thanks!

ulfelder
Allen

2 Answers

library(rvest)     
page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name") %>% html_attr('href')

The `href` attribute lives on the `.biz-name` anchor itself, not on the `<span>` inside it, so select the anchor. Hope this solves your problem.
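To see why the selector change matters without hitting Yelp at all, here is a small offline sketch using `rvest::minimal_html()` with made-up markup that mimics the `.biz-name` structure:

```r
library(rvest)

# Hypothetical stand-in for Yelp's markup: the href sits on the
# <a class="biz-name"> element, not on the <span> nested inside it.
doc <- minimal_html(
  '<a class="biz-name" href="/biz/some-restaurant"><span>Some Restaurant</span></a>'
)

# Selecting the inner span finds an element with no href attribute -> NA
doc %>% html_nodes(".biz-name span") %>% html_attr("href")

# Selecting the anchor itself returns the link
doc %>% html_nodes(".biz-name") %>% html_attr("href")
```

The first pipe reproduces the `NA` from the question; the second returns `"/biz/some-restaurant"`.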

Kim
Bharath

I was also able to clean up the results from above, which for me were quite noisy:

links <- page %>% html_nodes("a") %>% html_attr("href")

with some simple regex string matching:

links <- links[which(regexpr('common-url-element', links) >= 1)]
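As a self-contained illustration (using a made-up link vector and `"biz"` as a stand-in for the `'common-url-element'` placeholder above), base R's `grepl()` does the same filtering in one step:

```r
# Toy vector standing in for the noisy scrape results
links <- c("/biz/some-restaurant", "/advertise", "/biz/another-spot", "#")

# Keep only the entries whose text matches the pattern
biz_links <- links[grepl("biz", links)]
biz_links
```

`grepl()` returns a logical vector directly, so there is no need for the `which(regexpr(...) >= 1)` construction.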

oliver
    Or, if you want to do this in the `tidyverse`, you can just add `%>% str_subset("your regex here")` to the end of that pipe. – ulfelder Aug 28 '20 at 18:19