
I am trying to scrape data from Yelp. One step is to extract the link for each restaurant. For example, when I search for restaurants in NYC and get some results, I want to extract the links of all 10 restaurants Yelp recommends on page 1. Here is what I have tried:

library(rvest)     
page=read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name span") %>% html_attr('href')

But the code always returns 'NA'. Can anyone help me with that? Thanks!

ulfelder
Allen

2 Answers

library(rvest)     
page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name") %>% html_attr('href')

The `href` attribute lives on the `.biz-name` anchor itself, not on the `<span>` inside it, so select the anchor. Hope this solves your problem.
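To see why the selector change matters without hitting Yelp at all, here is a small offline sketch using `rvest::minimal_html()` with made-up markup that mimics the `.biz-name` structure:

```r
library(rvest)

# Hypothetical stand-in for Yelp's markup: the href sits on the
# <a class="biz-name"> element, not on the <span> nested inside it.
doc <- minimal_html(
  '<a class="biz-name" href="/biz/some-restaurant"><span>Some Restaurant</span></a>'
)

# Selecting the inner span finds an element with no href attribute -> NA
doc %>% html_nodes(".biz-name span") %>% html_attr("href")

# Selecting the anchor itself returns the link
doc %>% html_nodes(".biz-name") %>% html_attr("href")
```

The first pipe reproduces the `NA` from the question; the second returns `"/biz/some-restaurant"`.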

Kim
Bharath

I was also able to clean up the results from above, which for me were quite noisy:

links <- page %>% html_nodes("a") %>% html_attr("href")

with some simple regex string matching:

links <- links[which(regexpr('common-url-element', links) >= 1)]
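As a self-contained illustration (using a made-up link vector and `"biz"` as a stand-in for the `'common-url-element'` placeholder above), base R's `grepl()` does the same filtering in one step:

```r
# Toy vector standing in for the noisy scrape results
links <- c("/biz/some-restaurant", "/advertise", "/biz/another-spot", "#")

# Keep only the entries whose text matches the pattern
biz_links <- links[grepl("biz", links)]
biz_links
```

`grepl()` returns a logical vector directly, so there is no need for the `which(regexpr(...) >= 1)` construction.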

oliver
    Or, if you want to do this in the `tidyverse`, you can just add `%>% str_subset("your regex here")` to the end of that pipe. – ulfelder Aug 28 '20 at 18:19