15

I'm trying to scrape the content from http://google.com, but an error message comes out.

library(rvest)  
html("http://google.com")

Error in open.connection(x, "rb") : Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Since I'm on a company network, this may be caused by a firewall or proxy. I tried to use set_config, but it's not working.
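For context, a set_config proxy attempt typically looks like the sketch below (the proxy host and port are placeholders for your own network's values). Note that set_config() configures httr, while read_html() fetches URLs through the curl package, which may be why it has no effect here.

library(httr)
# Hypothetical proxy host and port; substitute your company's values
set_config(use_proxy(url = "http://proxy.mycompany.com", port = 8080))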

  • have you also tried the `read_html` command, since the error message says `html` is deprecated? This might not solve your problem, but maybe the output is more helpful... – drmariod Oct 23 '15 at 06:02
  • Yes, the message is: Error in open.connection(x, "rb") : Timeout was reached In addition: Warning message: closing unused connection 3 (http://google.com) – user3267649 Oct 26 '15 at 07:42
  • Actually, this code works fine on my home network, but when I try to use it on the company network, the error comes up. – user3267649 Oct 26 '15 at 07:45
  • Seems not reproducible as a code issue, this returns a result for me. If you figured out what was going on with the network and how to work around it you could post that answer. – Sam Firke Nov 09 '15 at 16:16
  • Same issue for me; apparently from the network I am using, Google asks for proof of not being a bot, and the page of course times out when the scraper runs. – Dambo Jul 17 '17 at 14:14

5 Answers

33

I encountered the same `Error in open.connection(x, "rb") : Timeout was reached` issue when working behind a proxy on the office network.

Here's what worked for me:

library(rvest)

url <- "http://google.com"
# Fetch the page with download.file(), which can pick up system proxy
# settings, then parse the saved local copy
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")

Credit: https://stackoverflow.com/a/38463559

user799188
  • That worked for me as well... In my case I found a more permanent solution to be setting the proxy environment variables. Here are the steps: https://stackoverflow.com/a/60100844/1839959 – Stan Feb 06 '20 at 17:52
  • Thank you, that worked for me using the company network. – Carlo Carandang Mar 25 '21 at 02:55
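For reference, the environment-variable route mentioned in the comment above usually amounts to something like this sketch (the proxy address is a placeholder); both download.file() and the curl package consult these variables:

# Hypothetical proxy address; substitute your company's values
Sys.setenv(http_proxy  = "http://proxy.mycompany.com:8080",
           https_proxy = "http://proxy.mycompany.com:8080")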
7

This is probably an issue with your call to read_html (or html in your case) not identifying itself to the server it's trying to retrieve content from, which is the default behaviour. Using curl, add a user agent to the handle argument of read_html to have your scraper identify itself.

library(rvest)
library(curl)

# Supply a user agent via a curl handle so the request identifies itself
read_html(curl("http://google.com", handle = curl::new_handle("useragent" = "Mozilla/5.0")))
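If a proxy is also involved, the same handle can carry a proxy option alongside the user agent. A sketch, with a placeholder proxy address:

library(rvest)
library(curl)

# Hypothetical proxy; both options ride on the same curl handle
h <- new_handle(useragent = "Mozilla/5.0",
                proxy     = "http://proxy.mycompany.com:8080")
read_html(curl("http://google.com", handle = h))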
il_raffa
0

I ran into this issue because my VPN was switched on. Immediately after turning it off, I retried, and the issue was resolved.

Brent B
0

I was facing a similar problem, and a small hack solved it. Two characters in the hyperlink were causing the problem for me, so I replaced "è" with "e" and "é" with "e", and it worked. Just make sure the hyperlink remains valid afterwards.
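A less manual variant of the same idea is to percent-encode the URL before requesting it, for example with utils::URLencode. A sketch, with a made-up URL:

# Hypothetical URL containing a non-ASCII character
url <- "http://example.com/café"
read_html(URLencode(url))  # "é" is encoded as %C3%A9 before the request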

0

I got the error message when my laptop was connected to my router over wifi, but my ISP was having some sort of outage:

read_html(brand_url)
Error in open.connection(x, "rb") : 
  Timeout was reached: [somewebsite.com.au] Operation timed out after 10024 milliseconds with 0 out of 0 bytes received

In the above case, my wifi was still connected to the modem, but pages wouldn't load via rvest (nor in a browser). It was temporary and lasted ~2 minutes.
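If transient slowness rather than a hard outage is suspected, the timeout can also be lengthened on a curl handle before retrying. A sketch, with a placeholder URL (the ~10 second limit implied by the error above is the point of comparison):

library(rvest)
library(curl)

# Allow up to 60 seconds before giving up
h <- new_handle(timeout = 60)
brand_page <- read_html(curl("http://somewebsite.com.au", handle = h))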

It may also be worth noting that a different error message is received when wifi is turned off entirely:

brand_page <- read_html(brand_url)
Error in open.connection(x, "rb") : 
  Could not resolve host: somewebsite.com.au
stevec