
I am quite new to R and am trying to access some information on the internet, but am having problems with connections that don't seem to be closing. I would really appreciate it if someone here could give me some advice...

Originally I wanted to use the WebChem package, which in theory delivers everything I want, but when some of the output data is missing from the webpage, WebChem doesn't return any data from that page at all. To get around this, I have taken most of the code from the package but altered it slightly to fit my needs. This worked fine for about the first 150 usages, but now, although I have changed nothing, when I use the command read_html, I get the warning message "closing unused connection 4 (http:....." Although this is only a warning, read_html doesn't return anything after it is generated.

I have written some simplified code, given below, which has the same problem.

Closing R completely (or even rebooting my PC) doesn't seem to make a difference - the warning message now appears the second time I use the code. I can run the queries one at a time, outside the loop, with no problems, but as soon as I use the loop, the error occurs again on the second iteration. I have tried to vectorise the code, and it returned the same error message. I tried showConnections(all = TRUE), but only got connections 0-2 for stdin, stdout and stderr. I have searched for ways to close the html connection, but I can't define the url as a con, and close(qurl) and close(ttt) don't work either (they return the errors no applicable method for 'close' applied to an object of class "character" and no applicable method for 'close' applied to an object of class "c('xml_document', 'xml_node')", respectively).

Does anybody know a way to close these connections so that they don't break my routine? Any suggestions would be very welcome. Thanks!

PS: I am using R version 3.3.0 with RStudio Version 0.99.902.

library(xml2)  # provides read_html(), xml_find_all() and xml_text()

CasNrs <- c("630-08-0","463-49-0","194-59-2","86-74-8","148-79-8")
tit <- character()
for (i in seq_along(CasNrs)){
  CurrCasNr <- as.character(CasNrs[i])
  baseurl <- 'http://chem.sis.nlm.nih.gov/chemidplus/rn/'
  qurl <- paste0(baseurl, CurrCasNr, '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50')
  ttt <- try(read_html(qurl), silent = TRUE)
  # only extract the title if the request actually succeeded
  if (!inherits(ttt, "try-error")) {
    tit[i] <- xml_text(xml_find_all(ttt, "//head/title"))
  }
}
  • On R-3.2.5 (win10_64) I get neither warnings nor errors. Can you try it in R-3.2.5 to see if you can reproduce it there? – r2evans Jun 15 '16 at 15:58
  • I'm on R 3.3.0 on OS X and get no errors. You can also use `lapply` and `%>%` chaining to simplify your structure further (see the sketch after these comments), though I doubt it'll have an effect on the warning. – alistaire Jun 15 '16 at 16:11
  • Wait, do you have `curl` installed? According to `?read_html`, it will use `curl()` if installed, else the `url()` connection. Both should work, really, but it's worth a shot. – alistaire Jun 15 '16 at 16:18
  • Thanks for the suggestions. I tried the code in R-3.2.5, as well as after installing curl. In both cases, it worked for a little while, but after a few iterations I got the close connection error again. I had also used lapply in the code without any success - here I posted the loop to show that it works for one run, just not for multiples. – user6469960 Jun 16 '16 at 08:16
    So, I found it a little bit strange that I could manually step through the loop and everything would work, but when I used the loop, the connection error arose - the only difference is how fast I can manually step through the loop. So I tried adding a pause (currently at 5s) just after the read_html command and now the loop also runs fine (if very slowly). It seems like there was just not enough time for the connections to be closed properly or something similar. In any case, now I can just run the program overnight - it takes a while, but at least it doesn't break down! Thanks for your help! – user6469960 Jun 16 '16 at 13:52
  • Could you post and accept the answer to remove it from the question queue? – Artem Oct 03 '18 at 15:30
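For reference, here is a minimal sketch of the `lapply`/`%>%` restructuring alistaire suggests, reusing the CasNrs vector and URL from the question (`sapply` is used so the result is a plain character vector; whether it avoids the warning is untested):

library(xml2)
library(magrittr)  # for %>%

baseurl <- 'http://chem.sis.nlm.nih.gov/chemidplus/rn/'
tit <- sapply(CasNrs, function(cas) {
  paste0(baseurl, cas, '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50') %>%
    read_html() %>%                      # fetch and parse the page
    xml_find_first("//head/title") %>%   # locate the title node
    xml_text()                           # extract its text
})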

2 Answers


I haven't found a good answer for this problem. The best work-around that I came up with is to include the function below, with Secs = 3 or 4. I still don't know why the problem occurs or how to stop it without building in a large delay.

CatchupPause <- function(Secs){
  Sys.sleep(Secs)         # pause to let the connection work
  closeAllConnections()   # close any connections left open
  gc()                    # trigger garbage collection
}
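For example, a minimal sketch of how this could slot into the loop from the question, with the pause value suggested above:

ttt <- try(read_html(qurl), silent = TRUE)
CatchupPause(3)  # wait, then flush any connections left open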
    This was my solution as well, still can't seem to find an answer. Really wish there was a close() function for read_html – Bryan A Jun 04 '19 at 12:06

After researching the topic I came up with the following solution:

  library(xml2)  # provides read_html()

  page_url <- "https://website_example.com"
  con <- url(page_url, "rb")   # open an explicit connection
  html <- read_html(con)       # parse the page from the connection
  close(con)                   # the connection can now be closed cleanly

# ...then do whatever you want with `html`, since it's already saved!
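If you want this pattern without having to remember the `close()` call, a small hypothetical helper (the name `read_html_closed` is mine, not from any package) can guarantee the connection is closed even if parsing fails, via `on.exit()`:

library(xml2)

read_html_closed <- function(page_url, ...) {
  con <- url(page_url, "rb")
  on.exit(close(con), add = TRUE)  # runs even if read_html() errors
  read_html(con, ...)
}

html <- read_html_closed("https://website_example.com")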