Connections were introduced in R 1.2.0 and described by Brian Ripley in the first issue of R NEWS (now called The R Journal) of January 2001 (pages 16-17) as an abstracted interface to IO streams such as a file, URL, socket, or pipe. In 2013, Simon Urbanek added a `Connections.h` C API which allows R packages, such as curl, to implement custom connection types.
One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the `readBin`, `writeBin`, `readLines` and `writeLines` functions. This allows for asynchronous data processing, for example when dealing with large data or network connections:
# Read the first 30 lines, 10 lines at a time
con <- url("http://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
Same for writing, e.g. to a file:
tmp <- file(tempfile())
open(tmp, "w")
writeLines("A line", tmp)
writeLines("Another line", tmp)
close(tmp)
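In the snippet above the temporary path is passed straight into `file()`, which makes it hard to inspect the result afterwards. A small variation (the `path` variable is mine) keeps the path around so the write can be verified:

```r
# Keep the path in a variable so we can read the file back after
# closing (and thereby destroying) the connection object
path <- tempfile()
con <- file(path)
open(con, "w")
writeLines("A line", con)
writeLines("Another line", con)
close(con)

# Read the file back via its path to confirm both lines were written
readLines(path)
```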
Open the connection as `rb` or `wb` to read/write binary data (called raw vectors in R):
# Read the first 3000 bytes, 1000 bytes at a time
con <- url("http://jeroen.github.io/data/diamonds.json")
open(con, "rb")
data1 <- readBin(con, raw(), n = 1000)
data2 <- readBin(con, raw(), n = 1000)
data3 <- readBin(con, raw(), n = 1000)
close(con)
The `pipe()` connection is used to run a system command and pipe text to its stdin or from its stdout, as you would do with the `|` operator in a shell. E.g. (let's stick with the curl examples), you can run the `curl` command line program and pipe the output to R:
con <- pipe("curl -H 'Accept: application/json' https://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
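Piping works in the other direction as well: open the connection in write mode to send text to the stdin of the command. A minimal sketch, assuming a Unix-like shell with the `sort` utility available (the redirect to a temp file is just there so we can inspect the result):

```r
# Write lines to the stdin of 'sort'; redirect its stdout to a temp file
out <- tempfile()
con <- pipe(paste("sort >", shQuote(out)), "w")
writeLines(c("banana", "apple", "cherry"), con)
close(con)

# The command has run once the connection is closed
readLines(out)
```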
Some aspects of connections are a bit confusing: to incrementally read/write data you need to explicitly `open()` and `close()` the connection. However, `readLines` and `writeLines` automatically open and close (but not destroy!) an unopened connection. As a result, the example below will read the first 10 lines over and over again, which is not very useful:
con <- url("http://jeroen.github.io/data/diamonds.json")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
identical(data1, data2)
Another gotcha is that the C API can both close and destroy a connection, but R only exposes a function called `close()`, which actually means destroy. After calling `close()` on a connection, it is destroyed and completely useless.
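A quick illustration of this (the exact error message may vary between R versions):

```r
# Create, open, and close a file connection
con <- file(tempfile())
open(con, "w")
close(con)

# The R object 'con' still exists, but the underlying connection is
# destroyed: any further use raises an "invalid connection" error
result <- tryCatch(
  writeLines("hello", con),
  error = function(e) conditionMessage(e)
)
print(result)
```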
To stream-process data from a connection you want to use a pattern like this:
stream <- function(){
  con <- url("http://jeroen.github.io/data/diamonds.json")
  open(con, "r")
  on.exit(close(con))
  while(length(txt <- readLines(con, n = 10))){
    some_callback(txt)
  }
}
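To make the pattern concrete, here is a version of it (the local file and the line-counting callback are my own substitutions for the URL and `some_callback`) that tallies lines from a file connection 10 at a time:

```r
# Concrete instance of the streaming pattern: count lines in a file,
# reading 10 at a time, with on.exit() guaranteeing cleanup
count_lines <- function(path){
  con <- file(path)
  open(con, "r")
  on.exit(close(con))
  total <- 0
  while(length(txt <- readLines(con, n = 10))){
    total <- total + length(txt)
  }
  total
}

path <- tempfile()
writeLines(sprintf("line %d", 1:25), path)
count_lines(path)
```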
The `jsonlite` package relies heavily on connections to import/export ndjson data:
library(jsonlite)
library(curl)
diamonds <- stream_in(curl("https://jeroen.github.io/data/diamonds.json"))
The streaming (by default 1000 lines at a time) makes it fast and memory efficient:
library(nycflights13)
stream_out(flights, file(tmp <- tempfile()))
flights2 <- stream_in(file(tmp))
all.equal(flights2, as.data.frame(flights))
Finally, one nice feature of connections is that the garbage collector will automatically close them if you forget to do so, with an annoying warning:
con <- file(system.file("DESCRIPTION"), open = "r")
rm(con)
gc()