
I would like to read CSV files from S3 using fread from the data.table package, like this:

 url_with_signature <- signURL(url, access_key, secret_key)
 DT <- fread(url_with_signature)

Is there a package or piece of code somewhere that will allow me to build such a URL using an access/secret key pair?

I would prefer not to use awscli for reading the data.

Bulat
  • Here is a question about writing data directly to S3, with an answer on reading into memory as well: http://stackoverflow.com/questions/30084595/write-r-data-as-csv-directly-to-s3 – Bulat Jun 01 '16 at 16:31

1 Answer


You can use the aws.s3 package.

To perform your read, first set your credentials and load the package:

# These variables should be set in your environment, but you could set them in R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION" = "us-east-1")

library("aws.s3")

If you have an R object obj that you want to save to S3 and read back later:

s3save(obj, bucket = "my_bucket", object = "object")
# and then later; s3load(), like load(), restores obj into the calling environment
s3load("object", bucket = "my_bucket")

Obviously, substitute the bucket name and filename (the name of the object in the AWS bucket) with real values. s3save and s3load work like save() and load(). You can also save and load in RDS format with s3saveRDS and s3readRDS.
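For example, a minimal sketch of the RDS route (the object and bucket names are placeholders, and the argument names assume the usual s3saveRDS()/s3readRDS() interface):

# Save a single R object to the bucket as an .rds file, then read it back later.
# "obj.rds" and "my_bucket" are placeholder names.
s3saveRDS(obj, object = "obj.rds", bucket = "my_bucket")
obj <- s3readRDS(object = "obj.rds", bucket = "my_bucket")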

If you need to read a text file, it is a bit more complicated: the package's `get_object` function returns a raw vector, so we have to parse it ourselves:

raw_data <- get_object('data.csv', 'my_bucket')

# this method of parsing the data is copied from the httr package;
# substitute the `from` encoding as needed
data <- iconv(readBin(raw_data, character()), from = "UTF-8", to = "UTF-8")

# now the data can be read by any R function that accepts text input, e.g.
read.csv(text = data)
fread(data)

# All this can be done without temporary objects:
fread(iconv(
  readBin(get_object('data.csv', 'my_bucket'), character()),
  from = "UTF-8", to = "UTF-8"))

As far as I know, a 'signed URL' of the kind you describe is not available. A caveat should you try to develop such a solution: think carefully about the security implications of storing your secret access key in your source code.

Another concern with the 'signed URL' is that it would be held in memory, and if the workspace is saved, it would end up on disk as well. Any such solution would need a careful security review.

  • Sorry about that – hit enter too early. You are worried that this creates a temporary file? Aha. As far as I can tell from the code, this doesn't actually create a temporary file, but I might be mistaken. `s3load` calls `get_object` (no file creation here), `get_object` calls `s3HTTP`, still with no local files, and that function calls the GET method from the `httr` package. I can't see `awscli` anywhere. – pusillanimous Jun 01 '16 at 09:25
  • And to clarify my initial post: "filename" is not a local filename. It is the filename of the file _in the cloud_, i.e. the S3 file name. – pusillanimous Jun 01 '16 at 09:27
  • Ok, I think we are getting there. I guess to do `fread(url)` I could use `signature_v2_auth` or `signature_v4_auth` from the aws.signature package? - https://cran.r-project.org/web/packages/aws.signature/aws.signature.pdf – Bulat Jun 01 '16 at 10:15
  • No, I'm sorry, that won't work. There is no way to store a URL to an AWS file _with_ the authentication. The authentication works by sending a signature (as created by the `signature_v4_auth` method) as a **header** with the GET request. So the request looks like this (not exactly – but the point is clear): `GET /bucket/filename; Authentication: AUTH TOKEN HERE`. The problem for your idea is that the URL is separate from the authentication token, and it just can't be stored as a single object. (Sorry about the formatting) – pusillanimous Jun 01 '16 at 10:18
  • And to expand a bit: The URL is a Uniform Resource Locator. It specifies where the object (file) is found. Your login information is not part of that, because it is information about state. To allow logins without passwords, AWS has token authentication. This means the program can request a (time-limited) token by sending Amazon the user ID and secret key. That token is then sent as meta-information with every request, as a header. – pusillanimous Jun 01 '16 at 10:24
  • And last comment now: You could possibly get an access token, store it, and construct the request manually with the authentication header. This is exactly what the library I suggested using does behind the scenes. (And I have tested: It does not save to a temporary file.) A rough sketch of this manual approach is shown after the comments below. – pusillanimous Jun 01 '16 at 10:30
  • Ok, so the last problem I have with s3load is that all examples are about `.Rdata` files, while I want to read a `CSV` file with `fread` into a data.table. – Bulat Jun 01 '16 at 10:33
  • Then you want to use `get_object` directly. It reads any file into memory instead of converting to and from .RData. So you could do: `read.csv(get_object("data.csv", "my_bucket"))`. If you need to save, use `put_object` (and not `save_object` - that one is for saving locally). – pusillanimous Jun 01 '16 at 10:37
  • I get this error: `Error in fread(get_object(my_url, my_bucket)) : 'input' must be a single character string containing a file name, a command, full path to a file, a URL starting 'http[s]://', 'ftp[s]://' or 'file://', or the input data itself` – Bulat Jun 01 '16 at 10:57
  • This is because `get_object` returns a raw vector, and not a character vector or connection. You could do this, admittedly a bit convoluted: `fread(iconv(readBin(get_object(url, bucket), character()), from = "UTF-8", to = "UTF-8"))` (substituting from="UTF-8" with the appropriate encoding) – pusillanimous Jun 01 '16 at 11:21
  • Cool, that worked. 2+ times speed-up compared to the `aws cli` wrapper (i.e. copying to local first). Do you mind updating your answer, since it is not correct as is? – Bulat Jun 01 '16 at 12:25
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/113553/discussion-between-bulat-and-sigvei). – Bulat Jun 01 '16 at 16:28
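For illustration, here is a rough sketch of the manual approach described in the comments above: compute a signature yourself (e.g. with the aws.signature package), send it as an Authorization header on a plain GET request with httr, and parse the response with fread. The bucket/object names, URL style, and Authorization string below are placeholders, not a working signature:

library("httr")
library("data.table")

# Placeholder names -- substitute real values
bucket <- "my_bucket"
object <- "data.csv"

# In practice this string would be the Signature Version 4 value computed by
# the aws.signature package (which is what aws.s3 does behind the scenes);
# it is shown here only to illustrate where the signature goes.
auth_header <- "AWS4-HMAC-SHA256 Credential=..., SignedHeaders=..., Signature=..."

resp <- GET(
  sprintf("https://%s.s3.amazonaws.com/%s", bucket, object),
  add_headers(
    Authorization = auth_header,
    `x-amz-date` = format(Sys.time(), "%Y%m%dT%H%M%SZ", tz = "UTC")
  )
)

# content(..., as = "text") returns a character string that fread can parse
DT <- fread(content(resp, as = "text", encoding = "UTF-8"))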