49

I want to download a zip file from the internet and unzip it in memory without saving to a temporary file. How can I do this?

Here is what I tried:

var url = 'http://bdn-ak.bloomberg.com/precanned/Comdty_Calendar_Spread_Option_20120428.txt.zip';

var request = require('request'), fs = require('fs'), zlib = require('zlib');

  request.get(url, function(err, res, file) {
     if(err) throw err;
     zlib.unzip(file, function(err, txt) {
        if(err) throw err;
        console.log(txt.toString()); //outputs nothing
     });
  });

[EDIT] As suggested, I tried using the adm-zip library, but I still cannot make this work:

var ZipEntry = require('adm-zip/zipEntry');
request.get(url, function(err, res, zipFile) {
        if(err) throw err;
        var zip = new ZipEntry();
        zip.setCompressedData(new Buffer(zipFile.toString('utf-8')));
        var text = zip.getData();
        console.log(text.toString()); // fails
    });
pathikrit
  • 6
    Note well that `zlib` doesn't handle zip file format, it only handles gzip and deflate formats. The `zlib.unzip` function is misleadingly named as it only decompresses gzip and deflate formats. You need a zip format library. – Dan D. Apr 28 '12 at 00:38
  • 1
    This zipfile looks promising https://github.com/springmeyer/node-zipfile/blob/master/README.md – Tina CG Hoehr Apr 28 '12 at 00:39
  • @Dan: Actually, zlib also handles the [`zlib` format](http://www.ietf.org/rfc/rfc1950.txt) (which in turn uses deflate). But that's totally irrelevant here, so +1 :-) – Cameron Apr 28 '12 at 04:01
  • Possible duplicate. http://stackoverflow.com/questions/2095697/unzip-files-using-javascript – Larry Battle Apr 30 '12 at 20:46
  • 2
    Your second example from the edit uses `request.get`, which automatically calls `toString()` on the returned data, but `adm-zip` requires a `Buffer`, not a `String`. Use `request({url: url, encoding: null}, function(err, res, zipFile) { ...` instead of `request.get` to make `request` return a `Buffer`. (Although when I did that I got a `CRC32 checksum failed` error :( ) You should really just skip `request` and use mihai's answer from below. – Nathan Friedly May 03 '12 at 18:52
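
The distinction Dan D. draws in the comments above can be checked from the first bytes of the downloaded data. This is a minimal sketch, assuming the response body is already a Buffer; `detectFormat` is a hypothetical helper name, not part of zlib or adm-zip. Gzip streams start with the bytes 0x1f 0x8b, while zip archives normally start with the local file header signature "PK\x03\x04":

```javascript
var zlib = require('zlib');

// Hypothetical helper: guess the container format from the leading magic bytes.
function detectFormat(buf) {
  if (buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b) return 'gzip';
  if (buf.length >= 4 && buf[0] === 0x50 && buf[1] === 0x4b &&
      buf[2] === 0x03 && buf[3] === 0x04) return 'zip';
  return 'unknown';
}

// zlib round-trips gzip data entirely in memory...
var gz = zlib.gzipSync(Buffer.from('hello'));
console.log(detectFormat(gz));               // gzip
console.log(zlib.gunzipSync(gz).toString()); // hello

// ...but a buffer that starts with the zip signature needs a zip library instead.
var zipLike = Buffer.from([0x50, 0x4b, 0x03, 0x04]);
console.log(detectFormat(zipLike));          // zip
```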

4 Answers

86

You need a library that can handle buffers. The latest version of adm-zip will do:

npm install adm-zip

My solution uses the http.get method, since it returns Buffer chunks.

Code:

var file_url = 'http://notepad-plus-plus.org/repository/7.x/7.6/npp.7.6.bin.x64.zip';

var AdmZip = require('adm-zip');
var http = require('http');

http.get(file_url, function(res) {
  var data = [], dataLen = 0; 

  res.on('data', function(chunk) {
    data.push(chunk);
    dataLen += chunk.length;

  }).on('end', function() {
    var buf = Buffer.alloc(dataLen);

    for (var i = 0, len = data.length, pos = 0; i < len; i++) { 
      data[i].copy(buf, pos); 
      pos += data[i].length; 
    } 

    var zip = new AdmZip(buf);
    var zipEntries = zip.getEntries();
    console.log(zipEntries.length)

    for (var i = 0; i < zipEntries.length; i++) {
      if (zipEntries[i].entryName.match(/readme/))
        console.log(zip.readAsText(zipEntries[i]));
    }
  });
});

The idea is to collect the Buffer chunks in an array and concatenate them into a single new Buffer at the end, since a Buffer cannot be resized once it has been allocated.
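
Node.js also provides `Buffer.concat`, which performs the same allocation and copying as the manual loop in the code above. A minimal sketch, where `chunks` is stand-in data rather than a real HTTP response:

```javascript
// Stand-in for the chunks collected in the 'data' handler.
var chunks = [Buffer.from('PK'), Buffer.from([0x03, 0x04]), Buffer.from('rest')];
var totalLen = chunks.reduce(function (sum, c) { return sum + c.length; }, 0);

// Buffer.concat allocates one buffer of totalLen bytes and copies each chunk into it.
var buf = Buffer.concat(chunks, totalLen);
console.log(buf.length); // 8
```

Passing the total length is optional, but it saves `Buffer.concat` from having to compute it again.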

Update

This is a simpler solution that uses the request module to obtain the response in a buffer, by setting encoding: null in the options. It also follows redirects and resolves http/https automatically.

var file_url = 'https://github.com/mihaifm/linq/releases/download/3.1.1/linq.js-3.1.1.zip';

var AdmZip = require('adm-zip');
var request = require('request');

request.get({url: file_url, encoding: null}, (err, res, body) => {
  if (err) throw err; // network or DNS failure: body is undefined in that case
  var zip = new AdmZip(body);
  var zipEntries = zip.getEntries();
  console.log(zipEntries.length);

  zipEntries.forEach((entry) => {
    if (entry.entryName.match(/readme/i))
      console.log(zip.readAsText(entry));
  });
});

The body of the response is a buffer that can be passed directly to AdmZip, simplifying the whole process.

mihai
  • Yea, thanks...like always, there are a lot of things that could be used. I considered it, but thought an example with just node.js code might be better. – mihai May 01 '12 at 22:47
  • 2
    I just want to emphasize that simply installing adm-zip with `npm install adm-zip` will not work because only the latest version on github supports buffers. – enyo May 04 '12 at 09:44
  • 2
    The latest version on npm supports buffers. – Nikolai Sep 10 '12 at 21:11
  • 11
    Doesn't work with the zips that come out of github tags - ERROR Invalid or unsupported zip format. No END header found – Sam Jul 04 '13 at 04:19
  • @Sam I've had more success using `require('restler-q').get(URL).then`, which will download the entire thing in memory, and write the whole thing to disk. Not as efficient, but chunking solutions aren't working for me either... – Droogans Apr 22 '15 at 16:13
  • @Sam writing the whole thing from memory like I said above won't work -- it gets transformed into a string if you don't use streams. This answer worked for me when downloading github zip files: http://stackoverflow.com/a/12029764/881224 – Droogans Apr 22 '15 at 20:53
  • 5
    I used `axios` for making the request, which has the option to download the whole thing as an ArrayBuffer if you set [responseType](https://github.com/axios/axios#request-config) to `'arraybuffer'`. Then you can pass the `response.data` directly to AdmZip – Ciprian Tomoiagă Jan 16 '18 at 08:41
  • @Sam Github zips don't work because `http.get` doesn't follow redirects. It's not related to unzipping – mihai Nov 21 '18 at 16:15
  • Thanks. This was EXACTLY what I was looking for – mraxus Apr 02 '20 at 14:08
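
As mihai notes in the comments, `http.get` does not follow redirects, which is why the first version fails on GitHub zips. The following is only a sketch of one way to handle that with the built-in modules; `downloadToBuffer` and its `maxRedirects` parameter are hypothetical names introduced for illustration, and it assumes a Node.js version that provides the global `URL` class:

```javascript
var http = require('http');
var https = require('https');

// Hypothetical helper: download a URL into a single Buffer, following up to
// maxRedirects 3xx responses. The callback receives (err, buffer).
function downloadToBuffer(fileUrl, maxRedirects, callback) {
  var client = fileUrl.indexOf('https:') === 0 ? https : http;
  client.get(fileUrl, function (res) {
    var status = res.statusCode;
    if (status >= 300 && status < 400 && res.headers.location) {
      res.resume(); // discard the redirect body
      if (maxRedirects <= 0) return callback(new Error('Too many redirects'));
      // Location may be relative, so resolve it against the current URL.
      var next = new URL(res.headers.location, fileUrl).toString();
      return downloadToBuffer(next, maxRedirects - 1, callback);
    }
    if (status !== 200) {
      res.resume();
      return callback(new Error('Unexpected status code ' + status));
    }
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () { callback(null, Buffer.concat(chunks)); });
    res.on('error', callback);
  }).on('error', callback);
}
```

The resulting buffer can then be passed to `new AdmZip(buffer)` exactly as in the answer above.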
5

Sadly, you can't pipe the response stream straight into the unzip step the way node's zlib lets you; you have to buffer the response and wait for it to end. For big files I suggest piping the response to an fs stream, otherwise you will fill up your memory in a blink!

I don't completely understand what you are trying to do, but IMHO this is the best approach: keep your data in memory only for as long as you really need it, and then stream it to the csv parser.

If you want to keep all your data in memory, you can replace the csv parser method fromPath with from, which takes a buffer instead, and have getData return the unzipped data directly.

You can use adm-zip (as @mihai said) instead of node-zip; just be aware that adm-zip is not yet published on npm, so you need:

$ npm install git://github.com/cthackers/adm-zip.git

N.B. Assumption: the zip file contains only one file

var request = require('request'),
    fs = require('fs'),
    csv = require('csv'),
    NodeZip = require('node-zip')

function getData(tmpFolder, url, callback) {
  var tempZipFilePath = tmpFolder + new Date().getTime() + Math.random()
  var tempZipFileStream = fs.createWriteStream(tempZipFilePath)
  request.get({
    url: url,
    encoding: null
  }).pipe(tempZipFileStream).on('close', function () {
    // 'close' on the write stream means the whole zip has been flushed to disk
    fs.readFile(tempZipFilePath, 'base64', function (err, zipContent) {
      if (err) return callback(err)
      var zip = new NodeZip(zipContent, { base64: true })
      Object.keys(zip.files).forEach(function (filename) {
        var tempFilePath = tmpFolder + new Date().getTime() + Math.random()
        var unzipped = zip.files[filename].data
        fs.writeFile(tempFilePath, unzipped, function (err) {
          callback(err, tempFilePath)
        })
      })
    })
  })
}

getData('/tmp/', 'http://bdn-ak.bloomberg.com/precanned/Comdty_Calendar_Spread_Option_20120428.txt.zip', function (err, path) {
  if (err) {
    return console.error('error: %s', err.message)
  }
  var metadata = []
  csv().fromPath(path, {
    delimiter: '|',
    columns: true
  }).transform(function (data){
    // do things with your data
    if (data.NAME[0] === '#') {
      metadata.push(data.NAME)
    } else {
      return data
    }
  }).on('data', function (data, index) {
    console.log('#%d %s', index, JSON.stringify(data, null, '  '))
  }).on('end',function (count) {
    console.log('Metadata: %s', JSON.stringify(metadata, null, '  '))
    console.log('Number of lines: %d', count)
  }).on('error', function (error) {
    console.error('csv parsing error: %s', error.message)
  })
})
kilianc
3

If you're on macOS or Linux, you can use the unzip command to unzip from stdin.

In this example I'm reading the zip file from the filesystem into a Buffer object but it works with a downloaded file as well:

// Get a Buffer with the zip content
var fs = require("fs")
  , zip = fs.readFileSync(__dirname + "/test.zip");


// Now the actual unzipping:
var spawn = require('child_process').spawn
  , fileToExtract = "test.js"
    // -p tells unzip to extract to stdout
  , unzip = spawn("unzip", ["-p", "/dev/stdin", fileToExtract ])
  ;

// Write the Buffer to stdin
unzip.stdin.write(zip);

// Handle errors
unzip.stderr.on('data', function (data) {
  console.log("There has been an error: ", data.toString("utf-8"));
});

// Handle the unzipped stdout
unzip.stdout.on('data', function (data) {
  console.log("Unzipped file: ", data.toString("utf-8"));
});

unzip.stdin.end();

Which is actually just the node version of:

cat test.zip | unzip -p /dev/stdin test.js

EDIT: It's worth noting that this will not work if the input zip is too big to be read in one chunk from stdin. If you need to read bigger files, and your zip file contains only one file, you can use funzip instead of unzip:

var unzip = spawn("funzip");

If your zip file contains multiple files (and the file you want isn't the first one), I'm afraid you're out of luck. unzip needs to seek within the .zip file, because a zip is a container whose index (the central directory) sits at the end of the archive, so the entry you want may be the very last one. In that case you have to save the file temporarily (node-temp comes in handy).

enyo
  • 2
    I'm interested in reasoning for someone down voting without leaving a comment. Seriously, what is the reason for this to not work? I'm a beginner. – Strawberry May 02 '12 at 17:54
  • I never understood downvoting without a comment either... I assume it's because this only works with one file or if the zip is fairly small. – enyo May 04 '12 at 09:36
  • 1
    Thanks for this. I was stuffing around for ages with lots of bad/undocumented zip libraries just trying to unzip an archive. This was gold. – Sam Jul 04 '13 at 04:31
  • The advantage of using `unzip` on Mac or Linux is that it keeps all the original file permissions after unzipping, whereas node's createWriteStream (which adm-zip uses) writes files with mode '0666' by default. – mygoare May 21 '15 at 02:43
  • Due to the design of the .zip file format, it's impossible to interpret a .zip file from start to finish without sacrificing correctness. The Central Directory, which is the authority on the contents of the .zip file, is at the **end** of a .zip file, not the beginning. A stream/pipe like this would need to buffer the entire .zip file to get to the Central Directory before interpreting anything (defeating the purpose of this example). Doing this with a nontrivial zip file will result in, "End-of-central-directory signature not found." – Ryan McGeary Mar 14 '17 at 18:54
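
Ryan McGeary's point about the Central Directory can be seen by locating the End of Central Directory (EOCD) record, which a reader has to find by scanning backwards from the end of the archive. The following is only an illustrative sketch, not how unzip or adm-zip implement it; `findEndOfCentralDirectory` is a hypothetical helper, and it ignores the zip64 extensions:

```javascript
var EOCD_SIGNATURE = 0x06054b50; // bytes "PK\x05\x06", stored little-endian
var EOCD_MIN_SIZE = 22;          // fixed part of the record, without the comment

// Hypothetical helper: scan backwards for the End of Central Directory record
// and return its offset and entry count, or null if it is missing.
function findEndOfCentralDirectory(buf) {
  for (var i = buf.length - EOCD_MIN_SIZE; i >= 0; i--) {
    if (buf.readUInt32LE(i) === EOCD_SIGNATURE) {
      return {
        offset: i,
        totalEntries: buf.readUInt16LE(i + 10),
        centralDirectoryOffset: buf.readUInt32LE(i + 16)
      };
    }
  }
  return null;
}

// An empty zip archive is just a 22-byte EOCD record with zeroed fields.
var emptyZip = Buffer.alloc(22);
emptyZip.writeUInt32LE(EOCD_SIGNATURE, 0);
console.log(findEndOfCentralDirectory(emptyZip));

// A truncated download has no EOCD record, so nothing can be located.
console.log(findEndOfCentralDirectory(emptyZip.slice(0, 10))); // null
```

Because the record can only be found once the last bytes have arrived, a reader that wants the authoritative file list has to hold the whole archive, either in memory or on disk.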
1

The node-zip module was released two days ago. It is a wrapper for JSZip, a JavaScript-only implementation of zip.

var NodeZip = require('node-zip')
  , zip = new NodeZip(zipBuffer.toString("base64"), { base64: true })
  , unzipped = zip.files["your-text-file.txt"].data;
enyo
  • 2
    node-zip doesn't support buffers, so you're forced to convert into a string, which is a Bad Thing – Nikolai Sep 10 '12 at 21:00