I'm trying to write a file-processing script in a very functional manner. We're using Highland.js to manage the stream processing, but because I'm so new to it I think I'm getting confused about how to handle this unique situation.
The issue is that the data in the file stream is not uniform. The first line of the file is typically the header, which we want to store in memory and then zip with every row in the stream that follows.
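To make the goal concrete, the per-row transform I'm after looks roughly like this in plain JS (the function and field names are just illustrative):

```javascript
// Illustrative sketch: pair a stored header row with one data row,
// producing an object keyed by the header names.
function zipRow(headers, line) {
    var cols = line.split(',');
    var obj = {};
    headers.forEach(function (h, i) {
        obj[h] = cols[i];
    });
    return obj;
}

// zipRow(['id', 'name'], '1,Ann') -> { id: '1', name: 'Ann' }
```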
Here's my first go at it:
var _ = require('highland');
var fs = require('fs');
var stream = fs.createReadStream('./data/gigfile.txt');
var output = fs.createWriteStream('output.txt');
var headers = [];
var through = _.pipeline(
    _.split(),
    _.head(),
    _.doto(function (col) {
        headers = col.split(',');
        return headers;
    }),
    // ...
    _.splitBy(','),
    _.zip(headers),
    _.wrapCallback(process)
);

_(stream)
    .pipe(through)
    .pipe(output);
The first step in the pipeline splits the file into lines. The next grabs the header, and the doto stores it in a variable declared outside the pipeline. The problem is that the lines after the header never make it into the rest of the stream, so processing is blocked, likely because the head() call above consumes only the first value and discards everything after it.
I've tried a few other variations, but I think this example gives you a sense of where I need to go with it.
Any guidance on this would be helpful. It also raises a second question: if I have different values in each of my rows, how can I splinter the stream into a number of different stream operations of variable length and complexity?
Thanks.
EDIT: I've produced a better result below, but I'm questioning its efficiency. Is there a way to optimize it so that I'm not checking whether the headers have been recorded on every single row? This still feels sloppy.
var through = _.pipeline(
    _.split(),
    _.filter(function (row) {
        // Pass rows through once the headers are set; otherwise
        // record the first non-empty row as the headers and drop it
        if (!row || headers) {
            return true;
        }
        headers = row.split(',');
        return false;
    }),
    _.map(function (row) {
        return row.split(',');
    }),
    _.batch(500),
    _.compact(),
    _.map(function (row) {
        return JSON.stringify(row) + "\n";
    })
);
_(stream)
    .pipe(through)
    .pipe(output);