3

Imagine you have 3 lists/arrays in JavaScript/Node.js:

  1. array contains 1.000.000 data-items
  2. array contains 100.000 data-items
  3. array contains 50.000 data-items

Each data-item is an object with 2 properties, like a date and a price. Arrays 2 and 3 are subsets/sub-lists of the items in array 1.

My question: What is the fastest way to match the date of each single item in array 1 against all the dates in arrays 2 and 3?

I'm used to .NET/C#, where something like 'contains(item)' is nice... Right now in Node.js I use 3 for-loops, which is way too slow... I need some kind of index or the like to speed up the process...

An example of data could be like:

Input:

array 1: 1,2,3,4,5,6,7,8,10
array 2: 2,3,5,7,9
array 3: 1,4,5,10

Output (written to a file):

1,'',1
2,2,''
3,3,''
4,'',4
...etc...
PabloDK
  • Are there duplicate dates? Is the date a `Date` object or a string? Can you include an example object with the date and price which would and would not match? Also how much RAM are you working with? – jmunsch May 14 '16 at 20:49
  • In my PostgreSQL db the dates are saved as "timestamp with time zone", e.g. "2014-12-01 05:33:56.761199+00". Example of a data item: new TradePoint(date, price) - if the date matches, add the price to the new list/line... Regarding memory: as little as possible... 2-4-8 GB... No duplicate dates/items in each list. – PabloDK May 14 '16 at 20:59
  • @PabloDK, are the arrays guaranteed to be sorted in advance? – zzzzBov May 14 '16 at 22:04
  • Is there a reason you are using arrays rather than objects? – Soren May 15 '16 at 05:12
  • Let me rephrase my question: Instead of using a small object with 2 properties and push that into arrays. Is there any other structure/type of collection which is faster/performs better in javascript (especially regarding finding a given object in the array as fast as possible)? – PabloDK May 15 '16 at 06:15

4 Answers

2
var t = {};
// loop through once and create a constant time lookup keyed by date
arr1.forEach(function (item) { t[item.date] = {1: item.price, 2: null, 3: null}; });
// numeric property names need bracket notation: t[key][2], not t[key].2
arr2.forEach(function (item) { if (t[item.date]) t[item.date][2] = item.price; });
arr3.forEach(function (item) { if (t[item.date]) t[item.date][3] = item.price; });

This will be a linear operation. Sorting the data first is a tradeoff that may or may not be worth the time the sort takes.

It would be about the same as a triple JOIN either way. The solution I have provided is O(N), whereas nested loops might be O(N^3); a solution that sorts first would probably still be O(N log N), though that is just a guess.
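
To make the lookup idea concrete, here is a minimal, self-contained sketch run against the sample numbers from the question. It is only an illustration under assumptions: `joinByDate` and `toItems` are made-up helper names, the dates are plain numbers (for `Date` objects you would probably want `date.getTime()` as the key), and a `Map` is used instead of a plain object so that the keys keep their insertion order.

```javascript
// Sample data from the question; `date` is a plain number here for brevity.
// Assumption: real dates are strings or numbers, not Date objects.
const toItems = (dates) => dates.map((d) => ({ date: d, price: d }));
const arr1 = toItems([1, 2, 3, 4, 5, 6, 7, 8, 10]);
const arr2 = toItems([2, 3, 5, 7, 9]);
const arr3 = toItems([1, 4, 5, 10]);

// joinByDate is a hypothetical helper name, not part of any library.
function joinByDate(base, second, third) {
  // One pass over the base array builds the lookup; a Map keeps insertion order.
  const rows = new Map();
  for (const item of base) {
    rows.set(item.date, [item.date, '', '']);
  }
  // One pass over each smaller array fills in its column when the date exists.
  for (const item of second) {
    const row = rows.get(item.date);
    if (row) row[1] = item.price;
  }
  for (const item of third) {
    const row = rows.get(item.date);
    if (row) row[2] = item.price;
  }
  // Turn every row into one comma-separated output line.
  return Array.from(rows.values(), (row) => row.join(','));
}

const lines = joinByDate(arr1, arr2, arr3);
console.log(lines.join('\n'));
```

Each array is walked exactly once, so the work grows with the total number of items rather than with the product of the array lengths. The 9 in the second array has no partner in the first array and is simply skipped.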

If the dates are already sorted, you could potentially bucketize the dates or do some sort of radix search, which might speed it up a bit.

See: https://en.m.wikipedia.org/wiki/Radix_tree

You might also be able to do it with promises so it runs async:

var t = {};
// loop through once and create a constant time lookup
arr1.forEach(function (item) { t[item.date] = {1: item.price, 2: null, 3: null}; });

var promiseArray = arr2.map(function (item) {
    return Promise.resolve(item)
        .then(function (item) {
            if (t[item.date]) t[item.date][2] = item.price;
        });
});
// concat the two promise arrays together (concat returns a new array, so reassign)
promiseArray = promiseArray.concat(arr3.map(function (item) {
    return Promise.resolve(item)
        .then(function (item) {
            if (t[item.date]) t[item.date][3] = item.price;
        });
}));
// resolve all the promises
Promise.all(promiseArray)
    .then(function () {
        // t has results
        debugger;
    });
jmunsch
  • Thank you so much! Well, as you probably guessed, I'm not an expert in JavaScript (yet) - I'm used to typed languages like C# etc... I understand everything in your example though... BUT I'm trying to wrap my head around how a single-threaded process can perform better (or worse) via something like async... when it still just uses 1 CPU core? Is it because the work it waits for is done by a completely different thread/process/CPU core (by default?) Otherwise I don't understand how and why something like 'async' can improve anything. How would you run this on 4 CPU cores simultaneously? – PabloDK May 15 '16 at 08:22
  • @PabloDK interesting question. Concurrency with node is done with the `cluster` module and it would require message passing. Its benefit comes in when handling a lot of incoming requests, more or less the requests get load balanced across each core. By writing it with promises it gives a single thread / core / the event loop a chance to process other incoming requests without blocking. So the benefit would come in if other requests were coming in while the promises waited to be resolved. – jmunsch May 15 '16 at 10:02
  • And for an answer to shared memory see: http://stackoverflow.com/questions/10965201/in-node-js-how-to-declare-a-shared-variable-that-can-be-initialized-by-master-p i think the comments there suggest `ems` but it'd probably be easier to offload `var t` into a redis instance and access it there with the forked promises. – jmunsch May 15 '16 at 10:33
  • Just another thought, its both a benefit and downside to node. – jmunsch May 15 '16 at 10:41
  • I don't remember all the details... but I once saw a video with one of the guys from StrongLoop talking about this issue, and as I recall it, cluster is not "TRUE" multi-CPU-core programming... My issue is not handling many requests at all - it's all server-side code/a server app... so it's only a matter of raw server power and optimized code, the way I see it... – PabloDK May 15 '16 at 14:30
  • @PabloDK then nodejs is the wrong tool if "TRUE" concurrency is the issue. – jmunsch May 15 '16 at 17:33
1

I'd first sort all the arrays by the key property (the date, as far as I understand), if they aren't sorted already, then use a single for loop over the 1st array with cursors into the two other arrays. Each cursor moves to the next item only when its current item has been written to the output. This way there is no "contains" search through whole arrays.

Example:

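var j = 0; // cursor into array2
var k = 0; // cursor into array3
for( var i = 0; i < array1.length; ++i ) {
    // == only works if the dates are strings or numbers;
    // two distinct Date objects are never ==, so compare getTime() in that case
    var out1 = array1[i].date;
    // a cursor advances (j++ / k++) only when its current item matches and is written out
    var out2 = j < array2.length && array2[j].date == out1 ? array2[j++].value : '';
    var out3 = k < array3.length && array3[k].date == out1 ? array3[k++].value : '';
    // output() stands for whatever writes one line to the file
    output( out1, out2, out3 );
}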
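
A slightly more defensive variant of the same single-loop idea could look like the sketch below. It is not the code from the answer: `mergeByDate` is a made-up name, the property is called `price` as in the question, and it assumes all three arrays are sorted ascending by a `date` that compares correctly with `<` and `===` (numbers or ISO-style strings, not `Date` objects). The extra `while` loops skip entries that have no partner in the first array, so a stray item cannot block a cursor for the rest of the run.

```javascript
// mergeByDate is a hypothetical name; all three arrays are assumed sorted
// ascending by `date`, with dates that work with < and === (numbers or strings).
function mergeByDate(array1, array2, array3) {
  const rows = [];
  let j = 0; // cursor into array2
  let k = 0; // cursor into array3
  for (let i = 0; i < array1.length; i++) {
    const date = array1[i].date;
    // Skip entries that sort before the current date so a stray item
    // (one with no partner in array1) cannot block the cursor forever.
    while (j < array2.length && array2[j].date < date) j++;
    while (k < array3.length && array3[k].date < date) k++;
    // Take the price and advance the cursor only on an exact match.
    const out2 = j < array2.length && array2[j].date === date ? array2[j++].price : '';
    const out3 = k < array3.length && array3[k].date === date ? array3[k++].price : '';
    rows.push([date, out2, out3]);
  }
  return rows;
}

// Tiny made-up data set: the price is just the date times ten.
const item = (d) => ({ date: d, price: d * 10 });
const merged = mergeByDate(
  [1, 2, 3, 4, 5].map(item),
  [0, 2, 5].map(item), // 0 has no partner in the first array
  [1, 4].map(item)
);
console.log(merged);
```

Every cursor only ever moves forward, so the total work is still one pass over each array.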
JustAndrei
  • This is exactly what I'm currently doing... but I thought there must be a faster way in JavaScript? And yes - all lists/arrays are sorted in the same order by date... – PabloDK May 14 '16 at 21:11
  • @PabloDK, please check my example I've just added. Are you sure you are doing something like that? Because you wrote you had 3 for loops, while it's possible to use just one. – JustAndrei May 14 '16 at 21:17
  • You have to store the lengths in variables. – Knu May 14 '16 at 21:22
  • @Knu, OMG, this is just a concept of a single for loop with 2 additional cursors, and I typed it on my mobile phone. Of course whatever gets repeatedly resolved into a constant value inside a loop, must be calculated just once, outside of that loop. But this optimization is so obvious for the author of the question that I decided not to make the example overcomplicated. – JustAndrei May 14 '16 at 21:27
  • @JustAndrei - I see your code loops and iterates through array1... but the way I see it, the code doesn't loop through array2 or array3? It just makes one single comparison (var out2 = j < array2.length && array2[j].date == out1 ? array2[j++].value : '';)? Please correct me if I'm wrong or missed something. – PabloDK May 15 '16 at 07:03
  • @PabloDK, let me explain this line: var out2 = j < array2.length && array2[j].date == out1 ? array2[j++].value : ''; It is equivalent to: var out2; if( j < array2.length && array2[j].date == out1 ) { out2 = array2[j].value; j = j + 1; /* a loop */ } else { out2 = ''; } – JustAndrei May 16 '16 at 08:20
  • @JustAndrei: I do know the notation " X ? Y : Z "... but it doesn't change that the only loop you have is the "for( var i = 0; i < array1.length; ++i )"? The rest is just a simple IF-statement... – PabloDK May 17 '16 at 10:03
  • @PabloDK, i is index for array1, while j is index for array2. i is incremented by the for operator, while j is incremented as array2[j++], i.e. only when the matching date is found and sent to output. – JustAndrei May 19 '16 at 06:48
1

It's JavaScript! Consider restructuring your arrays into objects, so that the date property becomes the key, e.g.:

var arr2 = {
    '2016-05-13 00:00:01': { prop: 'value', prop2: 'value' },
    '2016-05-13 00:00:02': { prop: 'value', prop2: 'value' },
    ...
};

This way arr2[date] returns either an object or undefined. If you get an object, convert it to a string suitable for output; otherwise write '' or whatever.
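
If the data already lives in arrays, the keyed object can be built in one pass. The following is only a sketch under assumptions: `indexByDate` and `keyOf` are made-up helper names, the items have `date` and `price` properties as in the question, and `Date` objects are turned into ISO strings so that two equal instants end up under the same key.

```javascript
// keyOf and indexByDate are hypothetical helpers, not library functions.
// Assumption: `date` is either a Date object or something String() can key on.
function keyOf(date) {
  return date instanceof Date ? date.toISOString() : String(date);
}

function indexByDate(items) {
  // Object.create(null) avoids accidental hits on inherited keys such as "constructor".
  const index = Object.create(null);
  for (const entry of items) {
    index[keyOf(entry.date)] = entry;
  }
  return index;
}

// Made-up sample points in the shape described in the question.
const points = [
  { date: new Date('2016-05-13T00:00:01Z'), price: 10 },
  { date: new Date('2016-05-13T00:00:02Z'), price: 11 },
];
const byDate = indexByDate(points);

// A lookup is a single property access: an object on a hit, undefined on a miss.
const hit = byDate[keyOf(new Date('2016-05-13T00:00:01Z'))];
const miss = byDate[keyOf(new Date('2016-05-13T00:00:03Z'))];
console.log(hit ? hit.price : '', miss ? miss.price : '');
```

Building the index costs one pass over the array; after that each lookup no longer depends on how many items the array held.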

JustAndrei
  • Thank you so much! I didnt realize how values/prop and keys worked before in javascript - but now i do! I just tested...a basic for-loop with 10 million iterations (insertions into the object) - takes 2650 ms vs. a key lookup in the object with 10 millions objects at only 3 ms!! – PabloDK May 15 '16 at 10:07
0

If the arrays are sorted, I guess the fastest array intersection algorithm in JS looks like this:

function intersect(a1, a2) {
  var a1i = 0,
      a2i = 0,
      isect = [];

  while (a1i < a1.length && a2i < a2.length) {
    if (a1[a1i].date < a2[a2i].date) a1i++;
    else if (a1[a1i].date > a2[a2i].date) a2i++;
    else {
      isect.push(a1i); // they match: record the index into a1
      a1i++;
      a2i++;
    }
  }

  return isect;
}

Once you have the intersection indices, you can easily construct your desired output.
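
One way the output could be assembled from those indices is sketched below. This is an assumption-laden illustration, not part of the answer: `intersectIndices` is a compact restatement of the two-cursor walk, `buildRows` is a made-up name, and because the question says arrays 2 and 3 contain items taken from array 1, the price is read from the matching item in the first array.

```javascript
// Two-cursor walk over two date-sorted arrays; returns the indices in a1
// whose date also occurs in a2.
function intersectIndices(a1, a2) {
  const hits = [];
  let i = 0;
  let j = 0;
  while (i < a1.length && j < a2.length) {
    if (a1[i].date < a2[j].date) i++;
    else if (a1[i].date > a2[j].date) j++;
    else {
      hits.push(i);
      i++;
      j++;
    }
  }
  return hits;
}

// buildRows is a hypothetical helper. Assumption: a2 and a3 hold items that
// also exist in a1, so the price can be taken from the a1 item.
function buildRows(a1, a2, a3) {
  const in2 = new Set(intersectIndices(a1, a2));
  const in3 = new Set(intersectIndices(a1, a3));
  return a1.map((entry, idx) => [
    entry.date,
    in2.has(idx) ? entry.price : '',
    in3.has(idx) ? entry.price : '',
  ]);
}

// Made-up sample data: the price is just the date times ten.
const point = (d) => ({ date: d, price: d * 10 });
const rows = buildRows(
  [1, 2, 3, 4, 5].map(point),
  [2, 3, 5].map(point),
  [1, 4, 5].map(point)
);
console.log(rows.map((r) => r.join(',')).join('\n'));
```

The `Set` turns each index list into a constant-time membership test, so the final pass over the first array stays linear.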

But then, if you would like to come up with a cool tool, why not invent Array.prototype.intersect()?

Array.prototype.intersect = function(...a) {
  return [this,...a].reduce((p,c) => p.filter(e => c.includes(e)));
}
var arrs = [[0,2,4,6,8],[4,5,6,7],[4,6]],
     arr = [0,1,2,3,4,5,6,7,8,9];

document.write("<pre>" + JSON.stringify(arr.intersect(...arrs)) + "</pre>");
Redu