
We have a Mongo collection named analytics that tracks user visits by a cookie id. We want to calculate medians for several variables as users visit different pages.

Mongo does not yet have an internal method for calculating the median. I have used the below method for determining it, but I'm afraid there may be a more efficient way, as I'm pretty new to JS. Any comments would be appreciated.

// Saves the JS function for calculating the median. Makes it accessible to the Reducer.
db.system.js.save({_id: "myMedianValue",
    value: function (sortedArray) {
        var m = 0.0;
        if (sortedArray.length % 2 === 0) {
            // Even-length array: average the middle two values
            var idx2 = sortedArray.length / 2;
            var idx1 = idx2 - 1;
            m = (sortedArray[idx1] + sortedArray[idx2]) / 2;
        } else {
            // Odd-length array: take the middle value
            var idx = Math.floor(sortedArray.length / 2);
            m = sortedArray[idx];
        }
        return m;
    }
});


var mapFunction = function () {
    var key = this.cookieId;
    var value = {
        // If there is only 1 view it will look like this
        // If there are multiple it gets passed to the reduceFunction
        medianVar1: this.Var1,
        medianVar2: this.Var2,
        viewCount: 1
    };

    emit(key, value);
};

var reduceFunction = function (keyCookieId, valueDicts) {
    var Var1Array = [];
    var Var2Array = [];
    var views = 0;

    for (var idx = 0; idx < valueDicts.length; idx++) {
        Var1Array.push(valueDicts[idx].medianVar1);
        Var2Array.push(valueDicts[idx].medianVar2);
        views += valueDicts[idx].viewCount;
    }

    var reducedDict = {
        medianVar1: myMedianValue(Var1Array.sort(function (a, b) { return a - b; })),
        medianVar2: myMedianValue(Var2Array.sort(function (a, b) { return a - b; })),
        viewCount: views
    };

    return reducedDict;
};


db.analytics.mapReduce(mapFunction,
                       reduceFunction,
                       { out: "analytics_medians",
                         query: { Var1: { $exists: true },
                                  Var2: { $exists: true } } });
Crowson
  • What's the use case for finding the median? You want the median of a value over the whole collection, just recomputed constantly as the collection may change? Or you want the median of different query result sets? – wdberkeley Mar 03 '15 at 19:37
  • For example, we track real estate searches and want to know the median list price of the homes they are looking at. And yes, I had been working under the assumption that we would need to recalculate the median as the collection grows. I was thinking we could just retain the middle 3-4 values between calculations, but I'm not sure how to incorporate it yet. – Crowson Mar 03 '15 at 19:46
  • See also https://stackoverflow.com/questions/20456095/calculate-the-median-in-mongodb-aggregation-framework – Dan Dascalescu Sep 25 '19 at 07:01

2 Answers


The simple way to get the median value is to index on the field, then skip to the value halfway through the results.

> db.test.drop()
> db.test.insert([
    { "_id" : 0, "value" : 23 },
    { "_id" : 1, "value" : 45 },
    { "_id" : 2, "value" : 18 },
    { "_id" : 3, "value" : 94 },
    { "_id" : 4, "value" : 52 },
])
> db.test.ensureIndex({ "value" : 1 })
> var get_median = function() {
    // may want { "value" : { "$exists" : true } } if some docs may be missing the value field
    var T = db.test.count()
    // may want to adjust the skip a bit depending on how you compute the median, e.g. in case of even T
    return db.test.find({}, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).skip(Math.floor(T / 2)).limit(1).toArray()[0].value
}
> get_median()
45

It's not amazing because of the skip, but at least the query will be covered by the index. For updating the median, you could be fancier. When a new document comes in or the value of a document is updated, you compare its value to the median. If the new value is higher, you need to adjust the median up by finding the next highest value from the current median doc (or taking an average with it, or whatever computes the new median correctly according to your rules):

> db.test.find({ "value" : { "$gt" : median } }, { "_id" : 0, "value" : 1 }).sort({ "value" : 1 }).limit(1)

You'd do the analogous thing if the new value is smaller than the current median. This bottlenecks your writes on this updating process, and has various cases to think about (how would you allow yourself to update multiple docs at once? update the doc that has the median value? update a doc whose value is smaller than the median to one whose value is larger than the median?), so it might be better just to update occasionally based on the skip procedure.
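If you do want the textbook even-count median with this approach, the skip trick generalizes: fetch the middle one or two index-ordered values and average when the count is even. Here is a minimal sketch in plain JavaScript, using a sorted array to stand in for the index-ordered query results (in the shell you would fetch the middle values with something like `skip(Math.floor((T - 1) / 2)).limit(T % 2 === 0 ? 2 : 1)`):

```javascript
// Median that averages the two middle values when the count is even.
// A plain sorted array stands in for the index-ordered cursor results.
function medianFromSorted(sorted) {
    var T = sorted.length;
    var mid = Math.floor(T / 2);
    if (T % 2 === 0) {
        // Even count: average the two middle values
        return (sorted[mid - 1] + sorted[mid]) / 2;
    }
    // Odd count: the single middle value
    return sorted[mid];
}

console.log(medianFromSorted([18, 23, 45, 52, 94])); // odd count -> 45
console.log(medianFromSorted([18, 23, 45, 52]));     // even count -> 34
```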

wdberkeley
  • The skip method seems like it has too many side-effects for us. Also, when we have an even number of records we would need to calculate an average of the middle two and that logic would need to be built in. We ended up deciding to do the calcs outside of Mongo. – Crowson Mar 12 '15 at 13:16
  • @Crowson you need the average of the two middle values? Really? If you can prove that it makes a difference to your business, then I am sure I can fix your query for you. – Aron Mar 12 '15 at 13:23
  • I had been working under the assumption of doing the real median calculation. If we decide just to use the lower of the two middle values (when presented with an even number of records) would this skip method still be preferable? I can see the computational benefits and there may not be a significant loss for the business case. Thanks for the insight. – Crowson Mar 13 '15 at 14:36

We ended up updating the medians on every page request, rather than in bulk with a cron job or something. We have a Node API that uses Mongo's aggregation framework to do the match/sort on the user's results. The array of results is then passed to a median function within Node, and the result is written back to Mongo for that user. Not super pleased with it, but it doesn't appear to have locking issues and is performing well.
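Roughly, the Node-side flow looks like the sketch below. The names (collection, fields, callback shape) are illustrative rather than our exact API; the aggregation pipeline does the match/sort in Mongo and the median itself is computed in Node over the returned values:

```javascript
// Median over an unsorted array of numbers, computed in Node.
function median(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0
        ? (sorted[mid - 1] + sorted[mid]) / 2  // even count: average middle two
        : sorted[mid];                          // odd count: middle value
}

// With the official Node driver, the surrounding flow is roughly:
//   db.collection("analytics")
//     .aggregate([
//       { $match: { cookieId: cookieId, Var1: { $exists: true } } },
//       { $sort: { Var1: 1 } }
//     ])
//     .toArray(function (err, docs) {
//       var m = median(docs.map(function (d) { return d.Var1; }));
//       // write m back to a per-user medians collection
//     });

console.log(median([250000, 310000, 275000])); // -> 275000
```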

Crowson