3

I have a MongoDB database with a collection of site-events. The documents look like:

{ 
    "_id" : ObjectId("5785bb02eac0636f1dc07023"), 
    "referrer" : "https://example.com",
    "_t" : ISODate("2016-07-12T18:10:17Z"),
    "_p" : "ucd7+hvjpacuhtgbq1caps4rqepvwzuoxm=", 
    "_n" : "visited site", 
    "km screen resolution" : "1680x1050" 
},

{ 
    "_id" : ObjectId("5785bb02eac0636f1dc07047"), 
    "url" : "https://www.example.com/", 
    "referrer" : "Direct", 
    "_t" : ISODate("2016-07-12T18:10:49Z"), 
    "_p" : "txt6t1siuingcgo483aabmses2et5uqk0=", 
    "_n" : "visited site", 
    "km screen resolution" : "1366x768" 
},

{ 
    "_id" : ObjectId("5785bb02eac0636f1dc07053"), 
    "url" : "https://www.example.com/", 
    "referrer" : "Direct", 
    "_t" : ISODate("2016-07-12T18:10:56Z"), 
    "_p" : "gcama1az5jxa74wa6o9r4v/3k+zulciqiu=", 
    "_n" : "visited site", 
    "km screen resolution" : "1366x768" 
}

I want to get a count of the unique persons within a date range. In SQL it would be

SELECT COUNT(DISTINCT(`_p`)) FROM collection WHERE `_t` > '<SOME DATE>' AND `_t` <= '<SOME OTHER DATE>'

So far, I've grouped the dates along using the aggregation pipeline:

db.siteEvents.aggregate(

[
    { 
        $match : {"_n": "visited site"}
    },

    {   
        $group : {

            _id: {
            year : { $year : "$_t" },        
            month : { $month : "$_t" },        
            day : { $dayOfMonth : "$_t" },
            _p : "$_p"
        },

        count: { $sum: 1 }

        }
    }, 

    {
        $group : {

            _id : {
            year : { $year : "$_id.year" },        
            month : { $month : "$_id.month" },        
            day : { $dayOfMonth : "$_id.day" }
            },

            count: { $sum: 1 }
        }
    }
]

);

But this gives errors - I believe because of the second grouping _id trying to grab an intermediate field. I'm currently just using the Mongo shell, but if I had to choose an alternative driver it would be PyMongo. I'd like to get this to work in the shell (so I can understand the process).

styvane
  • 49,879
  • 15
  • 119
  • 132
Jared
  • 2,713
  • 10
  • 31
  • 59

2 Answers2

6

With an aggregation pipeline it could look like so

db.getCollection('siteEvents').aggregate([
    {
        $match: {
            _t: {
                $gt: ISODate("2016-07-11T08:10:17.000Z"),
                $lt: ISODate("2016-07-12T14:10:17.000Z")
            }
        }
    },
    {
        $group: {
            _id: "$_p"
        }
    },
    {
        $group: {
            _id: null,
            distinctCount: { $sum: 1 }
        }
    }
])

If you know the resulting distinct values won't be large then you could use a simply query like so

db.getCollection('siteEvents').distinct(
    '_p',
    { 
        _t: {
            $gt: ISODate("2016-07-11T08:10:17.000Z"),
            $lt: ISODate("2016-07-12T14:10:17.000Z")
        }
    }).length
DAXaholic
  • 28,212
  • 5
  • 58
  • 67
  • Can you explain the purpose of the `_id : null` in the last group? The rest is clear - and I understand what the intermediate docs look like. – Jared Jul 13 '16 at 20:06
  • null in this case is simply a constant, you could also use whatever other constant you want e.g. _id: "MyDistinctCount". The purpose is just to ensure that the group will only produce a single result so that it counts the distinct values of the intermediate docs. – DAXaholic Jul 14 '16 at 03:42
0

You can use the $addToSet operator in the $group stage to return an array of distinct "_p" value then $project the resulted document to return the size of the array which is nothing other than the distinct count.

db.siteEvents.aggregate(
    [
        {"$match": {"_n": "visited site", "_t": {"$gt": <SOME DATE>, "$lt": <SOME OTHER DATE>}}},
        {"$group": {
            "_id": None, 
            "_p_values": {"$addToSet": "$_p"} 
        }}, 
        {"$project": {"_id": 0, "count": {"$size": "$_p_values"}}} 
    ]   
)

For small size collection you can simply use distinct but you need to pass in the query argument.

len(db.siteEvents.distinct("_p", {"_n": "visited site", "_t": {"$gt": <SOME DATE>, "$lt": <SOME OTHER DATE>}}))
styvane
  • 49,879
  • 15
  • 119
  • 132