
For what would be this query in SQL (to find duplicates):

SELECT userId, name FROM col GROUP BY userId, name HAVING COUNT(*)>1

I performed this simple query in MongoDB:

res = db.col.group({key:{userId:true,name:true}, 
                     reduce: function(obj,prev) {prev.count++;}, 
                     initial: {count:0}})

I've added a simple JavaScript loop to go over the result set and applied a filter to find all the fields with a count > 1, like so:

for (i in res) {if (res[i].count>1) printjson(res[i])};
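(Note for anyone reading this later: `group()` was deprecated in MongoDB 3.4 and removed in 4.2.) The group-plus-filter logic above can be sketched in plain JavaScript, runnable in Node, with `docs` as a hypothetical in-memory stand-in for the collection:

```javascript
// Hypothetical stand-in for db.col; field names match the question.
const docs = [
  { userId: 1, name: "a" },
  { userId: 1, name: "a" },
  { userId: 2, name: "b" },
];

// Equivalent of group({key: {userId, name}, reduce: prev.count++, initial: {count: 0}})
const groups = {};
for (const d of docs) {
  const key = JSON.stringify([d.userId, d.name]);
  if (!groups[key]) groups[key] = { userId: d.userId, name: d.name, count: 0 };
  groups[key].count++;
}
const res = Object.values(groups);

// Equivalent of the client-side loop: keep only groups with count > 1
const dupes = res.filter((g) => g.count > 1);
console.log(dupes); // [ { userId: 1, name: 'a', count: 2 } ]
```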

Is there a better way to do this than running JavaScript code on the client? If this is the best/simplest way, say that it is, and this question will help someone :)

shlomoid
    check out the similar question http://stackoverflow.com/questions/4224773/fast-way-to-find-duplicates-on-indexed-column-in-mongodb – RameshVel Apr 05 '11 at 10:24
  • It's similar, but not the same. I'm using a group function, not the map-reduce functionality of MongoDB. – shlomoid Apr 05 '11 at 10:27
  • But this site which is linked from that answer helps, with a simple example: http://csanz.posterous.com/look-for-duplicates-using-mongodb-mapreduce – shlomoid Apr 05 '11 at 10:33

2 Answers


New answer using Mongo aggregation framework

After this question was asked and answered, 10gen released MongoDB version 2.2 with the aggregation framework. The new best way to run this query is:

db.col.aggregate( [
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )
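For readers who want to see what each stage does without a server handy, here is a rough simulation of the three stages over a plain JavaScript array (a sketch with made-up sample data, not the server-side implementation):

```javascript
const col = [
  { userId: 1, name: "a" },
  { userId: 1, name: "a" },
  { userId: 2, name: "b" },
];

// $group: bucket documents by {userId, name}; count: {$sum: 1} adds 1 per doc
const buckets = new Map();
for (const d of col) {
  const k = JSON.stringify([d.userId, d.name]);
  const g = buckets.get(k) || { _id: { userId: d.userId, name: d.name }, count: 0 };
  g.count++;
  buckets.set(k, g);
}
const grouped = [...buckets.values()];

// $match: keep only groups whose count is greater than 1
const matched = grouped.filter((g) => g.count > 1);

// $project: drop _id, lift userId/name back to the top level, keep count
const result = matched.map((g) => ({
  userId: g._id.userId,
  name: g._id.name,
  count: g.count,
}));
console.log(result); // [ { userId: 1, name: 'a', count: 2 } ]
```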

10gen has a handy SQL to Mongo Aggregation conversion chart worth bookmarking.

Old Pro

The answer already given is apt, to be honest, and the use of projection makes it even better thanks to the implicit optimisation working under the hood. I have made a small change, and below I explain why it helps.

The original command

db.getCollection('so').explain(1).aggregate( [
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

Parts from the explain plan

{
    "stages" : [ 
        {
            "$cursor" : {
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "5fa42c8b8778717d277f67c4_test.so",
                    "indexFilterSet" : false,
                    "parsedQuery" : {},
                    "queryHash" : "F301762B",
                    "planCacheKey" : "F301762B",
                    "winningPlan" : {
                        "stage" : "PROJECTION_SIMPLE",
                        "transformBy" : {
                            "name" : 1,
                            "userId" : 1,
                            "_id" : 0
                        },
                        "inputStage" : {
                            "stage" : "COLLSCAN",
                            "direction" : "forward"
                        }
                    },
                    "rejectedPlans" : []
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 6000,
                    "executionTimeMillis" : 8,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 6000,
                    ...

  • The sample set is pretty small: just 6000 documents.
  • This query works on data in the WiredTiger internal cache, so if the collection is huge, all of that data has to be kept in the internal cache for the execution to take place. The WT cache is quite important; if this command takes up that much cache space, the cache will have to be made bigger to accommodate other operations.

Now, a small hack: the addition of an index.

 db.getCollection('so').createIndex({userId : 1, name : 1})                

New Command

db.getCollection('so').explain(1).aggregate( [
    {$match : {name :{ "$ne" : null }, userId : { "$ne" : null } }},
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )
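The only change is the leading `$match`. Its effect, sketched in plain JavaScript over hypothetical documents, is to exclude any document where either field is null or missing before grouping, which is exactly what allows the planner to answer the rest of the pipeline from the compound index:

```javascript
// Hypothetical documents; the last two would be excluded by the leading $match.
const col = [
  { userId: 1, name: "a" },
  { userId: 1, name: "a" },
  { userId: null, name: "x" }, // userId is null
  { name: "orphan" },          // userId is missing entirely
];

// {name: {$ne: null}, userId: {$ne: null}} matches docs where both fields
// exist and are non-null (in MongoDB, a missing field also compares equal to null).
const entering = col.filter((d) => d.userId != null && d.name != null);
console.log(entering.length); // 2
```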

Explain Plan

{
  "stages": [{
        "$cursor": {
          "queryPlanner": {
            "plannerVersion": 1,
            "namespace": "5fa42c8b8778717d277f67c4_test.so",
            "indexFilterSet": false,
            "parsedQuery": {
              "$and": [{
                  "name": {
                    "$not": {
                      "$eq": null
                    }
                  }
                },
                {
                  "userId": {
                    "$not": {
                      "$eq": null
                    }
                  }
                }
              ]
            },
            "queryHash": "4EF9C4D5",
            "planCacheKey": "3898FC0A",
            "winningPlan": {
              "stage": "PROJECTION_COVERED",
              "transformBy": {
                "name": 1,
                "userId": 1,
                "_id": 0
              },
              "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                  "userId": 1.0,
                  "name": 1.0
                },
                "indexName": "userId_1_name_1",
                "isMultiKey": false,
                "multiKeyPaths": {
                  "userId": [],
                  "name": []
                },
                "isUnique": false,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": 2,
                "direction": "forward",
                "indexBounds": {
                  "userId": [
                    "[MinKey, undefined)",
                    "(null, MaxKey]"
                  ],
                  "name": [
                    "[MinKey, undefined)",
                    "(null, MaxKey]"
                  ]
                }
              }
            },
            "rejectedPlans": [{
              "stage": "PROJECTION_SIMPLE",
              "transformBy": {
                "name": 1,
                "userId": 1,
                "_id": 0
              },
              "inputStage": {
                "stage": "FETCH",
                "filter": {
                  "userId": {
                    "$not": {
                      "$eq": null
                    }
                  }
                },
                "inputStage": {
                  "stage": "IXSCAN",
                  "keyPattern": {
                    "name": 1.0
                  },
                  "indexName": "name_1",
                  "isMultiKey": false,
                  "multiKeyPaths": {
                    "name": []
                  },
                  "isUnique": false,
                  "isSparse": false,
                  "isPartial": false,
                  "indexVersion": 2,
                  "direction": "forward",
                  "indexBounds": {
                    "name": [
                      "[MinKey, undefined)",
                      "(null, MaxKey]"
                    ]
                  }
                }
              }
            }]
          },
          "executionStats": {
            "executionSuccess": true,
            "nReturned": 6000,
            "executionTimeMillis": 9,
            "totalKeysExamined": 6000,
            "totalDocsExamined": 0,
            "executionStages": {
              "stage": "PROJECTION_COVERED",
              "nReturned": 6000,
              ...

  • Check the PROJECTION_COVERED stage: this command is a covered query, which basically relies only on data in the indexes.
  • This command won't need to keep the collection's data in the WT internal cache because it never touches the documents at all: totalDocsExamined is 0. Since the data it needs lives in the index, the index is used for execution. This is a big positive for a system where the WT cache is already under pressure from other operations.
  • If there is ever a requirement to search for specific names rather than the whole collection, this index becomes useful too :D
  • The disadvantage here is the addition of an index. If this index is utilised by other operations as well, then there is no disadvantage, to be honest; but if it is an extra addition, it takes more space for the index in the cache, and writes are marginally impacted by the additional index maintenance.

*On the performance front, for 6000 records the time shown is 1 ms higher, but for a larger dataset this may vary. It must be noted that the sample documents I inserted have just three fields: the two used here plus the default _id. If the collection has larger documents, the execution time for the original command will increase, and the volume it occupies in the cache will increase as well.

Gandalf the White