3

What is the fastest and safest strategy for adding a new field to over 100 million MongoDB documents?

Background

  • Using MongoDB 3.0 in a 3-node replica set

  • We are adding a new field (post_hour) that is based on data in another field (post_time) in the same document. The post_hour field is post_time truncated to the hour (see the sketch below).
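
For illustration, deriving the new value from a single document might look like this in the mongo shell, assuming post_time is stored as a BSON Date (the helper and collection names here are placeholders):

// Truncate a BSON Date to the start of its hour (UTC).
function postHourOf(postTime) {
    var hour = new Date(postTime.getTime());
    hour.setUTCMinutes(0, 0, 0);   // zero out minutes, seconds and milliseconds
    return hour;
}

var doc = db.posts.findOne();          // 'posts' is a placeholder collection name
printjson(postHourOf(doc.post_time)); // e.g. ISODate("2016-06-23T09:00:00Z")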

2 Answers

1

I faced a similar scenario in which I had created a script to update around 25 million documents, and it was taking a lot of time to update them all. To improve performance, I inserted the updated documents one by one into a new collection and then renamed the new collection. This approach helped because I was inserting the documents rather than updating them (an 'insert' operation is faster than an 'update' operation).

Here is a sample script (I have not tested it):

/* Returns post_time truncated to the hour (illustrative implementation;
   assumes post_time is a BSON Date). */
function convertPostTimeToPostHour(postTime){
    var postHour = new Date(postTime.getTime());
    postHour.setUTCMinutes(0, 0, 0); // drop minutes, seconds and milliseconds
    return postHour;
}

var totalCount = db.person.count();
var chunkSize = 1000;
var chunkCount = Math.ceil(totalCount / chunkSize); // round up so the last partial chunk is included
var offset = 0;

for (var index = 0; index < chunkCount; index++) {
    // Sort by _id so the skip/limit paging is deterministic.
    var personList = db.person.find().sort({ _id: 1 }).skip(offset).limit(chunkSize);
    personList.forEach(function (person) {
        person.post_hour = convertPostTimeToPostHour(person.post_time);
        db.personNew.insert(person); // insert the enriched document into the new collection
    });
    offset += chunkSize;
}

When the above script is executed, the new collection 'personNew' will contain the updated records with the 'post_hour' field set.

If the existing collection has any indexes, you will need to recreate them on the new collection.

Once the indexes are created, you can rename the collection 'person' to 'personOld' and 'personNew' to 'person'.
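
A minimal sketch of those last two steps in the mongo shell (the post_time index is only an example; recreate whichever indexes the original collection actually has):

// Recreate the original collection's indexes on the new collection first
// (the post_time index below is only an example).
db.personNew.createIndex({ post_time: 1 });

// Swap the collections, keeping the old data around as 'personOld'.
db.person.renameCollection("personOld");
db.personNew.renameCollection("person");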

Manish
  • I guess performing each chunk in a separate shell could also boost performance. –  Jun 23 '16 at 09:27
  • Using 'insert' solved the problem. But I suppose it is better to use a cursor over the whole data set rather than skip/limit, and bulk writes to perform the insert operations (see the sketch below). – peschanko Aug 04 '19 at 06:34
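
As a rough illustration of that last comment, a single cursor over the source collection combined with unordered bulk inserts might look like this (an untested sketch reusing the collection and helper names from the answer above):

var bulk = db.personNew.initializeUnorderedBulkOp();
var count = 0;

// One pass over the whole collection instead of skip/limit paging.
db.person.find().forEach(function (person) {
    person.post_hour = convertPostTimeToPostHour(person.post_time);
    bulk.insert(person);
    count++;

    if (count % 1000 === 0) {   // flush the queued inserts in batches of 1000
        bulk.execute();
        bulk = db.personNew.initializeUnorderedBulkOp();
    }
});

if (count % 1000 !== 0) {       // flush any remaining documents
    bulk.execute();
}
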
-1

The snapshot() prevents duplicates in the query result (documents grow when the new field is added, so they can move and be returned again); it can be removed if it causes any trouble.

Please find the mongo shell script below, where 'a1' is the collection name:

var documentLimit = 1000;

var docCount = db.a1.find({
        post_hour : {
            $exists : false
        }
    }).count();

var chunks = docCount / documentLimit;

for (var i = 0; i <= chunks; i++) {
    db.a1.find({
        post_hour : {
            $exists : false
        }
    }).snapshot()
      .limit(documentLimit)
      .forEach(function (doc) {
        doc.post_hour = 12; // put your real transformation here
        // db.a1.save(doc); // uncomment this line to actually save the data;
                            // you can also specify a write concern here
        printjson(doc);     // comment this line out to avoid polluting the shell output;
                            // this is just for test purposes
    });
}

You can play with the parameters, but since bulk writes are executed in blocks of 1,000 records, that batch size looks optimal.
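
If you want the writes themselves sent in batches, the same update can be expressed with the Bulk API (a sketch under the same assumptions as the script above; replace the fixed value 12 with your real transformation):

var documentLimit = 1000;
var bulk = db.a1.initializeUnorderedBulkOp();
var counter = 0;

db.a1.find({ post_hour: { $exists: false } }).snapshot().forEach(function (doc) {
    // Queue an update for this document; adjust the $set value as needed.
    bulk.find({ _id: doc._id }).updateOne({ $set: { post_hour: 12 } });
    counter++;

    if (counter % documentLimit === 0) {   // send a batch every 1000 queued updates
        bulk.execute();
        bulk = db.a1.initializeUnorderedBulkOp();
    }
});

if (counter % documentLimit !== 0) {       // send the final partial batch
    bulk.execute();
}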

profesor79