
I just want to understand the performance of 'nodetool repair' in a multi-data-center setup with Cassandra 2.

We are planning to have keyspaces with 2-4 replicas in each data center, and we may have several tens of data centers. Writes are done with LOCAL_QUORUM/EACH_QUORUM consistency depending on the situation, and reads are usually done with LOCAL_QUORUM consistency; a sketch of the kind of keyspace definition we have in mind follows the questions. Questions:

  1. Does nodetool repair complexity grow linearly with number of replicas across all data centers?

  2. Or does nodetool repair complexity grow linearly with a combination of the number of replicas in the current data center and the number of data centers? Vaguely, such a model could sync data with each of the individual nodes in the current data center, but perform an EACH_QUORUM-like operation against the replicas in other data centers.

  3. To scale the cluster, is it better to add more nodes in an existing data center or to add a new data center, assuming a constant total number of replicas? I ask this question in the context of nodetool repair performance.
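For concreteness, here is a minimal sketch of the kind of keyspace definition described above, driven through cqlsh. The keyspace name (app_data) and data center names (dc1, dc2, dc3) are placeholders; the per-DC replication factors would be the 2-4 replicas mentioned above.

```
# Hypothetical keyspace using NetworkTopologyStrategy with a per-DC
# replication factor. Keyspace and DC names are placeholders; add one
# entry per data center.
cqlsh <<'EOF'
CREATE KEYSPACE IF NOT EXISTS app_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3,
    'dc3': 3
  };
EOF
```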

vishr

1 Answer


To understand how nodetool repair affects the cluster, or how the cluster size affects repair, we need to understand what happens during repair. There are two phases to repair: the first is building a Merkle tree of the data; the second is having the replicas compare their trees and then stream the differing data to each other as needed.
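If you want to watch these two phases while a repair runs, a couple of standard nodetool commands surface them; the host name below is a placeholder and the exact output varies by version.

```
# Phase 1: Merkle tree construction appears as "Validation" entries in
# the compaction statistics of the nodes building trees.
nodetool -h cass-node-1 compactionstats

# Phase 2: the exchange of differing data appears as active streaming
# sessions between replicas.
nodetool -h cass-node-1 netstats
```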

This first phase can be intensive on disk I/O, since it touches almost all of the data on the disk of the node on which you run the repair. One simple way to keep repair from touching the full disk is to use the -pr flag: with -pr, repair only has to touch roughly disksize/RF of data instead of the full disksize. Running repair on a node also sends a message to all nodes that store replicas of any of these ranges to build Merkle trees as well. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data.
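As a minimal sketch of the -pr approach: run a primary-range repair on every node, one node at a time, so that a full pass covers each range exactly once. Host names and the keyspace name are hypothetical; note that with -pr the pass has to include every node in every data center.

```
#!/bin/sh
# Rolling primary-range repair. -pr restricts each node to the ranges it
# owns as primary, so a complete pass over all nodes (in all data
# centers) repairs every range exactly once. Hosts and keyspace are
# placeholders.
for host in cass-dc1-node1 cass-dc1-node2 cass-dc2-node1 cass-dc2-node2; do
  echo "Repairing primary ranges on ${host}..."
  nodetool -h "${host}" repair -pr app_data
done
```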

The factor that determines how the repair operation affects other data centers is the replica placement strategy. Since you are going to need consistency across data centers (the EACH_QUORUM cases), it is imperative that you use a cross-DC replication strategy, i.e. NetworkTopologyStrategy in your case. For repair, this means that you cannot limit yourself to the local DC while running the repair, since you have some EACH_QUORUM consistency cases. To avoid a repair affecting all replicas in all data centers, you should a) enable the dynamic snitch (which wraps your configured endpoint snitch) and configure the badness threshold properly, and b) use the -snapshot option while running the repair. What this will do is take a snapshot of your data (snapshots are just hardlinks to existing SSTables, exploiting the fact that SSTables are immutable, which makes snapshots extremely cheap) and repair sequentially from the snapshot. This means that for any given replica set, only one replica at a time will be performing the validation compaction, allowing the dynamic snitch to maintain performance for your application via the other replicas.
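As a rough sketch of both suggestions, with a placeholder keyspace name; option names differ a little across Cassandra versions (on some 2.0.x releases sequential, snapshot-based repair is already the default behavior), so check `nodetool help repair` for yours.

```
# a) The dynamic snitch is configured in cassandra.yaml; it wraps the
#    configured endpoint_snitch. The badness threshold (0.1 is the
#    shipped default) controls how aggressively traffic is routed away
#    from replicas that are busy, e.g. with validation compactions:
#
#        dynamic_snitch: true
#        dynamic_snitch_badness_threshold: 0.1

# b) Snapshot-based (sequential) repair, as suggested above.
nodetool repair -snapshot app_data
```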

Now we can answer the questions you have.

  1. Does nodetool repair complexity grow linearly with the number of replicas across all data centers? You can limit this by enabling the dynamic snitch and passing the -snapshot option during repair.

  2. Or does nodetool repair complexity grow linearly with a combination of the number of replicas in the current data center and the number of data centers? The complexity, in terms of running time, will grow with the number of replicas if you use the approach above, because that approach does a sequential repair on one replica at a time.

  3. To scale the cluster, is it better to add more nodes in an existing data center or to add a new data center, assuming a constant total number of replicas? From a nodetool repair perspective, IMO this makes no difference if you take the above approach, since it depends on the overall number of replicas.

Also, the goal of repair using nodetool is to ensure that deletes do not come back. The hard requirement for routine repair frequency is the value of gc_grace_seconds. In systems that seldom delete or overwrite data, you can raise the value of gc_grace_seconds with minimal impact on disk space, which allows wider intervals for scheduling repair operations with the nodetool utility. One of the recommended ways to avoid frequent repairs is to make records immutable by design. This may be important to you, since you will be running tens of data centers and ops will otherwise already be painful.
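For example, gc_grace_seconds is a per-table setting; a minimal sketch of raising it is below. The keyspace/table name and the new value are placeholders (the shipped default is 864000 seconds, i.e. 10 days), and every replica of the table must still be repaired within that window.

```
# Raising gc_grace_seconds widens the window within which every replica
# must be repaired before tombstones are collected. Table name and
# value are placeholders.
cqlsh <<'EOF'
-- 1728000 seconds = 20 days (placeholder value)
ALTER TABLE app_data.events WITH gc_grace_seconds = 1728000;
EOF
```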

trulite
  • OP, please accept this as the answer if you are the person that upvoted this answer. – trulite Aug 04 '14 at 18:16
  • For the cases with writes using EACH_QUORUM, isn't it sufficient to do repair within one data center? The write has already been done with a higher level of consistency; the nodes that need to be corrected are the ones that did not participate in the quorum write. Is this not correct? – vishr Aug 04 '14 at 22:28