
I have a Cassandra playground cluster of 7 nodes (v2.2.4) on server hardware with no network problems; RF is 3. To load data, I started a script that generates test data. The table has about 2 billion records.

I ran a subrange repair while the script was still running. As a result, repair failed on some segments. I then ran upgradesstables, which also finished with errors, so I started sstablescrub. After the sstablescrub, repair failed on some segments again.
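For reference, the sequence of steps described above in a typical form (not necessarily the exact commands that were used; ks1 and t1 are taken from the log fragment below, and the service commands depend on the OS):

nodetool repair -st <start_token> -et <end_token> ks1   # online subrange repair (see the script below)
nodetool upgradesstables ks1                            # online; finished with errors
sudo service cassandra stop                             # sstablescrub is an offline tool, the node must be down
sstablescrub ks1 t1
sudo service cassandra start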

What could be causing the repair failures in my case?

Should sstablescrub be run on every node of the cluster?

Here is the script for the subrange repair; I hope it may be useful to somebody.

# Collect all ring tokens from nodetool ring (Murmur3 tokens, matched as 19-digit numbers)
ring=( $($vCSBIN/nodetool ring | grep -oE '[-]?[0-9]{19}') )

# Repair each range between two consecutive tokens
for ((i=0; i<$((${#ring[@]}-1)); i++));
    do
        echo "st = ${ring[i]}, et = ${ring[i+1]}"
        $vCSBIN/nodetool repair -st "${ring[i]}" -et "${ring[i+1]}"
    done
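Note that the loop only covers the ranges between consecutive tokens as listed by nodetool ring; the wrap-around range from the last token back to the first is not repaired by it. Each call can also be restricted to the affected keyspace and table and its output captured, roughly like this (ks1 and t1 are taken from the log fragment below; the log file name is only an example):

# Example variation of the loop body: limit the repair to one keyspace/table and log the output
$vCSBIN/nodetool repair -st "${ring[i]}" -et "${ring[i+1]}" ks1 t1 >> subrange-repair.log 2>&1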

A fragment of system.log:

INFO  [Thread-59449] 2016-02-22 13:14:08,916 RepairSession.java:237 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] new session: will sync cassandra1111.mydomain.com/10.0.0.0.77, /10.0.0.0.85, /10.0.0.0.192 on range (-4991002611964638502,-4985574971950992136] for ks1.[t1, counters]
INFO  [Repair#1997:1] 2016-02-22 13:14:08,918 RepairJob.java:107 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] requesting merkle trees for t1 (to [/10.0.0.0.85, /10.0.0.0.192, cassandra1111.mydomain.com/10.0.0.0.77])
INFO  [Repair#1997:1] 2016-02-22 13:14:08,918 RepairJob.java:181 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] Requesting merkle trees for t1 (to [/10.0.0.0.85, /10.0.0.0.192, cassandra1111.mydomain.com/10.0.0.0.77])
ERROR [ValidationExecutor:7] 2016-02-22 13:14:08,919 Validator.java:246 - Failed creating a merkle tree for [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]], /10.0.0.0.77 (see log for details)
INFO  [AntiEntropyStage:1] 2016-02-22 13:14:08,920 RepairSession.java:181 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] Received merkle tree for t1 from /10.0.0.0.77
WARN  [RepairJobTask:1] 2016-02-22 13:14:08,920 RepairJob.java:162 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] t1 sync failed
INFO  [Repair#1997:2] 2016-02-22 13:14:08,920 RepairJob.java:107 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] requesting merkle trees for counters (to [/10.0.0.0.85, /10.0.0.0.192, cassandra1111.mydomain.com/10.0.0.0.77])
org.apache.cassandra.exceptions.RepairException: [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]] Validation failed in cassandra1111.mydomain.com/10.0.0.0.77
INFO  [Repair#1997:2] 2016-02-22 13:14:08,920 RepairJob.java:181 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] Requesting merkle trees for counters (to [/10.0.0.0.85, /10.0.0.0.192, cassandra1111.mydomain.com/10.0.0.0.77])
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.cassandra.exceptions.RepairException: [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]] Validation failed in cassandra1111.mydomain.com/10.0.0.0.77
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]] Validation failed in cassandra1111.mydomain.com/10.0.0.0.77
INFO  [AntiEntropyStage:1] 2016-02-22 13:14:08,920 RepairSession.java:181 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] Received merkle tree for t1 from /10.0.0.0.85
ERROR [RepairJobTask:1] 2016-02-22 13:14:08,921 RepairSession.java:290 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] Session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]] Validation failed in cassandra1111.mydomain.com/10.0.0.0.77
INFO  [AntiEntropyStage:1] 2016-02-22 13:14:08,921 RepairSession.java:181 - [repair #110b9d40-d990-11e5-a89b-41ca3fbac573] Received merkle tree for t1 from /10.0.0.0.192
ERROR [RepairJobTask:1] 2016-02-22 13:14:08,921 RepairRunnable.java:243 - Repair session 110b9d40-d990-11e5-a89b-41ca3fbac573 for range (-4991002611964638502,-4985574971950992136] failed with error [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]] Validation failed in cassandra1111.mydomain.com/10.0.0.0.77
org.apache.cassandra.exceptions.RepairException: [repair #110b9d40-d990-11e5-a89b-41ca3fbac573 on ks1/t1, (-4991002611964638502,-4985574971950992136]] Validation failed in cassandra1111.mydomain.com/10.0.0.0.77
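The Validator.java error above only says "(see log for details)", so the underlying exception (often a corrupt SSTable hit while building the merkle tree) has to be read from system.log on the node that failed validation, 10.0.0.0.77 in this session. A quick way to pull it out on that node, assuming the default log location:

# On 10.0.0.0.77: show validation errors and the stack traces that follow them
grep -A 20 'ERROR \[ValidationExecutor' /var/log/cassandra/system.log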
  • Please add the system.log entries here from the affected node. (probably /var/log/cassandra/system.log) – bechbd Feb 24 '16 at 18:31
  • "Then was start the upgradesstables procedure which was executed with errors too, so sstablescrub was started" --> why are you running upgradesstables after repair ? Upgrade sstable is only required when you're upgrading Cassandra version ... – doanduyhai Feb 24 '16 at 18:59
  • @doanduyhai https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair says to run `nodetool scrub`, then `sstablescrub`. But http://docs.datastax.com/en/cassandra/2.2/cassandra/tools/toolsScrub.html says "if possible use nodetool upgradesstables." I've tried `nodetool repair` -> `nodetool scrub` -> `sstablescrub` (on some nodes). After that I started `repair` again and got a fail status. Now I'm trying to run sstablescrub on all nodes. – Dimaf Feb 24 '16 at 19:19
  • Ok I didn't have all the context. Please notice that you should probably wait for the nodetool repair to stop (all repair sessions succeeded or failed) before attempting to do a nodetool scrub or sstablescrub – doanduyhai Feb 24 '16 at 19:38
  • @doanduyhai, of course, the tasks were run in series. The main question is: why isn't the repair successful? – Dimaf Feb 24 '16 at 19:43
  • Are you aware that cassandra must be *offline* when running `sstablescrub`? – Dirk Lachowski Mar 10 '16 at 16:55
  • https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableScrub_t.html "2. Shut down the node." – Dimaf Mar 10 '16 at 17:27
  • maybe you can set Cassandra logging level to DEBUG when you do repair. This can collect more information (a sketch follows below). I saw an example of how to do that: http://www.informit.com/articles/article.aspx?p=2169293 – Zhong Hu Sep 02 '17 at 04:10
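Regarding the last comment: the repair logging level can also be raised at runtime with nodetool setlogginglevel, without editing logback.xml or restarting the node. A minimal sketch (the package/class names below are the usual repair and validation classes; revert by setting them back to INFO):

# Raise repair-related logging to DEBUG at runtime on the node being investigated
nodetool setlogginglevel org.apache.cassandra.repair DEBUG
nodetool setlogginglevel org.apache.cassandra.service.ActiveRepairService DEBUG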

0 Answers