
I've been tracking the growth of some big Cassandra tables using Spark rdd.count(). Until now the behavior was consistent with expectations: the number of rows was constantly growing.
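For context, the count is a full-table scan done roughly like the sketch below; the spark-cassandra-connector import and the contact point, keyspace, and table names are assumptions, not my exact setup:

```scala
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical connection settings and names; adjust for your cluster.
val conf = new SparkConf()
  .setAppName("table-growth-count")
  .set("spark.cassandra.connection.host", "10.0.0.1")
val sc = new SparkContext(conf)

// Full-scan row count; this is the figure being tracked over time.
val rows = sc.cassandraTable("my_keyspace", "my_table").count()
println(s"Row count: $rows")
```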

Today I ran nodetool cleanup on one of the seeds and, as usual, it ran for 50+ minutes.

And now rdd.count() returns one third of the rows it did before...

Did I destroy data using nodetool cleanup? Or is the Spark count unreliable and was counting ghost keys? I got no errors during cleanup, and the logs don't show anything out of the ordinary. It did seem like a successful operation, until now.

Update 2016-11-13

Turns out the Cassandra documentation set me up for the loss of 25+ million rows of data.

The documentation is explicit:

Use nodetool status to verify that the node is fully bootstrapped and all other nodes are up (UN) and not in any other state. After all new nodes are running, run nodetool cleanup on each of the previously existing nodes to remove the keys that no longer belong to those nodes. Wait for cleanup to complete on one node before running nodetool cleanup on the next node.

Cleanup can be safely postponed for low-usage hours.
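In shell terms, the documented procedure amounts to something like the sketch below; the nodetool commands are real, but the hostnames are hypothetical:

```shell
# Verify every node reports UN (Up/Normal) before doing anything else.
nodetool status

# Then run cleanup on each PRE-EXISTING node, one at a time,
# waiting for each run to finish before starting the next.
for host in node1 node2 node3; do   # hypothetical hostnames
  ssh "$host" nodetool cleanup
done
```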

Well, you check the status of the other nodes via nodetool status and they are all Up and Normal (UN). BUT here's the catch: you also need to run nodetool describecluster, where you might find that the schemas are not synced.
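For illustration, the check that would have caught this looks like the following. The command is real; the output shape is roughly what nodetool describecluster prints, with made-up IPs and schema version IDs:

```shell
$ nodetool describecluster
Cluster Information:
        Name: MyCluster
        Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                ea63e099-37c5-3d7b-9ace-32f4c833653d: [10.0.0.1, 10.0.0.2, 10.0.0.3]
                2207c2a9-f598-3971-986b-2926e09e239d: [10.0.0.4]
```

More than one entry under Schema versions means the cluster is not in schema agreement; wait until a single version is listed for all nodes before running cleanup.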

My schemas were not synced, yet I ran cleanup when all nodes were UN, up and running normally, as per the documentation. The Cassandra documentation does not mention nodetool describecluster after adding new nodes.

So I merrily added nodes, waited till they were UN (Up / Normal) and ran cleanup.

As a result, 25+ million rows of data are gone. I hope this helps others avoid this dangerous pitfall. Basically, the DataStax documentation sets you up to destroy data by recommending cleanup as a step in the process of adding new nodes.

In my opinion, that cleanup step should be taken out of the new-node procedure documentation altogether. It should be mentioned elsewhere that cleanup is good practice, but not in the same section as adding new nodes. It's like recommending rm -rf / as one of the steps for virus removal: sure, it will remove the virus...

Thank you, Aravind R. Yarram, for your reply. I came to the same conclusion and came here to update this. Appreciate your feedback.

Jose Fonseca

1 Answer


I am guessing you might have either added/removed nodes from the cluster or decreased the replication factor before running nodetool cleanup. Until you run the cleanup, I guess Cassandra still reports the old key ranges as part of rdd.count(), since the old data still exists on those nodes.
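As a sanity check before any cleanup, one could at least inspect per-keyspace ownership and scope the operation narrowly; a sketch with a hypothetical keyspace name (both command forms exist in nodetool):

```shell
# Passing a keyspace to status shows effective ownership per node.
nodetool status my_keyspace

# Cleanup can be scoped to a single keyspace instead of run node-wide.
nodetool cleanup my_keyspace
```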

Reference: https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCleanup.html

Aravind Yarram
Turns out this is exactly what happened. This is so frustrating; the Cassandra documentation says to run cleanup explicitly after adding nodes. – Jose Fonseca Nov 13 '16 at 12:12