Update - Short version:
The PropertyFileSnitch file cassandra-topology.properties on the first 3 nodes (Racks 1-3) lists only those nodes in DC1 and assigns every other node to DC2 via the default entry default=DC2:r1. When the cluster was scaled up by adding nodes 4 and 5, the PropertyFileSnitch on the new nodes was configured to place them in DC1 as well, in Racks 4 and 5, but the snitch file on the first 3 nodes was left unchanged; as a result the cluster is in this inconsistent state.
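To make the mismatch concrete, the two diverging files presumably look roughly like this (the IPs and rack names are taken from the nodetool status output further down; the exact file contents are my reconstruction, not a copy):

    # cassandra-topology.properties on nodes 1, 2 and 3 (stale)
    10.0.0.10=DC1:Rack1
    10.0.0.11=DC1:Rack2
    10.0.0.12=DC1:Rack3
    # nodes 4 and 5 are not listed, so they fall back to the default
    default=DC2:r1

    # cassandra-topology.properties on nodes 4 and 5 (updated)
    10.0.0.10=DC1:Rack1
    10.0.0.11=DC1:Rack2
    10.0.0.12=DC1:Rack3
    10.0.0.13=DC1:Rack4
    10.0.0.14=DC1:Rack5
    default=DC2:r1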
My question is whether this cluster can be rebalanced (fixed). Would it suffice to do a full cluster restart after fixing cassandra-topology.properties?
Please advise on how I can safely rebalance the cluster.
Longer version:
I am new to Cassandra and I started working on an already built cluster.
I have 5 nodes in the same data center, on different racks, running Cassandra version 3.0.5 with vnodes (num_tokens: 256) and a keyspace with replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'} AND durable_writes = true.
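Assuming the keyspace is called my_keyspace (the real name is omitted here), its definition looks roughly like this:

    CREATE KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3'}
      AND durable_writes = true;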
Historically there were only 3 nodes, and the cluster was scaled up with an additional 2 nodes. I have an automatic repair script that runs nodetool repair with the options parallelism: parallel, primary range: false, incremental: true, job threads: 1.
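As far as I understand, with the 3.0.5 defaults (incremental and parallel repair, no primary-range restriction) those options boil down to something like the following command; the exact flags are my reading of the option list above:

    # incremental, parallel, non-primary-range repair with a single job thread
    nodetool repair -j 1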
After a large amount of data was inserted, the problems started to appear. When the repair script runs on node 4 or 5, node 2 gets overloaded: CPU usage stays at 100%, the MutationStage queue grows, and GC pauses take at least 1 s, until the Cassandra process finally dies. The repair usually fails with the error Stream failed (progress: 0%).
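The growing MutationStage backlog on node 2 can be watched with nodetool tpstats, for example:

    # run on node 2 during the repair; the Pending column keeps growing
    nodetool tpstats | grep -i mutation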
When running the nodetool status command on nodes 1, 2 or 3, I get the following output:
    Datacenter: DC2
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns (effective)  Host ID   Rack
    UN  10.0.0.13  10.68 GB  256     0.0%              75e17b8a  r1
    UN  10.0.0.14  9.43 GB   256     0.0%              21678ddb  r1
    Datacenter: DC1
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns (effective)  Host ID   Rack
    UN  10.0.0.10  16.14 GB  256     100.0%            cf9d327f  Rack1
    UN  10.0.0.11  22.83 GB  256     100.0%            e725441e  Rack2
    UN  10.0.0.12  19.66 GB  256     100.0%            95b5c8e3  Rack3
But when running the nodetool status command on nodes 4 or 5, I get the following output:
    Datacenter: DC1
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load      Tokens  Owns (effective)  Host ID   Rack
    UN  10.0.0.13  10.68 GB  256     58.9%             75e17b8a  Rack4
    UN  10.0.0.14  9.43 GB   256     61.1%             21678ddb  Rack5
    UN  10.0.0.10  16.14 GB  256     60.3%             cf9d327f  Rack1
    UN  10.0.0.11  22.83 GB  256     61.4%             e725441e  Rack2
    UN  10.0.0.12  19.66 GB  256     58.3%             95b5c8e3  Rack3
After further investigation it seems that the PropertyFileSnitch file cassandra-topology.properties was not updated on nodes 1, 2 and 3 (which are also the seed nodes for this cluster) after the cluster was scaled up.
Thanks!