
I've got a Cassandra cluster (3 nodes, all nodes deployed to AWS) that I am trying to migrate over to a DataStax cluster. It's simply time to stop managing these nodes myself.

I have multiple producers and consumers reading/writing data, all day long, against my Cassandra cluster. I don't have the option of putting an app/service/proxy in front of my Cassandra cluster and then just flipping a switch so that all reads/writes cut over cleanly from my Cassandra cluster to DataStax. So there's no clean way to migrate the tables one at a time. I'm also trying to achieve zero (or near-zero) downtime for all producers/consumers of the data. One hard requirement: the migration cannot be lossy. No lost data!

I'm thinking the best strategy here is a four step process:

  1. Somehow, configure DataStax to be a replica of my Cassandra cluster, effectively creating streaming replication to DataStax
  2. Once DataStax is totally "caught up" with the nodes in my current Cassandra cluster, keep the producers writing to the current cluster, but cut the consumers/readers over to DataStax (that is, reconfigure them to connect to DataStax, and then restart them). Not zero downtime, but I can probably live with a simple restart. (Again, zero-downtime solutions are greatly preferred.)
  3. Cut the producers over to DataStax. Again, only near-zero-downtime, as this involves reconfiguring the producers to point to DataStax, and then requires a restart to pick up the new configs. Zero-downtime solutions would be preferred.
  4. Once replication traffic from the "old" Cassandra cluster drains to zero, we now have no "new" information that my non-DataStax nodes need to write to DataStax. Kill those nodes with fire.

This solution is the most minimally-invasive, closest-to-zero-downtime solution I can come up with, but assumes a few things:

  • Perhaps it is not possible to treat DataStax like an extra node that can be replicated to (yes/no?)
  • Perhaps Cassandra and/or DataStax have some magical features/capabilities that I don't know about, that can handle migrations better than this solution; or perhaps there are 3rd party (ideally open source) tools that could handle this better
  • I have no idea how I would monitor replication "traffic" coming from the "old" Cassandra nodes into DataStax. Would need to know how to do this before I could safely shutdown + kill the old nodes (again, can't lose data).

I guess I'm wondering if this strategy is: (1) doable/feasible, and (2) optimal; and if there are any features/tools in the Cassandra/DataStax ecosystem that I could leverage to make this any better (faster and with zero downtime).

smeeb

2 Answers


The four steps you've outlined are definitely a viable way to go. There's also the route of doing a simple rolling binary install: https://docs.datastax.com/en/latest-upgrade/upgrade/datastax_enterprise/upgrdCstarToDSE.html

I'll speak in the context of the steps you provided above. If you're curious about the rolling binary install, we can definitely chat about that as well.

Note: the doc links below are specific to Cassandra 3.0 (DataStax Enterprise 5.0) - make sure the doc versions match your Cassandra version.

If the major Cassandra version of your current cluster matches the major Cassandra version bundled in DataStax Enterprise, you should be able to add the 'DataStax' nodes as a new DC in the same cluster your current Cassandra environment belongs to, following: http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html - that will bring the existing data from the existing Cassandra DC into the DataStax DC.
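The steps on that page boil down to a handful of settings on the new nodes before their first start. A minimal sketch, assuming the new DC is named DataStaxDC (all names here are illustrative placeholders, not taken from the docs):

```shell
# On each new DataStax node, before first start (names are illustrative):
#
# cassandra.yaml:
#   cluster_name: 'MyCluster'      # must match the existing cluster exactly
#   seed_provider seeds: include seed IPs from BOTH datacenters
#   auto_bootstrap: false          # data is streamed later via nodetool rebuild
#
# cassandra-rackdc.properties (read by GossipingPropertyFileSnitch):
#   dc=DataStaxDC
#   rack=rack1

# After starting each node, confirm it shows up under the new DC:
nodetool status
```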

If you're mismatching Cassandra versions (current Cassandra is older/newer than DataStax Cassandra), then you may want to reach out to DataStax via https://academy.datastax.com/slack as the process will be more specific to your environment and can vary greatly.

As outlined in the docs, you'll want to run

```
ALTER KEYSPACE "your-keyspace" WITH REPLICATION =
  {'class': 'NetworkTopologyStrategy', 'OldCassandraDC': 3, 'DataStaxDC': 3};
```

(obviously changing DC name and replication factor to your specs)

This will make sure new data from your producers will replicate to the new DataStax nodes.
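A quick sanity check to confirm the new replication settings took effect (keyspace name is a placeholder):

```shell
# Show the keyspace's replication map; both DCs should appear in the output:
cqlsh -e "DESCRIBE KEYSPACE your_keyspace" | grep -i replication
```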

You can then run nodetool rebuild -- name_of_existing_data_center from the DataStax nodes to stream data over from the existing Cassandra nodes. Depending on how much data there is, it may be somewhat time consuming but it's the easiest, most hands off way to do it.
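For example, on each new DataStax node (the DC name here is illustrative):

```shell
# Stream existing data from the old DC into this node.
# Run once per new node; the '--' separates nodetool options from arguments.
nodetool rebuild -- OldCassandraDC
```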

You would then want to update the contact points in your producers/consumers one by one before decommissioning the old Cassandra DC.

A few tips from my experience:

  • Make sure your DataStax nodes are using GossipingPropertyFileSnitch in cassandra.yaml before starting those nodes.
  • When running nodetool rebuild, do it inside screen (or tmux) so that you can see when it completes (or errors). Otherwise, you would have to monitor progress using nodetool netstats and checking streaming activity.
  • Have OpsCenter up and running to monitor what's going on in the DataStax cluster during the rebuilds. You can keep an eye on streaming throughput, pending compactions, and other Cassandra specific metrics.
  • When it comes time to decommission the old DC, make sure you follow these steps: http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsDecomissionDC.html
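On the asker's question of how to watch replication/streaming traffic before killing the old nodes: a crude but workable approach is to poll nodetool netstats on each node, sketched here (the interval and line count are arbitrary choices):

```shell
# Poll streaming activity; once a rebuild finishes, netstats stops listing
# active stream sessions (an idle node reports "Not sending any streams").
while true; do
  nodetool netstats | head -n 20   # glance at current streaming sessions
  sleep 30
done
```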

Hope that helps!

MarcintheCloud
  • FWIW I'm sitting in a training session with several DataStax engineers (actual engineers working for DataStax) and they are saying this solution isn't possible, and that cross-DC replication does not work/exist. Not saying @MarcintheCloud is wrong, just sharing this info for any future-comers. – smeeb Feb 21 '17 at 15:48
  • Thanks for following up Smeeb - if you want to email me at marc.selwan@datastax.com, happy to discuss this further :) – MarcintheCloud Feb 22 '17 at 16:02

I presume you mean the Datastax Managed product, where they run cassandra for you. If you just mean "run DSE on your own AWS instances", you can do a binary upgrade in-place.

The questions you asked are best asked of Datastax - if you're going to pay them, you may as well ask them questions (that's what customers do).

Your 4-step approach is mostly pretty logical, but probably overly complex. Most Cassandra drivers will auto-discover new hosts and auto-evict old/leaving hosts, so once you have all the new Datastax Managed nodes in the cluster (assuming they allow that), you can run repair to guarantee consistency, then decommission your existing nodes - your app will keep working (isn't Cassandra great?). You'll want to update your app config / endpoints to include the new Datastax Managed nodes, but that doesn't need to be done in advance.

The one caveat here is the latency involved - going from your environment to Datastax Managed may introduce latency. In that case, you have an intermediate step you can consider where you add the Datastax Managed nodes as a different "Datacenter" within cassandra, expand the replication factor, and use LOCAL_ consistency levels to control which DC gets the queries (and then you CAN move your producers/consumers over individually).
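For instance, from cqlsh (keyspace/table names are placeholders; the drivers expose an equivalent per-session or per-query consistency setting):

```shell
# Pin reads to replicas in the local DC only:
cqlsh -e "CONSISTENCY LOCAL_QUORUM; SELECT * FROM your_keyspace.your_table LIMIT 1;"
```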

Jeff Jirsa