
First of all, I am not using DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.

I have a 2-node Cassandra cluster and I've managed to set up Spark on a single node, but I couldn't find any online resources about setting it up on a multi-node cluster.

This is not a duplicate of "how to setup spark Cassandra multi node cluster?".

To set it up on a single node, I followed this tutorial: "Setup Spark with Cassandra Connector".

RoyaumeIX

1 Answer


You have two high-level tasks here:

  1. set up Spark (single node or cluster);
  2. set up Cassandra (single node or cluster).

These tasks are different and not related to each other (unless we are talking about data locality). How to set up Spark as a cluster is described in the Architecture overview. Generally there are two options: standalone, where you install Spark on the hosts directly, or running it under a task scheduler (Yarn, Mesos); which one to pick should follow from your requirements. Since you built everything yourself, I suppose you will use the standalone installation.

The difference from a single-node setup is network communication. By default Spark runs on localhost; more commonly it uses FQDNs, so you should configure them in /etc/hosts and check hostname -f, or fall back to IP addresses. Take a look at this page, which lists all the ports needed for communication between nodes. All of them should be open and reachable between the nodes. Be aware that by default Spark uses TorrentBroadcastFactory with random ports.
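
As a minimal sketch (not from the answer itself) of what a standalone configuration could look like: the hostnames spark-master/spark-worker and the 10.0.0.x private IPs below are assumptions you would replace with your own Azure VM names and addresses. Pinning the normally random driver and block-manager ports lets you open them explicitly in the network security group:

```
# conf/spark-env.sh on every node (hostname and IP are placeholders)
SPARK_MASTER_HOST=spark-master                      # must resolve on all nodes (/etc/hosts or DNS)
SPARK_LOCAL_IP=10.0.0.5                             # this node's own private IP

# conf/spark-defaults.conf -- pin ports that are otherwise chosen randomly
spark.driver.port          7078
spark.blockManager.port    7079
spark.port.maxRetries      16

# start the standalone cluster
# on the master:
$SPARK_HOME/sbin/start-master.sh                    # master listens on :7077, web UI on :8080
# on each worker:
$SPARK_HOME/sbin/start-worker.sh spark://spark-master:7077   # start-slave.sh on older Spark releases
```

If the workers can reach the master, they should show up in the master's web UI on port 8080.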

For Cassandra, see these docs: 1, 2, and tutorials: 3, etc. You will most likely need 4. You could also run Cassandra inside Mesos using Docker containers.
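
For illustration, the handful of cassandra.yaml settings that usually need changing for a small multi-node cluster are sketched below; the cluster name and the 10.0.0.x addresses are assumptions, substitute your VMs' private IPs:

```yaml
# /etc/cassandra/cassandra.yaml (excerpt) -- edit on every node
cluster_name: 'AzureCluster'          # hypothetical name, must be identical on both nodes
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.4"             # private IP of the node chosen as seed (assumption)
listen_address: 10.0.0.5              # this node's own private IP (gossip/storage traffic)
rpc_address: 10.0.0.5                 # address clients, including the Spark connector, connect to
endpoint_snitch: GossipingPropertyFileSnitch
```

After restarting Cassandra on both nodes, nodetool status should report both of them as UN (Up/Normal).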

P.S. If data locality is what you are after, you will have to come up with something of your own, because neither Mesos nor Yarn handles scheduling Spark jobs for partitioned data closer to the Cassandra partitions.
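
Once both clusters are running, pointing Spark at Cassandra through the connector mentioned in the question comes down to a couple of settings. A hedged example follows; the master hostname, node IPs, and connector version are placeholders you would adjust to your own hosts and Spark/Scala versions:

```
$SPARK_HOME/bin/spark-shell \
  --master spark://spark-master:7077 \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --conf spark.cassandra.connection.host=10.0.0.4,10.0.0.5
```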

egorlitvinenko