
I'm using Cassandra 2.1.5 (.469) with Spark 1.2.1.

I performed a migration job with Spark on a big C* table (2,034,065,959 rows), migrating it to a table with a new schema (new_table), using:

    some_mapped_rdd.saveToCassandra("keyspace", "new_table", writeConf = WriteConf(parallelismLevel = 50))
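
For context, here is roughly what the full job looks like end to end (the source table, column names, and contact point below are placeholders, not my actual schema):

    // Minimal migration sketch, assuming a hypothetical source table "old_table"
    // with columns (id, ts, json); adjust names to match the real schema.
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.WriteConf
    import org.apache.spark.{SparkConf, SparkContext}

    object MigrateToNewTable {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("migrate-to-new-table")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point

        val sc = new SparkContext(conf)

        // Read every row of the source table, remap it to the new schema,
        // and write it out with a capped number of concurrent batches per task.
        sc.cassandraTable("keyspace", "old_table")
          .map(row => (row.getString("id"), row.getLong("ts"), row.getString("json")))
          .saveToCassandra(
            "keyspace",
            "new_table",
            SomeColumns("id", "ts", "json"),
            writeConf = WriteConf(parallelismLevel = 50))
      }
    }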

In OpsCenter/Activities I can see that C* is running compaction tasks on new_table, and this has been going on for a few days.

In addition, I'm trying to run another job while the compaction tasks are still running, using:

    import com.datastax.spark.connector._ // provides joinWithCassandraTable

    // join with Cassandra
    val rdd = some_array.map(x => SomeClass(x._1, x._2)).joinWithCassandraTable(keyspace, some_table)

    // get only the JSON strings and register them as an RDD temp table
    val jsons = rdd.map(_._2.getString("this"))
    val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
    jsonSchemaRDD.registerTempTable("this_json")

This job takes much longer than usual to finish (usually I don't perform huge migration tasks).
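
For reference, the registered temp table is then queried along these lines (the column name below is just a placeholder, not a real field from my data):

    // query the registered temp table; "some_field" is a placeholder
    // column name from the JSON documents, not an actual field
    val results = sqlContext.sql("SELECT some_field FROM this_json LIMIT 10")
    results.collect().foreach(println)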

So do the compaction processes in C* influence Spark jobs?

EDIT:

My table is configured with the (default) SizeTieredCompactionStrategy, and I have ~2882 SSTable files of ~20 MB (and smaller, on 1 node out of 3), so I guess I should raise the compaction_throughput_mb_per_sec parameter and switch to DateTieredCompactionStrategy, as my data is time-series data.

Reshef
  • Just curious for my own benefit, what compaction strategy are you using for your tables that show a lot of pending compactions? Some compaction strategies may be better suited for bulk loading and your query patterns. – Andy Tolbert Jan 18 '16 at 16:39
  • It'd also be good to know how many SSTable files (*Data.db files) you have for tables with lots of compaction activity. Wondering if you have a lot of small files, which could explain slow query times and also the need for increased compaction activity. – Andy Tolbert Jan 18 '16 at 16:41
  • @Andy Tolbert - please see my edit. I wondered whether the replication factor (I set it to 2 while I have 3 nodes) influences the compaction of the SSTables and/or the rate of reads and writes to the cluster? – Reshef Jan 18 '16 at 17:51

1 Answer


Compaction can potentially use a lot of system resources, so it can influence your Spark jobs from a performance standpoint. You can control how much throughput compactions are allowed to consume at a time via compaction_throughput_mb_per_sec.

On the other hand, reducing compaction throughput will make your compactions take longer to complete.

Additionally, the fact that compaction is happening could mean that the way your data is distributed among SSTables is not optimal. So compaction could be a symptom of the issue rather than the actual issue. In fact, it could be the solution to your problem (over time, as it makes more progress).

I'd recommend taking a look at the cfhistograms output of the tables you are querying to see how many SSTables are being hit per read. That could be a good indicator that something is suboptimal, such as needing to change your configuration (e.g. memtable flush rates) or to tune or change your compaction strategy.

This answer provides a good explanation of how to read cfhistograms output.

Andy Tolbert