I'm using Cassandra 2.1.5 (.469) with Spark 1.2.1.
I ran a migration job with Spark on a big C* table (2,034,065,959 rows), migrating it to a table with a different schema (new_table), using:
some_mapped_rdd.saveToCassandra("keyspace", "new_table", writeConf=WriteConf(parallelismLevel = 50))
I can see in OpsCenter/Activities that C* is running compaction tasks on new_table, and they have been going on for a few days.
In addition, I'm trying to run another job while the compaction tasks are still running, using:
//join with cassandra
val rdd = some_array.map(x => SomeClass(x._1,x._2)).joinWithCassandraTable(keyspace, some_table)
//get only the jsons and create rdd temp table
val jsons = rdd.map(_._2.getString("this"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("this_json")
and it takes much longer than usual to finish (I don't usually perform huge migration tasks).
So do the compaction processes in C* affect Spark jobs?
EDIT:
My table is configured with SizeTieredCompactionStrategy (the default), and I have about 2,882 SSTable files of about 20M each (and smaller, on 1 node out of 3). So I guess I should raise the compaction_throughput_mb_per_sec parameter and switch to DateTieredCompactionStrategy, since my data is time series data.
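For reference, the strategy change I have in mind would look roughly like the sketch below. It only builds the CQL text and does not connect to the cluster. The object name, the helper name, and the "my_keyspace" placeholder are made up for illustration, and I have not set any DateTieredCompactionStrategy sub-options, so take it as an assumption about the shape of the statement rather than a tested setting.

```scala
// Sketch only: builds the CQL statement as a string; nothing is executed here.
object CompactionSketch {
  // Hypothetical helper: returns an ALTER TABLE statement that switches
  // the compaction strategy of the given table.
  def alterCompaction(keyspace: String, table: String, strategy: String): String =
    s"ALTER TABLE $keyspace.$table WITH compaction = {'class': '$strategy'}"

  def main(args: Array[String]): Unit = {
    // "my_keyspace" is a placeholder; new_table is the migrated table.
    println(alterCompaction("my_keyspace", "new_table", "DateTieredCompactionStrategy"))
  }
}
```

The resulting statement could then be pasted into cqlsh once the pending compactions have settled.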