
I'm trying to append an entry to an existing RDD on each iteration of a loop. My code so far is:

import org.apache.spark.mllib.linalg.distributed.MatrixEntry

var newY = sc.emptyRDD[MatrixEntry]
for (j <- 0 until 8000) {
  var arrTmp = Array(MatrixEntry(j, j, 1))
  var rddTmp = sc.parallelize(arrTmp)
  newY = newY.union(rddTmp)
}

Running all 8000 iterations, I get an error when I try to take(10) from that RDD, but with a smaller number of iterations everything is fine. The error is:

Exception in thread "main" java.lang.StackOverflowError
  at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
  at scala.collection.immutable.List.map(List.scala:296)
  at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:84)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
  at scala.Option.getOrElse(Option.scala:121)

Help?

Ardit Meti

1 Answer


The problem you hit is a duplicate of Stackoverflow due to long RDD Lineage, but your code shouldn't be written like this at all.

If you want an identity matrix, just map over a range:

val newY = spark.sparkContext.range(0, 8000).map(j => MatrixEntry(j, j, 1))
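
For completeness, here is a self-contained version of that one-liner; the imports, the take(10) check, and the CoordinateMatrix wrapper are my additions, assuming the end goal is the mllib distributed matrix type that MatrixEntry is usually paired with:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// One distributed job, no driver-side loop: each diagonal entry (j, j, 1.0)
// is produced directly on the executors.
val newY = spark.sparkContext.range(0, 8000).map(j => MatrixEntry(j, j, 1.0))
newY.take(10)  // works without a StackOverflowError, lineage depth stays constant

// Optionally wrap the entries if a CoordinateMatrix is what you actually need.
val identity = new CoordinateMatrix(newY, 8000, 8000)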

A loop with parallelize doesn't scale and keeps all the data in driver memory; see Why does SparkContext.parallelize use memory of the driver?
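
If you really do have to assemble the result from many small RDDs in a loop, one workaround (a hedged sketch, not part of the original answer) is to collect the pieces first and make a single SparkContext.union call, which builds one flat UnionRDD instead of a chain 8000 levels deep. Note that this only avoids the stack overflow and does nothing about the driver-memory cost of parallelize mentioned above:

import org.apache.spark.mllib.linalg.distributed.MatrixEntry

// Build the small RDDs first, then union them all at once:
// one UnionRDD with 8000 parents rather than 8000 nested UnionRDDs.
val pieces = (0 until 8000).map(j => sc.parallelize(Seq(MatrixEntry(j, j, 1))))
val newY = sc.union(pieces)
newY.take(10)  // no deep lineage, so no StackOverflowError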

Alper t. Turker