
We are trying to assign the same executor and the same partitioner to our RDDs, so that there is no network traffic and shuffle operations like cogroup and join have no stage boundaries: all transformations complete within a single stage.
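The assumption behind this goal is worth spelling out: if both RDDs use the same partitioner, a given key lands in the same partition index on both sides, so cogroup can proceed partition-by-partition without moving data. A plain-Java sketch of that invariant (not Spark code; the non-negative modulo mirrors how a hash partitioner maps keys, stated here as an assumption):

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of hash partitioning:
// partition = nonNegativeMod(key.hashCode(), numPartitions).
public class CoPartitionSketch {
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        List<String> keys = Arrays.asList("a", "b", "c");
        // With the same partitioner applied to both RDDs, each key resolves
        // to the same partition index on both sides, so a cogroup can match
        // partitions pairwise with no shuffle.
        for (String key : keys) {
            int inRdd1 = partitionFor(key, numPartitions);
            int inRdd2 = partitionFor(key, numPartitions);
            System.out.println(key + " -> " + inRdd1 + " / " + inRdd2);
        }
    }
}
```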

To achieve this, we wrap the RDD in a custom Java RDD class (ExtendRDD.class) that overrides the getPreferredLocations function from RDD.class (in Scala):

    public Seq<String> getPreferredLocations(Partition split) {
        // Hosts in the cluster; partition i is pinned to hosts[i % hosts.size()].
        List<String> listString = new ArrayList<String>();
        listString.add("11.113.57.142");
        listString.add("11.113.57.163");
        listString.add("11.113.57.150");

        List<String> finalList = new ArrayList<String>();
        finalList.add(listString.get(split.index() % listString.size()));

        return scala.collection.JavaConversions.asScalaBuffer(finalList).toSeq();
    }
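Because the rule above depends only on the partition index, partition i of every RDD wrapped this way reports the same preferred host, which is what keeps matching partitions co-located. A plain-Java sketch of that placement rule (the IPs are the ones from the snippet above):

```java
import java.util.Arrays;
import java.util.List;

// Placement rule used by getPreferredLocations above:
// partition i is pinned to HOSTS.get(i % HOSTS.size()).
public class PreferredHostSketch {
    static final List<String> HOSTS =
        Arrays.asList("11.113.57.142", "11.113.57.163", "11.113.57.150");

    static String hostFor(int partitionIndex) {
        return HOSTS.get(partitionIndex % HOSTS.size());
    }

    public static void main(String[] args) {
        // Partitions cycle through the host list; partition 0 and partition 3
        // land on the same host, as do matching partitions of any two RDDs
        // wrapped with the same host list.
        for (int i = 0; i < 5; i++) {
            System.out.println("partition " + i + " -> " + hostFor(i));
        }
    }
}
```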

With this we are able to control which node in the cluster Spark places each RDD partition on. But the problem now is that, since the partitioners of these RDDs differ, Spark considers them shuffle-dependent and again creates multiple stages for the shuffle operations. We tried to override the partitioner() method of RDD.class in the same custom RDD:

    public Option<Partitioner> partitioner() {
        // Report our own partitioner so Spark sees the wrapped RDDs as co-partitioned.
        return new Some<Partitioner>(this.getPartitioner());
    }

For Spark to schedule them in the same stage, it must consider these RDDs to be partitioned by the same partitioner. Our partitioner() override does not seem to work: Spark still sees different partitioners for the two RDDs and creates multiple stages for the shuffle operations.
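One detail worth checking here: Spark compares the partitioners of the two inputs with equals(), not by object identity. The toy class below is a stand-in, not Spark's actual Partitioner, but it illustrates the equality contract a custom partitioner would have to honor for two distinct RDDs to be treated as co-partitioned (Spark's built-in HashPartitioner, for comparison, considers two instances equal when their partition counts match):

```java
import java.util.Objects;

// Simplified stand-in for a Spark Partitioner, illustrating the equals()
// contract used when deciding whether a cogroup needs a shuffle.
class FakeHashPartitioner {
    final int numPartitions;

    FakeHashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    @Override
    public boolean equals(Object o) {
        // Equality by class + partition count, not by object identity.
        return o instanceof FakeHashPartitioner
            && ((FakeHashPartitioner) o).numPartitions == numPartitions;
    }

    @Override
    public int hashCode() {
        return Objects.hash(numPartitions);
    }
}

public class PartitionerEquality {
    public static void main(String[] args) {
        FakeHashPartitioner p1 = new FakeHashPartitioner(4);
        FakeHashPartitioner p2 = new FakeHashPartitioner(4);
        // Two distinct instances still count as "the same partitioner"
        // because equals() compares partition counts.
        System.out.println(p1.equals(p2)); // true
        // A different partition count breaks co-partitioning.
        System.out.println(p1.equals(new FakeHashPartitioner(8))); // false
    }
}
```

If the custom partitioner returned by getPartitioner() inherits the default reference-equality equals(), two wrapped RDDs will never compare as co-partitioned even when they are configured identically.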

We wrap the Scala RDD with our custom RDD as follows:

ClassTag<String> tag = scala.reflect.ClassTag$.MODULE$.apply(String.class);
RDD<String> distFile1 = jsc.textFile("SomePath/data.txt",1);
ExtendRDD<String> extendRDD = new ExtendRDD<String>(distFile1, tag);

We create another custom RDD in the same way and derive a PairRDD (pairRDD2) from it. Then we try to apply the same partitioner as in the extendRDD object to the PairRDDFunctions object using partitionBy, and then apply cogroup to the result:

    RDD<Tuple2<String, String>> pairRDD = extendRDD.keyBy(new KeyByImpl());
    PairRDDFunctions<String, String> pair =
            new PairRDDFunctions<String, String>(pairRDD, tag, tag, null);
    // partitionBy returns a new RDD; the result must be captured and used.
    RDD<Tuple2<String, String>> partitioned = pair.partitionBy(extendRDD2.getPartitioner());
    new PairRDDFunctions<String, String>(partitioned, tag, tag, null).cogroup(pairRDD2);

None of this seems to work: Spark still creates multiple stages when it encounters the cogroup transformation.

Any suggestions on how we can apply the same partitioner to both RDDs?

Aviral Kumar
