
My graph contains vertices with different property classes. I want to filter the vertices that have a specific property type and then sort them. Here is what my code looks like:

class VertexProperty()
case class Property1(val name: String, val servings: Int) extends VertexProperty
case class Property2(val description: String) extends VertexProperty

val vertexArray = Array(
(1L, Property1("propertyName",8)),
(2L, Property1("propertyName",4)),
(3L, Property2("description"))
)

val edgeArray = Array(
 Edge(1L, 2L, "step1"),
 Edge(1L, 3L, "step2")
 )

val vertexRDD: RDD[(Long, VertexProperty)] = sc.parallelize(vertexArray) 
val edgeRDD: RDD[Edge[String]] = sc.parallelize(edgeArray)
val graph: Graph[VertexProperty, String] = Graph(vertexRDD, edgeRDD)

I want to get only the vertices with Property1, and this code works fine:

val vertices = graph.vertices.filter{
  case (id, vp: Property1) => vp.name != ""
  case _ => false
}

This is the result:

(1L, Property1("propertyName",8)), (2L, Property1("propertyName",4))

Now, the problem is that I want to sort these vertices by "servings", which is the second parameter of the Property1 class. I can sort the result by vertex id:

vertices.collect().sortBy(_._1).foreach(println)

but this doesn't work:

vertices.collect().sortBy(_._2._2).foreach(println)
Nargis

1 Answer


Convert VertexProperty to a trait (or make the parent class Serializable):

sealed trait VertexProperty
case class Property1(name: String, servings: Int) extends VertexProperty
case class Property2(description: String) extends VertexProperty

Make sure that types match:

val vertexArray: Array[(Long, VertexProperty)] = Array(
  (1L, Property1("propertyName",8)),
  (2L, Property1("propertyName",4)),
  (3L, Property2("description"))
)

Use collect instead of filter:

val vertices: RDD[(Long, Property1)] = graph.vertices.collect {
  case (id, p @ Property1(name, _)) if name != "" => (id, p)
}

The resulting RDD will be an RDD[(Long, Property1)], and you can sort it by the fields of Property1.
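To see why the partial-function form of collect matters here, the following is a minimal sketch that uses a plain Scala Seq as a stand-in for the RDD (an assumption made only so it runs without Spark; the object name CollectSketch and the value names are hypothetical). Like the RDD version, collect on a Seq both filters and narrows the element type, so the servings field becomes accessible for sorting.

```scala
sealed trait VertexProperty
case class Property1(name: String, servings: Int) extends VertexProperty
case class Property2(description: String) extends VertexProperty

// Hypothetical stand-in: a Seq plays the role of graph.vertices here.
object CollectSketch {
  val vertexSeq: Seq[(Long, VertexProperty)] = Seq(
    (1L, Property1("propertyName", 8)),
    (2L, Property1("propertyName", 4)),
    (3L, Property2("description"))
  )

  // collect with a partial function keeps only the matching elements
  // and changes the element type from VertexProperty to Property1.
  val property1Only: Seq[(Long, Property1)] = vertexSeq.collect {
    case (id, p @ Property1(name, _)) if name != "" => (id, p)
  }

  // Because the static type is now Property1, servings can be used as the sort key.
  val byServings: Seq[(Long, Property1)] = property1Only.sortBy(_._2.servings)

  def main(args: Array[String]): Unit =
    byServings.foreach(println)
    // prints (2,Property1(propertyName,4)) then (1,Property1(propertyName,8))
}
```

With filter, the element type would stay (Long, VertexProperty), and _._2.servings would not compile.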

Note:

  1. It might not work in the REPL without additional tweaks. See "Case class equality in Apache Spark" and follow the instructions there if necessary.

  2. collect { } behaves differently from collect(). The former returns an RDD containing all matching values after applying the partial function f, whereas the latter collects all elements of the RDD and returns them to the driver as an array.

  3. You cannot sortBy(_._2._2), because Property1 is not a Tuple2 and has no _2 member; it only has name and servings. There is also no need to collect before sorting:

    vertices.sortBy(_._2.servings)
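Putting the notes above together, the pipeline could be sketched as follows. This assumes a local Spark session with GraphX on the classpath; the object name SortByServings and the helpers buildGraph and sortedProperty1 are hypothetical names introduced for illustration. It shows collect { } and sortBy running on the RDD, with collect() called only at the end to bring the sorted result to the driver.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

sealed trait VertexProperty
case class Property1(name: String, servings: Int) extends VertexProperty
case class Property2(description: String) extends VertexProperty

object SortByServings {
  // Builds the example graph from the question (hypothetical helper).
  def buildGraph(sc: SparkContext): Graph[VertexProperty, String] = {
    val vertexRDD: RDD[(Long, VertexProperty)] = sc.parallelize(Seq[(Long, VertexProperty)](
      (1L, Property1("propertyName", 8)),
      (2L, Property1("propertyName", 4)),
      (3L, Property2("description"))
    ))
    val edgeRDD: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "step1"),
      Edge(1L, 3L, "step2")
    ))
    Graph(vertexRDD, edgeRDD)
  }

  // collect { } returns a new RDD narrowed to Property1; sortBy also stays on the RDD.
  def sortedProperty1(graph: Graph[VertexProperty, String]): RDD[(Long, Property1)] =
    graph.vertices
      .collect { case (id, p @ Property1(name, _)) if name != "" => (id, p) }
      .sortBy(_._2.servings)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder().master("local[*]").appName("sort-by-servings").getOrCreate()
    try {
      // collect() here moves the already sorted elements to the driver as an Array.
      sortedProperty1(buildGraph(spark.sparkContext)).collect().foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

For descending order, sortBy accepts an ascending flag, for example sortBy(_._2.servings, ascending = false).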
    
Alper t. Turker