I have a data set with a sequence column that contains 0s and 1s:
Category Value Sequences
1 10 0
1 11 1
1 13 1
1 16 1
1 20 0
1 21 0
1 22 1
1 25 1
1 27 1
1 29 1
1 30 0
1 32 1
1 34 1
1 35 1
1 38 0
The Sequences column contains three separate runs of consecutive 1s. For each run, I need to sum the corresponding Value entries.
I'm trying to do this with the code below:
%livy2.spark
import org.apache.spark.rdd.RDD

val df = df.select($"Category", $"Value", $"Sequences").rdd
  .groupBy(x => x.getInt(0))
  .map(x => {
    val Category = x(0).getInt(0)
    val Value = x(0).getInt(1)
    val Sequences = x(0).getInt(2)
    for (i <- x.indices) {
      val vi = x(i).getFloat(4)
      if (vi(0) > 0) {
        summing += Value
      }
      (Category, summing)
    }
  })

df_new.take(10).foreach(println)
When I run this code, I get an error stating "incomplete statement". The value df represents the data set I gave at the start.
The expected output is:
Category summing
1 40
1 103
1 101
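To make the intended logic concrete, here is a plain-Scala sketch (no Spark) of the run-summing I am after, applied to the sample rows. The names rows, tagged, and runSums are just illustrative:

```scala
// Plain-Scala sketch of the run-summing logic on the sample data.
// Each tuple is (Category, Value, Sequences).
val rows = List(
  (1, 10, 0), (1, 11, 1), (1, 13, 1), (1, 16, 1), (1, 20, 0),
  (1, 21, 0), (1, 22, 1), (1, 25, 1), (1, 27, 1), (1, 29, 1),
  (1, 30, 0), (1, 32, 1), (1, 34, 1), (1, 35, 1), (1, 38, 0)
)

// Tag every row with a run id that increases whenever the 0/1 flag changes,
// so each block of consecutive 1s gets its own id.
val tagged = rows.tail.scanLeft((rows.head, 0)) {
  case (((_, _, prevFlag), runId), row @ (_, _, flag)) =>
    (row, if (flag != prevFlag) runId + 1 else runId)
}

// Keep only the rows whose flag is 1 and sum the values within each run.
val runSums = tagged
  .collect { case ((cat, value, 1), runId) => ((cat, runId), value) }
  .groupBy(_._1)
  .toList
  .sortBy { case ((_, runId), _) => runId }
  .map { case ((cat, _), vs) => (cat, vs.map(_._2).sum) }

// runSums == List((1,40), (1,103), (1,101))
```

This matches the expected output above: 11+13+16 = 40, 22+25+27+29 = 103, 32+34+35 = 101.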
I don't know where I'm going wrong. It would be great if someone could help me learn how to do this.
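For what it's worth, I suspect groupBy on the RDD loses the row order the runs depend on, and that a window-based "run id" might be the right direction instead. Here is an untested sketch of that idea using the DataFrame API; it assumes the rows can be ordered by Value (the sample data is sorted that way) and that the notebook provides spark.implicits._ for the $ syntax:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Ordering by Value is an assumption based on the sample data being sorted.
val w = Window.partitionBy($"Category").orderBy($"Value")

val result = df
  // Flag rows where Sequences differs from the previous row's value.
  .withColumn("change",
    when(lag($"Sequences", 1, 0).over(w) =!= $"Sequences", 1).otherwise(0))
  // A running sum of the change flags gives each consecutive run its own id.
  .withColumn("runId", sum($"change").over(w))
  // Keep the runs of 1s and sum the values inside each run.
  .filter($"Sequences" === 1)
  .groupBy($"Category", $"runId")
  .agg(sum($"Value").as("summing"))
  .orderBy($"runId")
  .select($"Category", $"summing")

result.show()
```

Is this the right track, or is there a simpler way to express it?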