I would suggest you define a User Defined Aggregation Function
(UDAF
)
Using inbuilt functions
are great ways but they are difficult to be customized. If you own a UDAF
then it is customizable and you can edit it according to your needs.
Concerning your problem, following can be your solution. You can edit it according to your needs.
First task is to define a UDAF
class PingJiang extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("group_a", StringType).add("group_b", StringType)
def bufferSchema = new StructType().add("buff0", StringType).add("buff1", StringType)
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, "")
buffer.update(1, "")
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val buff = buffer.getString(0)
val groupa = input.getString(0)
val groupb = input.getString(1)
if(!groupa.equalsIgnoreCase("unknown")){
buffer.update(0, groupa)
}
if(!groupb.equalsIgnoreCase("unknown")){
buffer.update(1, groupb)
}
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getString(0)+buffer2.getString(0)
val buff2 = buffer1.getString(1)+buffer2.getString(1)
buffer1.update(0, buff1+","+buff2)
}
def evaluate(buffer: Row) : String = {
buffer.getString(0)
}
}
Then you call it from your main
class and do some manipulations to get the result you need as
val data = Seq(
(1, "3", "unknown"),
(1, "unknown", "4"),
(2, "unknown", "3"),
(2, "2", "unknown"))
.toDF("uid", "group_a", "group_b")
val udaf = new PingJiang()
val result = data.groupBy("uid").agg(udaf($"group_a", $"group_b").as("ping"))
.withColumn("group_a", split($"ping", ",")(0))
.withColumn("group_b", split($"ping", ",")(1))
.drop("ping")
result.show(false)
Visit databricks and augmentiq for better understanding of UDAF
Note : The above solution gets you the latest value for each group if present (You can always edit according to your needs)