mock spark column functions in scala

Question

My code uses the monotonically_increasing_id function in Scala:

    val df = List(("oleg"), ("maxim")).toDF("first_name")
        .withColumn("row_id", monotonically_increasing_id)

I want to mock it in my unit test so that it returns integers 0, 1, 2, 3, ...

In my spark-shell it returns the desired result.

    scala> df.show
    +----------+------+
    |first_name|row_id|
    +----------+------+
    |      oleg|     0|
    |     maxim|     1|
    +----------+------+

But in my Scala application the results are different.

How can I mock column functions?

I'm not sure I understand your question. Could you provide more details, such as the code snippet you are trying to test? — Oli, Mar 28 '19 at 14:07
They say [Don't mock type you don't own](https://stackoverflow.com/q/47409533) - and this sounds like a good example why you shouldn't - there is no guarantee that monotonically_increasing_id will ever return consecutive numbers. — user10938362, Mar 28 '19 at 14:16
@user10958683 I saw this answer. It contradicts the very idea of mocking: https://stackoverflow.com/questions/2665812/what-is-mocking . Besides, I don't care whether the numbers are consecutive in my production code, as long as they are unique. I need predictable results in my unit tests. — Oleg Pavliv, Mar 28 '19 at 14:19

score 2 · Answer 1 · answered Mar 28 '19 at 17:25

Mocking such a function so that it produces a consecutive sequence is not simple. Spark is a parallel computing engine: rows are processed partition by partition, so assigning IDs in one global sequence takes extra work.

Here is a solution you could try.

Let's define a function that zips a dataframe:

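    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.monotonically_increasing_id

    def zip(df : DataFrame, name : String) = {
        df.withColumn(name, monotonically_increasing_id())
    }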

Then let's rewrite the function we want to test so that it takes the zip function as a parameter, with zip as the default:

    def fun(df : DataFrame,
            zipFun : (DataFrame, String) => DataFrame = zip) : DataFrame = {
        zipFun(df, "id_row")
    }
    // let's see what it does
    fun(spark.range(5).toDF).show()
    +---+----------+
    | id|    id_row|
    +---+----------+
    |  0|         0|
    |  1|         1|
    |  2|8589934592|
    |  3|8589934593|
    |  4|8589934594|
    +---+----------+
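
The jump to 8589934592 in this output follows from how monotonically_increasing_id is documented to build its values: the partition index goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below models that layout with plain Scala, without Spark. The names IdLayout, encode, and decode are hypothetical and used only for illustration.

```scala
// Plain-Scala model of the documented ID layout (not Spark's actual code):
// upper 31 bits = partition index, lower 33 bits = record number in that partition.
object IdLayout {
  val RecordBits = 33

  // Build an ID from a partition index and a record number.
  def encode(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << RecordBits) + recordNumber

  // Split an ID back into (partition index, record number).
  def decode(id: Long): (Int, Long) =
    ((id >>> RecordBits).toInt, id & ((1L << RecordBits) - 1))
}
```

Under this model, IdLayout.encode(1, 0L) is 8589934592 (2^33), which matches the first row of the second partition in the output above. The values are unique and increasing, but only consecutive within a single partition.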

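With the default zip, fun behaves exactly like calling monotonically_increasing_id directly: the IDs are unique but jump at the partition boundary. Now let's write a second function that uses zipWithIndex from the RDD API. It's a bit tedious because we have to go back and forth between the two APIs.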

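    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField}

    def zip2(df : DataFrame, name : String) = {
        // zipWithIndex assigns consecutive Long indices 0, 1, 2, ... across partitions
        val rdd = df.rdd.zipWithIndex
            .map{ case (row, i) => Row.fromSeq(row.toSeq :+ i) }
        val newSchema = df.schema.add(StructField(name, LongType, false))
        df.sparkSession.createDataFrame(rdd, newSchema)
    }
    fun(spark.range(5).toDF, zip2).show()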
    +---+------+
    | id|id_row|
    +---+------+
    |  0|     0|
    |  1|     1|
    |  2|     2|
    |  3|     3|
    |  4|     4|
    +---+------+

You can adapt zip2, for instance by multiplying i by 2, to produce whatever sequence your test expects.
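
To show how passing zip2 gives predictable IDs in a unit test, here is a minimal sketch. It repeats zip, zip2, and fun so that it compiles on its own. The object name ZipInjectionCheck, the local[2] master, and the plain assert (rather than a test framework) are assumptions for illustration, and it needs Spark on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.types.{LongType, StructField}

object ZipInjectionCheck {
  // Production default: unique but non-consecutive IDs.
  def zip(df: DataFrame, name: String): DataFrame =
    df.withColumn(name, monotonically_increasing_id())

  // Test replacement: consecutive IDs via RDD.zipWithIndex.
  def zip2(df: DataFrame, name: String): DataFrame = {
    val rdd = df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
    val newSchema = df.schema.add(StructField(name, LongType, nullable = false))
    df.sparkSession.createDataFrame(rdd, newSchema)
  }

  // Function under test: the ID strategy is a parameter with a default.
  def fun(df: DataFrame, zipFun: (DataFrame, String) => DataFrame = zip): DataFrame =
    zipFun(df, "id_row")

  def main(args: Array[String]): Unit = {
    // Hypothetical local session used only for this check.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("zip-injection-check")
      .getOrCreate()
    try {
      val ids = fun(spark.range(5).toDF(), zip2)
        .collect()
        .map(_.getAs[Long]("id_row"))
        .sorted
      // With zip2 injected, the IDs are consecutive regardless of partitioning.
      assert(ids.sameElements(Array(0L, 1L, 2L, 3L, 4L)))
    } finally spark.stop()
  }
}
```

Production code keeps calling fun(df) and gets the default zip, while tests pass zip2 explicitly, so nothing in Spark itself has to be mocked.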

Thank you, you may check my workaround which I posted in a separate answer — Oleg Pavliv, Mar 28 '19 at 19:33

score 0 · Answer 2 · answered Mar 28 '19 at 19:32

Based on the answer from @Oli, I came up with the following workaround:

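    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

    val df = List(("oleg"), ("maxim")).toDF("first_name")
        .withColumn("row_id", monotonically_increasing_id)
        .withColumn("test_id", row_number().over(Window.orderBy("row_id")))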

It solves my problem but I'm still interested in mocking column functions.
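
As a rough illustration of what the window expression does to the generated IDs, the sketch below models it with plain Scala collections: sort by row_id, then number the rows. It is not Spark code, and the names RowNumberModel and rowNumbers are hypothetical.

```scala
// Plain-collections model of row_number().over(Window.orderBy("row_id")):
// order the generated IDs, then number the rows starting at 1.
object RowNumberModel {
  def rowNumbers(rowIds: Seq[Long]): Seq[(Long, Int)] =
    rowIds.sorted.zipWithIndex.map { case (id, i) => (id, i + 1) }
}
```

For example, RowNumberModel.rowNumbers(Seq(8589934592L, 0L, 8589934593L, 1L)) pairs the IDs with 1, 2, 3, 4 in ascending ID order. Two things are worth keeping in mind with the real window function: row_number is 1-based, so subtract 1 if the test expects 0, 1, 2, ...; and a window with orderBy but no partitionBy makes Spark move all rows into a single partition, which is fine for small test data but can be slow on large inputs.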

score 0 · Answer 3 · answered May 07 '21 at 21:41

I mock my Spark functions with this code:

    val s = typedLit[Timestamp](Timestamp.valueOf("2021-05-07 15:00:46.394"))
    implicit val ds = DefaultAnswer(CALLS_REAL_METHODS)
    withObjectMocked[functions.type] {
        when(functions.current_timestamp()).thenReturn(s)
        // spark logic
    }
