
I'm working in Spark and, to use the Matrix class from the Jama library, I need to convert the contents of a spark.sql.DataFrame to a 2D array, i.e., Array[Array[Double]].

While I've found quite a few solutions for converting a single column of a dataframe to an array, I don't understand how to

  1. transform an entire dataframe into a 2D array (that is, an array of arrays);
  2. cast its contents from Long to Double while doing so.

The reason is that I need to load the contents of the dataframe into a Jama matrix, which requires a 2D array of Doubles as input:

val matrix_transport = new Matrix(df_transport)

<console>:83: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: Array[Array[Double]]
       val matrix_transport = new Matrix(df_transport)

EDIT: for completeness, the df schema is:

df_transport.printSchema

root
 |-- 1_51501_19962: long (nullable = true)
 |-- 1_51501_26708: long (nullable = true)
 |-- 1_51501_36708: long (nullable = true)
 |-- 1_51501_6708: long (nullable = true)
...

with 165 columns of identical type long.

kilgoretrout
  • What is the schema of your dataframe? In general you are going to need to transform the rows, then collect them, since Jama is going to expect all of your data to be on the driver node, which may cause you problems depending on the size of your matrix. – Ryan Widmaier Feb 18 '19 at 16:57
  • all columns are of type `long (nullable = true)`. Size shouldn't be a problem, it's a 165x165 square matrix. – kilgoretrout Feb 18 '19 at 17:04

1 Answer


Here is the rough code to do it. That said, I don't think Spark provides any guarantee on the order in which it returns rows, so building the matrix from data distributed across the cluster may run into ordering issues.

import org.apache.spark.sql.functions.{array, col}
import spark.implicits._ // needed for .toDF on a local Seq

val df = Seq(
    (10L, 11L, 12L),
    (13L, 14L, 15L),
    (16L, 17L, 18L)
).toDF("c1", "c2", "c3")

// Group all columns into a single array column
val rowDF = df.select(array(df.columns.map(col): _*) as "row")

// Pull the data back to the driver and convert each Row to an Array[Long]
val mat = rowDF.collect.map(_.getSeq[Long](0).toArray)

// Cast the Long values to Double
val matDouble = mat.map(_.map(_.toDouble))
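With the data collected and cast, the Jama matrix can then be built directly; a minimal sketch, relying on Jama's `Matrix(double[][])` constructor (a Scala `Array[Array[Double]]` maps onto Java's `double[][]`):

import Jama.Matrix

// matDouble is Array[Array[Double]], which the Matrix constructor accepts
val matrix_transport = new Matrix(matDouble)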
Ryan Widmaier
  • Thanks! Indeed `coalesce` will not preserve row order. I worked around that by adding an id column and later sorting the array by it, as follows: `df = df.withColumn("id",monotonicallyIncreasingId)` `[follow instructions in solution]` `val matDouble_sorted = matDouble.sortBy(_(num_cols))` – kilgoretrout Feb 19 '19 at 11:28
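For reference, here is one way the comment's order-preserving fix could be stitched together end to end. A sketch only, assuming the snake_case `monotonically_increasing_id()` function and that the id column lands in the last slot of each collected row:

import org.apache.spark.sql.functions.{array, col, monotonically_increasing_id}

// Tag each row with a stable id before anything can reorder it
val dfWithId = df.withColumn("id", monotonically_increasing_id())

// Pack every column (the id ends up last) into one array column
val rowDF = dfWithId.select(array(dfWithId.columns.map(col): _*) as "row")

// Collect, cast to Double, sort by the id slot, then drop the id
val numCols = df.columns.length
val matDouble_sorted = rowDF.collect
  .map(_.getSeq[Long](0).map(_.toDouble).toArray)
  .sortBy(_(numCols))
  .map(_.dropRight(1))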