2

I have a dataframe:

data.show()
+--------+------+------------------+
|   Count|  mean|             stdev|
+--------+------+------------------+
|       5|  6337| 1684.569470220803|
|       3|  7224| 567.8250904401182|
|     330| 20280|23954.260831863092|
|      42| 26586|  32957.9072313323|
...
|      49| 23422|21244.094701798418|
|       4| 36949| 8616.596311769514|
|      35| 20915|14971.559603562522|
|      33| 20874|16657.756963894684|
|      14| 22698|15416.614921307082|
|      25| 19100| 12342.11627585264|
|      27| 21879|21363.736895687238|
+--------+------+------------------+

Without using Hive, I want to get the first quartile, second quartile and the IQR (interquartile range) for column "mean".

Other solutions seem to use Hive which everyone might not have access to.

Hive Solution 1

Hive Solution 2

Solution for Python

Community
  • 1
  • 1
Jeremy
  • 530
  • 8
  • 19

1 Answers1

1

I'd like to first note that this seems to be a pretty expensive solution but I get precisely what I want wihtout using Hive. If you are able to use Hive definitely do it because it couldn't be any easier.

I ended up using commons-math3 jar. The trick to using it was getting the data out of the dataframe and into an array for consumption by the math3 library. I solved that from HERE. You may have to play with the "asInstanceOf" based on the datatype of the column.

import org.apache.commons.math3.stat.descriptive._

// Turn dataframe column into an Array[Long]
val mean = data.select("mean").rdd.map(row => row(0).asInstanceOf[Long]).collect()

// Create the math3 object and add values from the
// mean array to the descriptive statistics array
val arrMean = new DescriptiveStatistics()
genericArrayOps(mean).foreach(v => arrMean.addValue(v))

// Get first and third quartiles and then calc IQR
val meanQ1 = arrMean.getPercentile(25)
val meanQ3 = arrMean.getPercentile(75)
val meanIQR = meanQ3 - meanQ1
Community
  • 1
  • 1
Jeremy
  • 530
  • 8
  • 19