-3

I have a Spark (Scala) DataFrame "Marketing" with approximately 17 columns, one of which is "Balance". The data type of this column is Int. I need to find the median Balance. I can sort the column in ascending order, but how do I proceed after that? I was given a hint that the percentile function in Scala can be used, but I am not familiar with it. Can anyone help?

  • Hello and welcome to StackOverflow. Please take some time to read the help page, especially the sections named ["What topics can I ask about here?"](http://stackoverflow.com/help/on-topic) and ["What types of questions should I avoid asking?"](http://stackoverflow.com/help/dont-ask). And more importantly, please read [the Stack Overflow question checklist](http://meta.stackexchange.com/q/156810/204922). You might also want to learn about [Minimal, Complete, and Verifiable Examples](http://stackoverflow.com/help/mcve). – sarveshseri Apr 05 '17 at 10:23

1 Answer

0

The median is the same thing as the 50th percentile. If you do not mind using Hive functions, you can compute it like this:

marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")

If you do not need an exact figure you can look into using percentile_approx() instead.
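To make the two options concrete, here is a minimal sketch of both. It assumes Spark 2.1 or later, where `percentile` is available as a built-in SQL aggregate without Hive support, and a non-empty DataFrame. The object name `MedianExample`, the helper names, the sample data, and the relative error of 0.01 are illustrative assumptions, not part of the original question. For the approximate case the sketch uses `DataFrameStatFunctions.approxQuantile` rather than `percentile_approx()`, because it always returns a `Double`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object for illustration only.
object MedianExample {

  // Exact median: the 50th percentile computed by the SQL `percentile` aggregate.
  // Assumes the DataFrame is non-empty (an empty one would yield null here).
  def exactMedian(df: DataFrame, column: String): Double =
    df.selectExpr(s"percentile(CAST($column AS BIGINT), 0.5) AS median")
      .first()
      .getDouble(0)

  // Approximate median: trades accuracy for speed on large data.
  // relativeError = 0.01 is an arbitrary choice for this sketch.
  def approxMedian(df: DataFrame, column: String, relativeError: Double = 0.01): Double =
    df.stat.approxQuantile(column, Array(0.5), relativeError).head

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MedianExample")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the "Marketing" DataFrame, with only the Balance column.
    val marketingDF = Seq(10, 20, 30, 40, 50).toDF("Balance")

    println(s"Exact median:       ${exactMedian(marketingDF, "Balance")}")
    println(s"Approximate median: ${approxMedian(marketingDF, "Balance")}")

    spark.stop()
  }
}
```

The exact version is the simpler choice when the data is small. On a large DataFrame the approximate version is usually cheaper, and a smaller `relativeError` gives a result closer to the exact one at higher cost.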

Both functions are documented in the Hive language manual, in the section on built-in aggregate functions (UDAFs).

Nils