I have a Spark (Scala) DataFrame, "Marketing", with about 17 columns, one of which is "Balance". The data type of this column is Int. I need to find the median Balance. I can get as far as sorting it in ascending order, but how do I proceed from there? I was given a hint that the percentile function can be used from Scala, but I don't know anything about this function. Can anyone help?
Hello and welcome to StackOverflow. Please take some time to read the help page, especially the sections named ["What topics can I ask about here?"](http://stackoverflow.com/help/on-topic) and ["What types of questions should I avoid asking?"](http://stackoverflow.com/help/dont-ask). And more importantly, please read [the Stack Overflow question checklist](http://meta.stackexchange.com/q/156810/204922). You might also want to learn about [Minimal, Complete, and Verifiable Examples](http://stackoverflow.com/help/mcve). – sarveshseri Apr 05 '17 at 10:23
1 Answer
The median is the same thing as the 50th percentile. If you do not mind using Hive functions, you can do the following:
marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
If you do not need an exact figure you can look into using percentile_approx() instead.
Both functions are documented in the Hive Language Manual, in the section on built-in aggregate functions (UDAFs).
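To make this concrete, here is a minimal sketch of both approaches. It assumes Spark 2.1 or later, where percentile and percentile_approx should be available as built-in SQL functions (on older versions you would likely need a Hive-enabled session). The object name MedianBalance, the helper names, and the five-row sample DataFrame are made up for illustration and stand in for your real Marketing data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MedianBalance {

  // Exact median via the percentile aggregate (0.5 = 50th percentile).
  // The result column is a double; this assumes the DataFrame is non-empty.
  def exactMedian(df: DataFrame): Double =
    df.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
      .first()
      .getDouble(0)

  // Approximate median via the DataFrame API. The last argument is the
  // relative error: smaller values are more accurate but more expensive.
  def approxMedian(df: DataFrame): Double =
    df.stat.approxQuantile("Balance", Array(0.5), 0.01).head

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("median-balance")
      .master("local[*]") // local mode, for experimenting only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the real Marketing DataFrame
    val marketingDF = Seq(10, 20, 30, 40, 50).toDF("Balance")

    println(s"Exact median:       ${exactMedian(marketingDF)}")
    println(s"Approximate median: ${approxMedian(marketingDF)}")

    // The SQL-expression form of the approximate version
    marketingDF.selectExpr("percentile_approx(Balance, 0.5) AS median").show()

    spark.stop()
  }
}
```

The exact version has to see every value, so on a very large table the approximate one is usually the cheaper choice.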
Nils