Python 3.4 calculating the mode,median reading through a file

Question

I was wondering if there is another way to code this. At its core, the easiest way to solve the problem is to read the file and save the values in a list, so that you'd have:

import statistics
a = [1,2,3,4,5,6,1,1,1,1]
listMode = statistics.mode(a)  # likewise statistics.median(a), statistics.mean(a)

Instead of saving these values in a list (which costs memory, as the data can be quite large), I was wondering whether I could calculate the mode on the fly as I read the file, updating a single value every time I read a new line, i.e. incrementally calculate the mode, median and average. At the end I'd have a = [mode, median, average].

I can't see such simple operations taking very long even on a very large data set, so I see no reason to calculate "on the fly" rather than doing it all at the end. – TheLazyScripter, Sep 07 '16 at 05:03
How "incremental" is required? You will need to store at least a value and a count for each unique value in the data set if the file is to be read only once. If it is acceptable to read the file up to the same number of times as the number of values in the file, required storage would go down but the execution time would go up steeply. – Simon, Sep 07 '16 at 05:04
I have a deeply nested dictionary and would prefer not to have a huge list. By the same token, I want to do this for a number of variables, and it would make life cleaner. If performance becomes a significant issue, then that makes the option clear. However, I was unsuccessful in making mode and median work. Average was a lot easier and showed an improvement. – FancyDolphin, Sep 07 '16 at 05:07

score 3 · Answer 1 · answered Sep 07 '16 at 04:56

If the set of input numbers comes from a reasonably small universe of values, as in your example, you could use a Counter to count how many of each value you see as they pass by. From that Counter you can get the mode easily, and the median with a little work. Calculating the average on the fly is easy, doesn't need the Counter: just keep a running total and a running count.
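The approach above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name `streaming_stats` is hypothetical, and it assumes each line holds one number. Memory grows with the number of distinct values rather than with the number of lines. If several values tie for the highest count, `most_common` returns just one of them, whereas `statistics.mode` in Python 3.4 raises an error in that situation.

```python
from collections import Counter


def streaming_stats(lines):
    """Return (mode, median, mean) for an iterable of numeric text lines.

    Hypothetical helper: assumes one number per line and skips blank lines.
    """
    counts = Counter()  # value -> number of times seen
    total = 0.0         # running total for the mean
    n = 0               # running count for the mean

    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = float(line)
        counts[value] += 1
        total += value
        n += 1

    if n == 0:
        raise ValueError("no data")

    mode = counts.most_common(1)[0][0]
    mean = total / n

    # Median: walk the distinct values in sorted order, accumulating counts
    # until the middle position(s) of the sorted data are covered.
    lo_idx = (n - 1) // 2  # lower middle index (0-based)
    hi_idx = n // 2        # upper middle index (same as lo_idx when n is odd)
    lo_val = hi_val = None
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if lo_val is None and seen > lo_idx:
            lo_val = value
        if seen > hi_idx:
            hi_val = value
            break
    median = (lo_val + hi_val) / 2

    return mode, median, mean
```

Because an open file object is itself an iterable of lines, it can be passed in directly, e.g. `with open(path) as f: mode, median, mean = streaming_stats(f)`, where `path` stands for your own file.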

D-Von

It is quite large; that was just a simple example. Counter would be an improvement, but an incremental approach would be preferable. Agreed, the average is trivial, even a weighted average. – FancyDolphin Sep 07 '16 at 04:59
I don't see any hope of getting an exact mode without counting each value. As soon as you drop a count for some value, an adversary can screw you over by generating a bunch of entries with that value. However, this article talks about calculating an approximate mode: http://stackoverflow.com/questions/1058813/on-line-iterator-algorithms-for-estimating-statistical-median-mode-skewnes – D-Von Sep 07 '16 at 05:12
If you know something about the distribution of your data, you could bucket it and count the number of entries that fall in each bucket. Then you can get an approximate mode by using the midpoint of the heaviest bucket. – D-Von Sep 07 '16 at 05:16
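A minimal sketch of the bucketing idea from the comment above, assuming equal-width buckets. The function name `approximate_mode` and the `bucket_width` parameter are made up for illustration, and a sensible width depends on what you know about the data's distribution. Memory is proportional to the number of non-empty buckets rather than the number of values.

```python
from collections import Counter


def approximate_mode(values, bucket_width):
    """Return the midpoint of the bucket that holds the most values.

    Hypothetical helper: bucket i covers [i * bucket_width, (i + 1) * bucket_width).
    """
    buckets = Counter()  # bucket index -> number of values in that bucket
    for v in values:
        buckets[int(v // bucket_width)] += 1

    if not buckets:
        raise ValueError("no data")

    heaviest = buckets.most_common(1)[0][0]
    return (heaviest + 0.5) * bucket_width
```

For example, with the values 12, 14, 15, 31, 18 and a width of 10, four values fall in the bucket [10, 20), so the estimate is its midpoint, 15.0.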
Unfortunately this doesn't work in this case; otherwise an approximation would be fine. – FancyDolphin Sep 07 '16 at 05:19
