9

I'm looking for a statistics package for Perl (CPAN is fine) that allows me to add data incrementally instead of having to pass in an entire array of data.

Just the mean, median, stddev, max, and min are necessary, nothing too complicated.

The reason is that my dataset is far too large to fit into memory. The data source is a MySQL database, so right now I'm querying a subset of the data, computing the statistics for it, and then combining the manageable subsets later.

If you have other ideas on how to overcome this issue, I'd be much obliged!

brian d foy
Han

7 Answers

5

You cannot compute an exact stddev or median unless you either keep the whole thing in memory or run through the data twice.

UPDATE While you cannot do an exact stddev IN ONE PASS, there's an approximate one-pass algorithm; the link is in a comment to this answer.

The rest are completely trivial (no need for a module) to do in 3-5 lines of Perl. Stddev/median can be done fairly trivially in two passes as well. (I just rolled out a script that did exactly what you described, but for IP reasons I'm pretty sure I'm not allowed to post it as an example, sorry.)

Sample code:

my ($min, $max);
my $sum = 0;
my $count = 0;
while (<>) {
    chomp;
    my $current_value = $_; # assume input is 1 value/line for simplicity's sake
    $sum += $current_value;
    $count++;
    $min = $current_value if (!defined $min || $min > $current_value);
    $max = $current_value if (!defined $max || $max < $current_value);
}
my $mean = $sum * 1.0 / $count;
my $sum_mean_diffs = 0;

# Second pass to compute stddev (use it for the median too).
# NB: re-open the input first -- <> was exhausted by the first loop.
while (<>) {
    chomp;
    my $current_value = $_;
    $sum_mean_diffs += ($current_value - $mean) * ($current_value - $mean);
}
my $std_dev = sqrt($sum_mean_diffs / $count);
# Median is left as an exercise for the reader.
DVK
  • re `$sum * 1.0 / $count`, Perl doesn't do typed operations the way C does; the *1.0 is unnecessary. – ysth Sep 10 '09 at 17:41
  • This is not true. You can calculate a streaming mean and std. dev. Details of the algorithm can be found on Wikipedia: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm – Jeremy Wall Sep 10 '09 at 18:06
  • +1 to last comment. Although the formulas in that Wiki are approximations only, still, Wow-cool! – DVK Sep 10 '09 at 18:59
4

Statistics::Descriptive::Discrete allows you to do this in a manner similar to Statistics::Descriptive, but it has been optimized for use with large data sets. (The documentation reports an improvement of two orders of magnitude (100x) in memory usage, for example.)
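A minimal sketch of how that incremental use might look (the sample values here are made up; in practice you would call add_data once per row as you fetch it from the database):

```perl
use strict;
use warnings;
use Statistics::Descriptive::Discrete;

my $stats = Statistics::Descriptive::Discrete->new;

# Feed values in as they arrive (e.g. row by row from a DBI loop);
# only distinct values and their counts are held in memory.
$stats->add_data($_) for (1, 2, 2, 3, 3, 3);

printf "mean=%g median=%g stddev=%g min=%g max=%g\n",
    $stats->mean, $stats->median, $stats->standard_deviation,
    $stats->min, $stats->max;
```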

Adam Bellaire
  • Be careful – this module (as per POD) is optimized for discretized data, e.g. data from an A/D conversion that has a discrete set of possible values. E.g. if your data is produced by an 8-bit A/D, you'd have only 256 possible values in your data set. Even though you might have a million data points, you'd only have 256 different values in those million points. – DVK Sep 10 '09 at 15:24
4

Why don't you simply ask the database for the values you are trying to compute?

Amongst others, MySQL features GROUP BY (Aggregate) functions. For missing functions, all you need is a little SQL.
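For instance (the table and column names below are made up for illustration), most of the requested statistics come out of a single aggregate query; note that MySQL has no built-in MEDIAN() aggregate, so that one is the "little SQL" you'd still have to write:

```sql
SELECT COUNT(*)      AS n,
       AVG(value)    AS mean,
       STDDEV(value) AS stddev,
       MIN(value)    AS min_value,
       MAX(value)    AS max_value
FROM   samples;
```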

innaM
  • While it may be OK for his project, calculating statistics is often strongly recommended to be offloaded to application code, because it is much easier to scale up app-server resources than DB-server resources. – DVK Sep 10 '09 at 17:16
  • @DVK: I'm not sure I buy that objection. Surely the resources involved in exporting the *data* necessary for the app server to do its own statistics are far greater than allowing the DB server to calculate them itself? – Adam Bellaire Sep 10 '09 at 18:58
4

PDL might provide a solution:

Have a look at this previous SO answer, which shows how to get the mean, std dev, etc.

The relevant part of the code is repeated here:

use strict;
use warnings;
use PDL;

my $figs = pdl [
    [0.01, 0.01, 0.02, 0.04, 0.03],
    [0.00, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.03, 0.02],
    [0.01, 0.00, 0.01, 0.05, 0.03],
];

my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );
draegtun
1

@DVK: The one-pass algorithms for calculating the mean and standard deviation here http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm are not approximations, and they are more numerically robust than the example you give. See the references on that page.
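For reference, a minimal pure-Perl sketch of that one-pass (Welford) algorithm; the @stream array below is a stand-in for values arriving one at a time (e.g. rows fetched from MySQL), and min/max tracking is added for completeness:

```perl
use strict;
use warnings;

my @stream = (2, 4, 4, 4, 5, 5, 7, 9);  # stand-in for a streaming data source

my ($n, $mean, $m2) = (0, 0, 0);
my ($min, $max);

for my $x (@stream) {
    $n++;
    my $delta = $x - $mean;
    $mean += $delta / $n;            # running mean
    $m2   += $delta * ($x - $mean);  # running sum of squared deviations
    $min = $x if !defined $min || $x < $min;
    $max = $x if !defined $max || $x > $max;
}

my $variance = $n ? $m2 / $n : 0;  # population variance; divide by ($n - 1) for sample
my $std_dev  = sqrt($variance);

printf "n=%d mean=%g stddev=%g min=%g max=%g\n", $n, $mean, $std_dev, $min, $max;
# prints: n=8 mean=5 stddev=2 min=2 max=9
```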

mabraham
0

This is largely untested, so use with care. Since I have bad memory, I checked the algorithm against Wikipedia. I'm not aware of an algorithm to calculate the median from a stream of numbers, but that doesn't mean there isn't one.

#!/usr/bin/perl
use strict;
use warnings;

use MooseX::Declare;

class SimpleStats {
  has 'min'       => (is => 'rw', isa => 'Num', default => 9**9**9);
  has 'max'       => (is => 'rw', isa => 'Num', default => -9**9**9);
  has 'A'         => (is => 'rw', isa => 'Num', default => 0);
  has 'Q'         => (is => 'rw', isa => 'Num', default => 0);
  has 'n'         => (is => 'rw', isa => 'Int', default => 0);
  has 'n_nonzero' => (is => 'rw', isa => 'Int', default => 0);
  has 'sum_w'     => (is => 'rw', isa => 'Num', default => 0); # Num: weights need not be integers

  method add (Num $x, Num $w = 1) {
    $self->min($x) if $x < $self->min;
    $self->max($x) if $x > $self->max;
    my $n = $self->n;
    if ($n == 0) {
      $self->A($x);
      $self->sum_w($w);
    }
    else {
      my $A = $self->A;
      my $Q = $self->Q;
      my $sum_w_before = $self->sum_w;
      $self->sum_w($sum_w_before+$w);
      $self->A($A + ($x-$A) * $w/$self->sum_w);
      $self->Q($Q + $w*($x-$A)*($x-$self->A));
    }
    $self->n($n+1);
    $self->n_nonzero($self->n_nonzero+1) if $w != 0;
    return();
  }

  method mean () { $self->A }

  method sample_variance () {
    $self->Q * $self->n_nonzero() /
    ( ($self->n_nonzero-1) * $self->sum_w )
  }

  method std_variance () { $self->Q / $self->sum_w }
  method std_dev      () { sqrt($self->std_variance) }

  # slightly evil. Just don't reuse objects
  method reset () { %$self = %{__PACKAGE__->new()} }
}

package main;

my $stats = SimpleStats->new;

while (<STDIN>) {
  s/^\s+//;
  s/\s+$//;
  my ($x, $w) = split /\s+/, $_;
  if (defined $w) {
    $stats->add($x, $w);
  } else {
    $stats->add($x);
  }
}

print "Mean:        ", $stats->mean, "\n";
print "Sample var:  ", $stats->sample_variance, "\n";
print "Std var:     ", $stats->std_variance, "\n";
print "Std dev:     ", $stats->std_dev, "\n";
print "Entries:     ", $stats->n, "\n";
print "Min:         ", $stats->min, "\n";
print "Max:         ", $stats->max, "\n";
tsee
0

I understand it's 4 years down the road, but in case anyone is interested, there is now a module for memory-efficient approximate statistical analysis of samples. Its interface generally follows that of Statistics::Descriptive and co.

It divides the sample into logarithmic intervals and keeps only hit counts. Thus a fixed relative error is introduced (the precision can be adjusted in new()), but large amounts of data can be processed without using much memory.

Dallaylaen