Standard Deviation for SQLite

Question

I've searched the SQLite docs and couldn't find anything, but I've also searched on Google and a few results appeared.

Does SQLite have any built-in Standard Deviation function?

score 27 · Answer 1 · edited Jan 20 '14 at 19:54

27

You can calculate the variance in SQL:

create table t (row int);
insert into t values (1),(2),(3);
SELECT AVG((t.row - sub.a) * (t.row - sub.a)) as var from t, 
    (SELECT AVG(row) AS a FROM t) AS sub;
0.666666666666667

However, you still have to calculate the square root to get the standard deviation.

edited Jan 20 '14 at 19:54

Jess

20,424
18
108
130

answered Jan 06 '11 at 23:53

Anonymous

279
3
2

You can also `group by` using this technique... `SELECT AVG((t.num - sub.a) * (t.num - sub.a)) as var from t, (SELECT name, AVG(t.num) AS a FROM t group by name) AS sub where t.name = sub.name group by sub.name` – Jess Jan 20 '14 at 22:06
3

It should be noted that this algorithm calculates a biased version of the variance. Variance should always be calculated as $\frac{1}{n-1} \sum(X_i - \bar{X})$, but you are dividing by $n$ rather than $n-1$. As $n \rightarrow \infty$ this difference becomes negligible, but people will wonder why your variances don't match up with any standard software... – ahfoss Apr 13 '14 at 16:29
2

Sorry, I thought this would render latex. In plain english, this way of calculating variance is incorrect, but not absolutely horrible. For a large number of observations the difference between this version and the usual method will be rather small, but you will be left having to explain why your variances don't match standard software output. You can tell a statistician you are using the biased version of variance, but I've found less quantitatively inclined people are uncomfortable if your variances don't match standard software. – ahfoss Apr 13 '14 at 16:34

score 16 · Answer 2 · answered Feb 19 '10 at 17:35

16

The aggregate functions supported by SQLite are here:

http://www.sqlite.org/lang_aggfunc.html

STDEV is not in the list.

However, the module extension-functions.c in this page contains a STDEV function.

answered Feb 19 '10 at 17:35

Robert Harvey

168,684
43
314
475

alex.forencich · Answer 3 · 2014-12-17T01:30:44.133

15

There is still no built-in stdev function in sqlite. However, you can define (as Alix has done) a user-defined aggregator function. Here is a complete example in Python:

import sqlite3
import math

class StdevFunc:
    def __init__(self):
        self.M = 0.0
        self.S = 0.0
        self.k = 1

    def step(self, value):
        if value is None:
            return
        tM = self.M
        self.M += (value - tM) / self.k
        self.S += (value - tM) * (value - self.M)
        self.k += 1

    def finalize(self):
        if self.k < 3:
            return None
        return math.sqrt(self.S / (self.k-2))

with sqlite3.connect(':memory:') as con:

    con.create_aggregate("stdev", 1, StdevFunc)

    cur = con.cursor()

    cur.execute("create table test(i)")
    cur.executemany("insert into test(i) values (?)", [(1,), (2,), (3,), (4,), (5,)])
    cur.execute("insert into test(i) values (null)")
    cur.execute("select avg(i) from test")
    print("avg: %f" % cur.fetchone()[0])
    cur.execute("select stdev(i) from test")
    print("stdev: %f" % cur.fetchone()[0])

This will print:

avg: 3.000000
stdev: 1.581139

Compare with MySQL: http://sqlfiddle.com/#!2/ad42f3/3/0

edited Dec 17 '14 at 01:30

answered Jun 26 '14 at 05:57

alex.forencich

1,187
8
16

1

I'm the OP in case you haven't noticed. – Alix Axel Jun 26 '14 at 05:59
Ah, I didn't realize you answered your own quesion. – alex.forencich Jun 26 '14 at 08:42
Very useful solution! But I think there is a small mistake somewhere because the standard deviation of [1, 2, 3, 4, 5] is 1.41 and not 1.58 – user2660966 Mar 24 '17 at 14:48
There are two versions of standard deviation. This one matches the MySQL implementation. You can edit it for the other version of stdev if you like. I think you would need to change the self.k-2 to self.k-1 . – alex.forencich Mar 24 '17 at 15:50
1

I tried to implement the stddev function in your method, but i got error "user-defined aggregate's 'step' method raised error", any advice? – Derrick.X Jul 04 '18 at 01:35

score 4 · Accepted Answer · edited May 23 '17 at 12:25

4

I implemented the Welford's method (the same as extension-functions.c) as a SQLite UDF:

$db->sqliteCreateAggregate('stdev',
    function (&$context, $row, $data) // step callback
    {
        if (isset($context) !== true) // $context is null at first
        {
            $context = array
            (
                'k' => 0,
                'm' => 0,
                's' => 0,
            );
        }

        if (isset($data) === true) // the standard is non-NULL values only
        {
            $context['s'] += ($data - $context['m']) * ($data - ($context['m'] += ($data - $context['m']) / ++$context['k']));
        }

        return $context;
    },
    function (&$context, $row) // fini callback
    {
        if ($context['k'] > 0) // return NULL if no non-NULL values exist
        {
            return sqrt($context['s'] / $context['k']);
        }

        return null;
    },
1);

That's in PHP ($db is the PDO object) but it should be trivial to port to another language.

SQLite is soooo cool. <3

edited May 23 '17 at 12:25

Community

1
1

answered Aug 27 '13 at 16:03

Alix Axel

141,486
84
375
483

I believe you want to divide by n-1 at the end, not n. That is, use `($context['k'] - 1)`. The post you linked to regarding Welford's method has been updated also to reflect this correction. – ahfoss Apr 13 '14 at 18:00
1

@ahfoss: Note that the other post starts k at 1 whereas I start it at 0. He also uses the post-increment operator whereas I use the pre-increment one. – Alix Axel Apr 14 '14 at 01:33
That other post had an off-by-one error. It is now fixed, but your post has the same issue. – alex.forencich Jun 26 '14 at 05:51
@alex.forencich: Isn't the same thing ahfoss mentioned? – Alix Axel Jun 26 '14 at 05:56
1

Possibly. Depends on which version of stdev you want to implement. In Excel, STDEV(1,2,3,4,5) = 1.58. Your code would return 1.41. – alex.forencich Jun 26 '14 at 08:48
This is the stdev_pop version. – Alix Axel Oct 19 '14 at 13:55

score 4 · Answer 5 · answered Nov 04 '19 at 14:39

4

Use variance formula V(X) = E(X^2) - E(X)^2. In SQL sqlite

SELECT AVG(col*col) - AVG(col)*AVG(col) FROM table

To get standard deviation you need to take the square root V(X)^(1/2)

answered Nov 04 '19 at 14:39

Don Juan

41
1

Good shortcut, but that variance formula is only asymptotically correct. Does not really replace a proper implementation. – JuliettVictor Sep 19 '20 at 21:24
This worked for me. Reconciled with MS Excel. – Scott Feb 21 '21 at 06:05

score 3 · Answer 6 · edited May 18 '16 at 07:49

3

a little trick

select ((sum(value)*sum(value) - sum(value * value))/((count(*)-1)*(count(*)))) 
from the_table ;

then the only thing left is to calculate sqrt outside.

edited May 18 '16 at 07:49

Thomas G

9,222
7
24
36

answered May 18 '16 at 07:37

user293074

41
4

3

I think you got it wrong. For values 1 and 3, the stdev² should be 2. Yours compute 3. But if I understand what you've tried to do (inline the computation of the average), it should be something like this: `SELECT (SUM(value*value) - SUM(value)*SUM(value)/COUNT(*)) / (COUNT(*)-1)`. – Celelibi Sep 17 '16 at 02:30
oops, typo! yeah, `select ((count(*)*sum(value * value) - (sum(value)*sum(value))/((count(*)-1)*(count(*)))) from ... ;` == yours. thanks~ – user293074 Oct 25 '16 at 02:40

score 2 · Answer 7 · answered Feb 19 '10 at 17:35

2

No, I searched this same issue, and ended having to do the calculations with my application (PHP)

answered Feb 19 '10 at 17:35

tpow

6,962
11
55
83

Rick Shory · Answer 8 · 2014-10-19T12:23:31.983

added some error detection in the python functions

class StdevFunc:
    """
    For use as an aggregate function in SQLite
    """
    def __init__(self):
        self.M = 0.0
        self.S = 0.0
        self.k = 0

    def step(self, value):
        try:
            # automatically convert text to float, like the rest of SQLite
            val = float(value) # if fails, skips this iteration, which also ignores nulls
            tM = self.M
            self.k += 1
            self.M += ((val - tM) / self.k)
            self.S += ((val - tM) * (val - self.M))
        except:
            pass

    def finalize(self):
        if self.k <= 1: # avoid division by zero
            return none
        else:
            return math.sqrt(self.S / (self.k-1))

score 0 · Answer 9 · answered Aug 29 '19 at 21:48

You don't state which version of standard deviation you wish to calculate but variances (standard deviation squared) for either version can be calculated using a combination of the sum() and count() aggregate functions.

select  
(count(val)*sum(val*val) - (sum(val)*sum(val)))/((count(val)-1)*(count(val))) as sample_variance,
(count(val)*sum(val*val) - (sum(val)*sum(val)))/((count(val))*(count(val))) as population_variance
from ... ;

It will still be necessary to take the square root of these to obtain the standard deviation.

score -3 · Answer 10 · answered Aug 07 '16 at 02:06

#!/usr/bin/python
# -*- coding: utf-8 -*-
#Values produced by this script can be verified by follwing the steps
#found at https://support.microsoft.com/en-us/kb/213930 to Verify
#by chosing a non memory based database.
import sqlite3
import math
import random
import os
import sys
import traceback
import random

class StdevFunc:
    def __init__(self):
        self.M = 0.0    #Mean
        self.V = 0.0    #Used to Calculate Variance
        self.S = 0.0    #Standard Deviation
        self.k = 1      #Population or Small 

    def step(self, value):
        try:
            if value is None:
                return None

            tM = self.M
            self.M += (value - tM) / self.k
            self.V += (value - tM) * (value - self.M)
            self.k += 1
        except Exception as EXStep:
            pass
            return None    

     def finalize(self):
        try:
            if ((self.k - 1) < 3):
                return None

            #Now with our range Calculated, and Multiplied finish the Variance Calculation
            self.V = (self.V / (self.k-2))

            #Standard Deviation is the Square Root of Variance
            self.S = math.sqrt(self.V)

            return self.S
        except Exception as EXFinal:
            pass
            return None 

def Histogram(Population):
    try:
        BinCount = 6 
        More = 0

        #a = 1          #For testing Trapping
        #b = 0          #and Trace Back
        #c = (a / b)    #with Detailed Info

        #If you want to store the Database
        #uncDatabase = os.path.join(os.getcwd(),"BellCurve.db3")
        #con = sqlite3.connect(uncDatabase)

        #If you want the database in Memory
        con = sqlite3.connect(':memory:')    

        #row_factory allows accessing fields by Row and Col Name
        con.row_factory = sqlite3.Row

        #Add our Non Persistent, Runtime Standard Deviation Function to the Database
        con.create_aggregate("Stdev", 1, StdevFunc)

        #Lets Grab a Cursor
        cur = con.cursor()

        #Lets Initialize some tables, so each run with be clear of previous run
        cur.executescript('drop table if exists MyData;') #executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS MyData('ID' INTEGER PRIMARY KEY   AUTOINCREMENT, 'Val' FLOAT)")
        cur.executescript('drop table if exists Bins;')   #executescript requires ; at the end of the string
        cur.execute("create table IF NOT EXISTS Bins('ID' INTEGER PRIMARY KEY   AUTOINCREMENT, 'Bin' UNSIGNED INTEGER, 'Val' FLOAT, 'Frequency' UNSIGNED BIG INT)")

        #Lets generate some random data, and insert in to the Database
        for n in range(0,(Population)):
            sql = "insert into MyData(Val) values ({0})".format(random.uniform(-1,1))
            #If Whole Number Integer greater that value of 2, Range Greater that 1.5
            #sql = "insert into MyData(Val) values ({0})".format(random.randint(-1,1))
            cur.execute(sql)
            pass

        #Now let’s calculate some built in Aggregates, that SQLite comes with
        cur.execute("select Avg(Val) from MyData")
        Average = cur.fetchone()[0]
        cur.execute("select Max(Val) from MyData")
        Max = cur.fetchone()[0]
        cur.execute("select Min(Val) from MyData")
        Min = cur.fetchone()[0]
        cur.execute("select Count(Val) from MyData")
        Records = cur.fetchone()[0]

        #Now let’s get Standard Deviation using our function that we added
        cur.execute("select Stdev(Val) from MyData")
        Stdev = cur.fetchone()[0]

        #And Calculate Range
        Range = float(abs(float(Max)-float(Min)))

        if (Stdev == None):
            print("================================   Data Error ===============================")
            print("                 Insufficient Population Size, Or Bad Data.")   
            print("*****************************************************************************")
        elif (abs(Max-Min) == 0):
            print("================================   Data Error ===============================")
            print(" The entire Population Contains Identical values, Distribution Incalculable.")
            print("******************************************************************************")            
        else:  
            Bin = []        #Holds the Bin Values
            Frequency = []  #Holds the Bin Frequency for each Bin

            #Establish the 1st Bin, which is based on (Standard Deviation * 3) being subtracted from the Mean
            Bin.append(float((Average - ((3 * Stdev)))))
            Frequency.append(0)

            #Establish the remaining Bins, which is basically adding 1 Standard Deviation
            #for each interation, -3, -2, -1, 1, 2, 3             
            for b in range(0,(BinCount) + 1):
                Bin.append((float(Bin[(b)]) + Stdev))
                Frequency.append(0)

            for b in range(0,(BinCount) + 1):
                #Lets exploit the Database and have it do the hard work calculating distribution
                #of all the Bins, with SQL's between operator, but making it left inclusive, right exclusive.
                sqlBinFreq = "select count(*) as Frequency from MyData where val between {0} and {1} and Val < {2}". \
                             format(float((Bin[b])), float(Bin[(b + 1)]), float(Bin[(b + 1)]))

                #If the Database Reports Values that fall between the Current Bin, Store the Frequency to a Bins Table. 
                for rowBinFreq in cur.execute(sqlBinFreq):
                    Frequency[(b + 1)] = rowBinFreq['Frequency']
                    sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                                   format(b, float(Bin[b]), Frequency[(b)])
                    cur.execute(sqlBinFreqInsert)

                #Allthough this Demo is not likley produce values that
                #fall outside of Standard Distribution
               #if this demo was to Calculate with real data, we want to know
               #how many non-Standard data points we have. 
               More = (More + Frequency[b])

            More = abs((Records - More))

            #Add the More value
            sqlBinFreqInsert = "insert into Bins (Bin, Val, Frequency) values ({0}, {1}, {2})". \
                            format((BinCount + 1), float(0), More)
            cur.execute(sqlBinFreqInsert)

            #Now Report the Analysis
            print("================================ The Population ==============================")
            print("             {0} {1} {2} {3} {4} {5}". \
              format("Size".rjust(10, ' '), \
                     "Max".rjust(10, ' '), \
                     "Min".rjust(10, ' '), \
                     "Mean".rjust(10, ' '), \
                     "Range".rjust(10, ' '), \
                     "Stdev".rjust(10, ' ')))
            print("Aggregates:  {0:10d} {1:10.4f} {2:10.4f} {3:10.4f} {4:10.4f} {5:10.4f}". \
              format(Population, Max, Min, Average, Range, Stdev))
             print("================================= The Bell Curve =============================")  

            LabelString = "{0} {1}  {2}  {3}". \
                      format("Bin".ljust(8, ' '), \
                             "Ranges".rjust(8, ' '), \
                             "Frequency".rjust(8, ' '), \
                             "Histogram".rjust(6, ' '))

            print(LabelString)
            print("------------------------------------------------------------------------------")

            #Let's Paint a Histogram
            sqlChart = "select * from Bins order by Bin asc"
            for rowChart in cur.execute(sqlChart):
                if (rowChart['Bin'] == 7):
                    #Bin 7 is not really a bin, but where we place the values that did not fit into the
                    #Normal Distribution. This script was tested against Excel's Bell Curve Example
                    #https://support.microsoft.com/en-us/kb/213930
                    #and produces the same results. Feel free to test it.
                    BinName = "More"
                    ChartString = "{0:<6} {1:<10} {2:10.0f}". \
                            format(BinName, \
                                    "", \
                                    More)
                else:
                    #Theses are the actual bins where values fall within the distribution.
                    BinName = (rowChart['Bin'] + 1)
                    #Scale the Chart
                    fPercent = ((float(rowChart['Frequency']) / float(Records) * 100))
                    iPrecent = int(math.ceil(fPercent))

                    ChartString = "{0:<6} {1:10.4f} {2:10.0f}  {3}". \
                              format(BinName, \
                                     rowChart['Val'], \
                                     rowChart['Frequency'], \
                                     "".rjust(iPrecent, '#'))
                print(ChartString)

            print("******************************************************************************")

            #Commit to Database
            con.commit()

            #Clean Up
            cur.close()
            con.close()

    except Exception as EXBellCurve:
        pass
        TraceInfo = traceback.format_exc()       
        raise Exception(TraceInfo)

The question "Does SQLite have any built-in Standard Deviation function?" how does this answer that question? — avanderw, May 28 '20 at 08:13

Standard Deviation for SQLite

10 Answers10

Linked