25

I am trying to find the frequency of each symbol in any given text using an O(n) algorithm. My algorithm looks like this:

s = len(text) 
P = 1.0/s 
freqs = {} 
for char in text: 
    try: 
       freqs[char]+=P 
    except: 
       freqs[char]=P 

but I doubt that this dictionary-based method is fast enough, because it depends on the underlying implementation of the dictionary methods. Is this the fastest method?

UPDATE: there is no increase in speed if collections and integers are used. This is because the algorithm already has O(n) complexity, so no essential speedup is possible.

For example, results for 1MB text:

without collections:
real    0m0.695s

with collections:
real    0m0.625s
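
For reference, a minimal sketch of how the two variants might be compared with timeit (the file name and function names are illustrative, not from the original post):

import collections, timeit

text = open('1mb.txt').read()

def with_floats(text):
    # try/except with float increments, as in the question
    P = 1.0 / len(text)
    freqs = {}
    for char in text:
        try:
            freqs[char] += P
        except KeyError:
            freqs[char] = P
    return freqs

def with_ints(text):
    # integer counts in a defaultdict
    freqs = collections.defaultdict(int)
    for char in text:
        freqs[char] += 1
    return freqs

print(timeit.timeit(lambda: with_floats(text), number=10))
print(timeit.timeit(lambda: with_ints(text), number=10))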
psihodelia
  • Dictionary operations use hashes and are O(1). How can it possibly be "not fast enough"? What do you mean by "fast enough"? What have you measured? What is your goal? – S.Lott Mar 26 '10 at 10:06
  • 13
    Why on earth are you using floating point for this? – S.Lott Mar 26 '10 at 10:08
  • 2
    @psihodelia Philosoraptor once said, use an Integer for Integer mathematics. – orokusaki Mar 26 '10 at 16:33

12 Answers

47

Performance comparison

Note: time in the table doesn't include the time it takes to load files.

| approach       | american-english, time (s) | big.txt, time (s) | time relative to defaultdict |
|----------------+----------------------------+-------------------+------------------------------|
| Counter        |                      0.451 |             3.367 |                          3.6 |
| setdefault     |                      0.348 |             2.320 |                          2.5 |
| list           |                      0.277 |             1.822 |                            2 |
| try/except     |                      0.158 |             1.068 |                          1.2 |
| defaultdict    |                      0.141 |             0.925 |                            1 |
| numpy          |                      0.012 |             0.076 |                        0.082 |
| S.Mark's ext.  |                      0.003 |             0.019 |                        0.021 |
| ext. in Cython |                      0.001 |             0.008 |                       0.0086 |

The files used: '/usr/share/dict/american-english' and 'big.txt'.

The script that compares 'Counter', 'setdefault', 'list', 'try/except', 'defaultdict', 'numpy', 'cython' -based, and @S.Mark's solutions is at http://gist.github.com/347000

The fastest solution is a Python extension written in Cython:

import cython

@cython.locals(
    chars=unicode,
    i=cython.Py_ssize_t,
    L=cython.Py_ssize_t[0x10000])
def countchars_cython(chars):
    for i in range(0x10000): # unicode code points > 0xffff are not supported
        L[i] = 0

    for c in chars:
        L[c] += 1

    return {unichr(i): L[i] for i in range(0x10000) if L[i]}
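
To build it, a minimal setup.py along these lines should work (a sketch; it assumes the Cython code above is saved as countchars_cython.pyx and a Cython version that provides Cython.Build.cythonize):

# setup.py -- build sketch for the Cython extension above
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("countchars_cython.pyx"))

and then build it in place with python setup.py build_ext --inplace.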

Previous comparison:

* python (dict) : 0.5  seconds
* python (list) : 0.5  (ascii) (0.2 if read whole file in memory)
* perl          : 0.5
* python (numpy): 0.07 
* c++           : 0.05
* c             : 0.008 (ascii)

Input data:

$ tail /usr/share/dict/american-english
éclat's
élan
élan's
émigré
émigrés
épée
épées
étude
étude's
études

$ du -h /usr/share/dict/american-english
912K    /usr/share/dict/american-english

python (Counter): 0.5 seconds

#!/usr/bin/env python3.1
import collections, fileinput, textwrap

chars = (ch for word in fileinput.input() for ch in word.rstrip())
# faster (0.4s) but less flexible: chars = open(filename).read()
print(textwrap.fill(str(collections.Counter(chars)), width=79))

Run it:

$ time -p python3.1 count_char.py /usr/share/dict/american-english
Counter({'e': 87823, 's': 86620, 'i': 66548, 'a': 62778, 'n': 56696, 'r':
56286, 't': 51588, 'o': 48425, 'l': 39914, 'c': 30020, 'd': 28068, 'u': 25810,
"'": 24511, 'g': 22262, 'p': 20917, 'm': 20747, 'h': 18453, 'b': 14137, 'y':
12367, 'f': 10049, 'k': 7800, 'v': 7573, 'w': 6924, 'z': 3088, 'x': 2082, 'M':
1686, 'C': 1549, 'S': 1515, 'q': 1447, 'B': 1387, 'j': 1376, 'A': 1345, 'P':
974, 'L': 912, 'H': 860, 'T': 858, 'G': 811, 'D': 809, 'R': 749, 'K': 656, 'E':
618, 'J': 539, 'N': 531, 'W': 507, 'F': 502, 'O': 354, 'I': 344, 'V': 330, 'Z':
150, 'Y': 140, 'é': 128, 'U': 117, 'Q': 63, 'X': 42, 'è': 29, 'ö': 12, 'ü': 12,
'ó': 10, 'á': 10, 'ä': 7, 'ê': 6, 'â': 6, 'ñ': 6, 'ç': 4, 'å': 3, 'û': 3, 'í':
2, 'ô': 2, 'Å': 1})
real 0.44
user 0.43
sys 0.01

perl: 0.5 seconds

time -p perl -MData::Dumper -F'' -lanwe'$c{$_}++ for (@F);
END{ $Data::Dumper::Terse = 1; $Data::Dumper::Indent = 0; print Dumper(\%c) }
' /usr/share/dict/american-english

Output:

{'S' => 1515,'K' => 656,'' => 29,'d' => 28068,'Y' => 140,'E' => 618,'y' => 12367,'g' => 22262,'e' => 87823,'' => 2,'J' => 539,'' => 241,'' => 3,'' => 6,'' => 4,'' => 128,'D' => 809,'q' => 1447,'b' => 14137,'z' => 3088,'w' => 6924,'Q' => 63,'' => 10,'M' => 1686,'C' => 1549,'' => 10,'L' => 912,'X' => 42,'P' => 974,'' => 12,'\'' => 24511,'' => 6,'a' => 62778,'T' => 858,'N' => 531,'j' => 1376,'Z' => 150,'u' => 25810,'k' => 7800,'t' => 51588,'' => 6,'W' => 507,'v' => 7573,'s' => 86620,'B' => 1387,'H' => 860,'c' => 30020,'' => 12,'I' => 344,'' => 3,'G' => 811,'U' => 117,'F' => 502,'' => 2,'r' => 56286,'x' => 2082,'V' => 330,'h' => 18453,'f' => 10049,'' => 1,'i' => 66548,'A' => 1345,'O' => 354,'n' => 56696,'m' => 20747,'l' => 39914,'' => 7,'p' => 20917,'R' => 749,'o' => 48425}
real 0.51
user 0.49
sys 0.02

python (numpy): 0.07 seconds

Based on Ants Aasma's answer (modified to support unicode):

#!/usr/bin/env python
import codecs, itertools, operator, sys
import numpy

filename = sys.argv[1] if len(sys.argv)>1 else '/usr/share/dict/american-english'

# ucs2 or ucs4 python?
dtype = {2: numpy.uint16, 4: numpy.uint32}[len(buffer(u"u"))]

# count ordinals
text = codecs.open(filename, encoding='utf-8').read()
a = numpy.frombuffer(text, dtype=dtype)
counts = numpy.bincount(a)

# pretty print
counts = [(unichr(i), v) for i, v in enumerate(counts) if v]
counts.sort(key=operator.itemgetter(1))
print ' '.join('("%s" %d)' % c for c in counts  if c[0] not in ' \t\n')

Output:

("Å" 1) ("í" 2) ("ô" 2) ("å" 3) ("û" 3) ("ç" 4) ("â" 6) ("ê" 6) ("ñ" 6) ("ä" 7) ("á" 10) ("ó" 10) ("ö" 12) ("ü" 12) ("è" 29) ("X" 42) ("Q" 63) ("U" 117) ("é" 128) ("Y" 140) ("Z" 150) ("V" 330) ("I" 344) ("O" 354) ("F" 502) ("W" 507) ("N" 531) ("J" 539) ("E" 618) ("K" 656) ("R" 749) ("D" 809) ("G" 811) ("T" 858) ("H" 860) ("L" 912) ("P" 974) ("A" 1345) ("j" 1376) ("B" 1387) ("q" 1447) ("S" 1515) ("C" 1549) ("M" 1686) ("x" 2082) ("z" 3088) ("w" 6924) ("v" 7573) ("k" 7800) ("f" 10049) ("y" 12367) ("b" 14137) ("h" 18453) ("m" 20747) ("p" 20917) ("g" 22262) ("'" 24511) ("u" 25810) ("d" 28068) ("c" 30020) ("l" 39914) ("o" 48425) ("t" 51588) ("r" 56286) ("n" 56696) ("a" 62778) ("i" 66548) ("s" 86620) ("e" 87823)
real 0.07
user 0.06
sys 0.01

c++: 0.05 seconds

// $ g++ *.cc -lboost_program_options 
// $ ./a.out /usr/share/dict/american-english    
#include <iostream>
#include <fstream>
#include <cstdlib> // exit

#include <boost/program_options/detail/utf8_codecvt_facet.hpp>
#include <boost/tr1/unordered_map.hpp>
#include <boost/foreach.hpp>

int main(int argc, char* argv[]) {
  using namespace std;

  // open input file
  if (argc != 2) {
    cerr << "Usage: " << argv[0] << " <filename>\n";
    exit(2);
  }
  wifstream f(argv[argc-1]); 

  // assume the file has utf-8 encoding
  locale utf8_locale(locale(""), 
      new boost::program_options::detail::utf8_codecvt_facet);
  f.imbue(utf8_locale); 

  // count characters frequencies
  typedef std::tr1::unordered_map<wchar_t, size_t> hashtable_t;  
  hashtable_t counts;
  for (wchar_t ch; f >> ch; )
    counts[ch]++;
  
  // print result
  wofstream of("output.utf8");
  of.imbue(utf8_locale);
  BOOST_FOREACH(hashtable_t::value_type i, counts) 
    of << "(" << i.first << " " << i.second << ") ";
  of << endl;
}

Result:

$ cat output.utf8 
(í 2) (O 354) (P 974) (Q 63) (R 749) (S 1,515) (ñ 6) (T 858) (U 117) (ó 10) (ô 2) (V 330) (W 507) (X 42) (ö 12) (Y 140) (Z 150) (û 3) (ü 12) (a 62,778) (b 14,137) (c 30,020) (d 28,068) (e 87,823) (f 10,049) (g 22,262) (h 18,453) (i 66,548) (j 1,376) (k 7,800) (l 39,914) (m 20,747) (n 56,696) (o 48,425) (p 20,917) (q 1,447) (r 56,286) (s 86,620) (t 51,588) (u 25,810) (Å 1) (' 24,511) (v 7,573) (w 6,924) (x 2,082) (y 12,367) (z 3,088) (A 1,345) (B 1,387) (C 1,549) (á 10) (â 6) (D 809) (E 618) (F 502) (ä 7) (å 3) (G 811) (H 860) (ç 4) (I 344) (J 539) (è 29) (K 656) (é 128) (ê 6) (L 912) (M 1,686) (N 531)

c (ascii): 0.0079 seconds

// $ gcc -O3 cc_ascii.c -o cc_ascii && time -p ./cc_ascii < input.txt
#include <stdio.h>

enum { N = 256 };
size_t counts[N];

int main(void) {
  // count characters
  int ch = -1;
  while((ch = getchar()) != EOF)
    ++counts[ch];
  
  // print result
  size_t i = 0;
  for (; i < N; ++i) 
    if (counts[i])
      printf("('%c' %zu) ", (int)i, counts[i]);
  return 0;
}
jfs
  • 4
    +1: Wow! that's quite a thorough comparison, with interesting approaches! – Eric O Lebigot Mar 27 '10 at 08:36
  • @J.F. I've posted my C Char Counter Extension too, could you include/test with your benchmark too? http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2532564#2532564 – YOU Mar 28 '10 at 10:44
  • @S.Mark: I've compared 'numpy'-based variant and yours (I've called it 'smark') Complete programs have the same time for 'american-english' (90ms). But profiler shows that 'smark' 4 times faster than 'numpy' (only the counting part without the loading file and converting it to unicode parts). For big.txt: 'numpy' - 170ms, 'smark' - 130ms, cc_ascii(fgets) - 30ms ('numpy' is 3 times slower than 'smark' if we disregard reading, decoding file). – jfs Mar 28 '10 at 19:45
  • @J.F. , Thanks for including mine and great comparison chart. I think If you move char_counter, and char_list to module level variable, and malloc-ating at PyMODINIT_FUNC (its one time allocating when import module), and compiled with gcc -O3, it could get cc_ascii level speed I think. – YOU Mar 29 '10 at 01:37
  • @S.Mark: static variables in C are initialized only once. Take a look http://gist.github.com/347279 There is no point to move `char_counter` and `char_list` to the global level. – jfs Mar 29 '10 at 02:21
  • @J.F. Ah ok, you're right, I missed `!char_counter &&` part you added. – YOU Mar 29 '10 at 02:36
  • @S.Mark: btw, `-O3` flag has no effect (same 0.13 seconds for big.txt). I've build the extension with the command: `$ CFLAGS=-O3 python setup.py build --force`. – jfs Mar 29 '10 at 21:10
  • @J.F. I see, So, the only difference now is, I didn't count the time for malloc in my test and OS diff (I've tested on Windows, but I don't think Windows is faster for any reason for C codes). Anyway, I think its fast enough, and thanks for the lot of tests, cheers! – YOU Mar 30 '10 at 09:26
  • shouldn't `counts` be a local variable? – Elazar Jun 24 '13 at 22:38
  • In Cython example, in `countchars_cython`, where you do `for i in range(0x10000): L[i] = 0` I would do: `L = [0]*0x10000` Does that work for Cython? – Aaron Hall May 12 '14 at 04:23
  • @AaronHall: yes, you can use `[0]*0x10000` in Cython because it is a (almost) superset of Python. No, you can't use it in this case because `L` is a C array (non-L-value) – jfs May 12 '14 at 08:59
16

How about avoiding float operations inside the loop and doing the division after everything is done?

That way, you could just do +1 every time, and it should be faster.

Better yet, use collections.defaultdict, as S.Lott advised:

import collections

freqs = collections.defaultdict(int)

for char in text:
    freqs[char] += 1
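
If relative frequencies are needed, as in the question, the division can be done once after the loop, e.g. (a sketch):

total = float(len(text))
freqs = dict((char, count / total) for char, count in freqs.items())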

Or you may want to try collections.Counter in Python 2.7+:

>>> collections.Counter("xyzabcxyz")
Counter({'y': 2, 'x': 2, 'z': 2, 'a': 1, 'c': 1, 'b': 1})

Or you may try psyco, which does just-in-time compilation for Python. You have loops, so I think you would get some performance gain with psyco.
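
For example, a minimal way to enable it (a sketch; psyco only works on 32-bit CPython 2.x, and the function name is illustrative):

import collections
import psyco
psyco.full()  # JIT-compile subsequent function calls

def count_chars(text):
    freqs = collections.defaultdict(int)
    for char in text:
        freqs[char] += 1
    return freqs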


Edit 1:

I did some benchmarks based on big.txt (~6.5 MB), which is used in Peter Norvig's spelling corrector.

Text Length: 6488666

dict.get : 11.9060001373 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....

if char in dict : 9.71799993515 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....

dict try/catch : 7.35899996758 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....

collections.default : 7.29699993134 s
93 chars defaultdict(<type 'int'>, {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8, ....

CPU Specs: 1.6GHz Intel Mobile Atom CPU

According to that, dict.get is the slowest and collections.defaultdict is the fastest; try/except is also among the fast ones.
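
For reference, a sketch of the two variants above that are not shown as code elsewhere in this thread (variable names are illustrative):

# dict.get
freqs = {}
for char in text:
    freqs[char] = freqs.get(char, 0) + 1

# "if char in dict"
freqs = {}
for char in text:
    if char in freqs:
        freqs[char] += 1
    else:
        freqs[char] = 1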


Edit 2:

Added collections.Counter benchmarks; it's slower than dict.get and took 15 seconds on my laptop:

collections.Counter : 15.3439998627 s
93 chars Counter({u' ': 1036511, u'e': 628234, u't': 444459, u'a': 395872, u'o': 382683, u'n': 362397, u'i': 348464,
YOU
  • Very good advice! It should also improve numerical precision. – psihodelia Mar 26 '10 at 09:45
  • Please use `collections.defaultdict` instead of this. – S.Lott Mar 26 '10 at 10:08
  • 1
    Instead of the lambda expression you could use collections.defaultdict(int) to initialize freqs – Peter Hoffmann Mar 26 '10 at 10:31
  • Perhaps, it would be nice if you could remove the example with bare except altogether. It is an extremely bad way to do things in almost (that's, like, 99.99% almost) all cases. – shylent Mar 26 '10 at 11:03
  • `collections.Counter` is actually quite slow – SilentGhost Mar 26 '10 at 11:17
  • @SilentGhost, I see, I never did benchmark for that. may be they implement that in pure python. – YOU Mar 26 '10 at 11:23
  • I have made different tests on large (MBs of text) files and there is no decrease in time if I use collections and integers instead of try/except with floats. – psihodelia Mar 26 '10 at 12:11
  • @psihodelia, I think python's try/except might be highly optimized, so thats why you don't see any difference. how about giving a try with psyco, its supposed to do just-in-time compiling for your looping codes. – YOU Mar 26 '10 at 12:30
  • I've posted unicode-aware numpy-based python solution. It takes 0.17 seconds to process big.txt (vs. 0.07 seconds for simple ascii C version, vs. 2.72 seconds for collections.Counter). http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2525617#2525617 – jfs Mar 28 '10 at 09:21
  • @J.F. Thats great benchmarks. Actually I was writing C Char Counter Extentions to Python too. Its bad we don't have alert for new other answers and edits in SO. +1ed to yours btw. – YOU Mar 28 '10 at 09:31
  • I've added C Char Counter Extension in seperate answer, instead of adding this. http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2532564#2532564 – YOU Mar 28 '10 at 10:55
  • 1
    @shylent: There is nothing wrong with `try: d[key]+=1 \n except KeyError: \n d[key] = 1` It is easy to understand and it is fast if there is small number of different keys compared to total number of elements e.g., as in `big.txt` the ratio is ~100/1e6. Considering that the `for`-loop is implemented using `StopIteration` exception it can't be that bad to use `try/except`. – jfs Mar 30 '10 at 19:27
  • @J.F. Sebastian: I was talking specifically about the "bare" except, that is an except statement, that catches everything (I thought, I've made that rather clear). There is an extremely small number of valid use cases for that. – shylent Mar 31 '10 at 06:25
10

I've written a Char Counter C extension for Python; it looks like it's 300x faster than collections.Counter and 150x faster than collections.defaultdict(int).

C Char Counter : 0.0469999313354 s
93 chars {u' ': 1036511, u'$': 110, u'(': 1748, u',': 77675, u'0': 3064, u'4': 2417, u'8': 2527, u'<': 2, u'@': 8,

Here is the Char Counter C extension code:

static PyObject *
CharCounter(PyObject *self, PyObject *args, PyObject *keywds)
{
    wchar_t *t1;unsigned l1=0;

    if (!PyArg_ParseTuple(args,"u#",&t1,&l1)) return NULL;

    PyObject *resultList,*itemTuple;

    for(unsigned i=0;i<=0xffff;i++)char_counter[i]=0;

    unsigned chlen=0;

    for(unsigned i=0;i<l1;i++){
        if(char_counter[t1[i]]==0)char_list[chlen++]=t1[i];
        char_counter[t1[i]]++;
    }

    resultList = PyList_New(0);

    for(unsigned i=0;i<chlen;i++){
        itemTuple = PyTuple_New(2);

        PyTuple_SetItem(itemTuple, 0,PyUnicode_FromWideChar(&char_list[i],1));
        PyTuple_SetItem(itemTuple, 1,PyInt_FromLong(char_counter[char_list[i]]));

        PyList_Append(resultList, itemTuple);
        Py_DECREF(itemTuple);

    };

    return resultList;
}

Here char_counter and char_list are malloc'ed at module level, so there is no need to malloc on every function call:

char_counter=(unsigned*)malloc(sizeof(unsigned)*0x10000);
char_list=(wchar_t*)malloc(sizeof(wchar_t)*0x10000);

It returns a list of tuples:

[(u'T', 16282), (u'h', 287323), (u'e', 628234), (u' ', 1036511), (u'P', 8946), (u'r', 303977), (u'o', 382683), ...

To convert it to dict format, just dict() will do:

dict(CharCounter(text))

PS: the benchmark included the time spent converting to dict.

CharCounter accepts only a Python unicode string (u""); if the text is UTF-8, you need to .decode("utf8") first.

The input supports Unicode up to the Basic Multilingual Plane (BMP): 0x0000 to 0xFFFF.

YOU
  • I've posted updated comparison (including your extension) http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2525617#2525617 – jfs Mar 28 '10 at 21:19
  • I've added to the comparison Python extension written in Cython. It 2 times faster than the hand-written extension in C. http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2525617#2525617 – jfs Jan 23 '11 at 02:04
6

No, it's not the fastest. Since you know that the characters have a limited range, you could use a list and direct indexing, using the numeric representation of the character, to store the frequencies.
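
A minimal sketch of that idea, assuming ASCII input (the 256-entry table and variable names are illustrative):

counts = [0] * 256
for char in text:
    counts[ord(char)] += 1

freqs = dict((chr(i), n / float(len(text))) for i, n in enumerate(counts) if n)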

Tuomas Pelkonen
5

It is very, very hard to beat dict. It is very highly tuned since almost everything in Python is dict-based.

Ignacio Vazquez-Abrams
4

I'm not familiar with Python, but for finding frequencies, unless you know the range of the characters (in which case you can use an array), a dictionary is the way to go.
If you know your characters lie in a Unicode, ASCII, etc. range, you can define an array with the correct number of values.
However, this changes the space complexity from O(n) to O(size of the possible range), in exchange for improving the time complexity from O(n * (dictionary insertion/lookup time)) to O(n).
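
For example, if the characters are known to lie within the Basic Multilingual Plane, a 0x10000-entry table trades O(range) space for an O(1) update per character (a sketch; names are illustrative and Python 2's unichr is assumed):

counts = [0] * 0x10000   # space is O(range of possible characters), not O(distinct characters seen)
for char in text:
    counts[ord(char)] += 1

counts_by_char = dict((unichr(i), n) for i, n in enumerate(counts) if n)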

Rubys
2

If you are really concerned about speed, you might consider first counting characters with integers and then obtaining frequencies through (float) division.

Here are the numbers:

python -mtimeit -s'x=0' 'x+=1'      
10000000 loops, best of 3: 0.0661 usec per loop

python -mtimeit -s'x=0.' 'x+=1.'
10000000 loops, best of 3: 0.0965 usec per loop
Eric O Lebigot
2

Well, you can do it in the old-fashioned style... as we know that there are not too many different characters and they are contiguous, we can use a plain array (or list here) and use the character's ordinal number for indexing:

s = 1.0*len(text)
counts = [0]*256 # change this if working with unicode
for char in text:
    counts[ord(char)] += 1

freqs = dict((chr(i), v/s) for i, v in enumerate(counts) if v)

This will probably be faster, but only by a constant factor; both methods have the same complexity.

fortran
  • 1
    `list`-based variant is 2 times *slower* than `try/except` variant (tested on big.txt). – jfs Mar 28 '10 at 16:27
2

Using this code on Alice in Wonderland (163793 chars) and "The Bible, Douay-Rheims Version" (5649295 chars) from Project Gutenberg:

from collections import defaultdict
import timeit

def countchars():
    f = open('8300-8.txt', 'rb')
    #f = open('11.txt')
    s = f.read()
    f.close()
    charDict = defaultdict(int)
    for aChar in s:
        charDict[aChar] += 1


if __name__ == '__main__':
    tm = timeit.Timer('countchars()', 'from countchars import countchars')  
    print tm.timeit(10)

I get:

2.27324003315 #Alice in Wonderland
74.8686217403 #Bible

The ratio between the number of chars in the two books is 0.029 and the ratio between the times is 0.030, so the algorithm is O(n) with a very small constant factor. Fast enough for most (all?) purposes, I should think.

Chinmay Kanchi
2

If the data is in a single-byte encoding, you can use numpy to accelerate the counting process:

import numpy as np

def char_freq(data):
    counts = np.bincount(np.frombuffer(data, dtype=np.byte))
    freqs = counts.astype(np.double) / len(data)
    return dict((chr(idx), freq) for idx, freq in enumerate(freqs) if freq > 0)

Some quick benchmarking shows that this is about 10x faster than aggregating to a defaultdict(int).
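
For instance (the file name is illustrative; note that np.byte is signed, so this assumes plain-ASCII data, otherwise bincount would see negative values):

freqs = char_freq(open('ascii_text.txt', 'rb').read())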

Ants Aasma
  • I've posted numpy-based solution that supports unicode http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character/2525617#2525617 – jfs Mar 28 '10 at 09:12
  • This answer is highly ranked on the top answer, shouldn't the votes be higher? – Aaron Hall May 12 '14 at 04:24
1

To avoid the try/except overhead, you can use a defaultdict.
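
For example (a minimal sketch reusing the names from the question):

from collections import defaultdict

freqs = defaultdict(float)
for char in text:
    freqs[char] += P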

Xavier Combelle
1

A small speedup will come from using the dict.setdefault method; that way you will not pay a rather big price for every newly encountered character:

for char in text:
    freq[char] = freq.setdefault(char, 0.0) + P

As a side note: having a bare except: is considered very bad practice.

Łukasz
  • 2
    Please use `collections.defaultdict` instead of this. It's simpler and faster still. Also, the floating-point in the question is **really** bad. – S.Lott Mar 26 '10 at 10:07
  • `setdefault` is almost 3 times *slower* than `try/except` variant tested on big.txt file. – jfs Mar 28 '10 at 16:25