212

I have a data structure which essentially amounts to a nested dictionary. Let's say it looks like this:

{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Now, maintaining and creating this is pretty painful; every time I have a new state/county/profession I have to create the lower layer dictionaries via obnoxious try/catch blocks. Moreover, I have to create annoying nested iterators if I want to go over all the values.

I could also use tuples as keys, like such:

{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

This makes iterating over the values very simple and natural, but it is more syntactically painful to do things like aggregations and looking at subsets of the dictionary (e.g. if I just want to go state-by-state).

Basically, sometimes I want to think of a nested dictionary as a flat dictionary, and sometimes I want to think of it indeed as a complex hierarchy. I could wrap this all in a class, but it seems like someone might have done this already. Alternatively, it seems like there might be some really elegant syntactical constructions to do this.

How could I do this better?

Addendum: I'm aware of setdefault() but it doesn't really make for clean syntax. Also, each sub-dictionary you create still needs to have setdefault() manually set.

martineau
  • 99,260
  • 22
  • 139
  • 249
YGA
  • 7,892
  • 13
  • 37
  • 46

21 Answers21

191

What is the best way to implement nested dictionaries in Python?

This is a bad idea, don't do it. Instead, use a regular dictionary and use dict.setdefault where apropos, so when keys are missing under normal usage you get the expected KeyError. If you insist on getting this behavior, here's how to shoot yourself in the foot:

Implement __missing__ on a dict subclass to set and return a new instance.

This approach has been available (and documented) since Python 2.5, and (particularly valuable to me) it pretty prints just like a normal dict, instead of the ugly printing of an autovivified defaultdict:

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)() # retain local pointer to value
        return value                     # faster to return than dict lookup

(Note self[key] is on the left-hand side of assignment, so there's no recursion here.)

and say you have some data:

data = {('new jersey', 'mercer county', 'plumbers'): 3,
        ('new jersey', 'mercer county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'salesmen'): 62,
        ('new york', 'queens county', 'plumbers'): 9,
        ('new york', 'queens county', 'salesmen'): 36}

Here's our usage code:

vividict = Vividict()
for (state, county, occupation), number in data.items():
    vividict[state][county][occupation] = number

And now:

>>> import pprint
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

Criticism

A criticism of this type of container is that if the user misspells a key, our code could fail silently:

>>> vividict['new york']['queens counyt']
{}

And additionally now we'd have a misspelled county in our data:

>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36},
              'queens counyt': {}}}

Explanation:

We're just providing another nested instance of our class Vividict whenever a key is accessed but missing. (Returning the value assignment is useful because it avoids us additionally calling the getter on the dict, and unfortunately, we can't return it as it is being set.)

Note, these are the same semantics as the most upvoted answer but in half the lines of code - nosklo's implementation:

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Demonstration of Usage

Below is just an example of how this dict could be easily used to create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deeply as you might want to go.

import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()

d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)

Which outputs:

{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}

And as the last line shows, it pretty prints beautifully and in order for manual inspection. But if you want to visually inspect your data, implementing __missing__ to set a new instance of its class to the key and return it is a far better solution.

Other alternatives, for contrast:

dict.setdefault

Although the asker thinks this isn't clean, I find it preferable to the Vividict myself.

d = {} # or dict()
for (state, county, occupation), number in data.items():
    d.setdefault(state, {}).setdefault(county, {})[occupation] = number

and now:

>>> pprint.pprint(d, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}

A misspelling would fail noisily, and not clutter our data with bad information:

>>> d['new york']['queens counyt']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'queens counyt'

Additionally, I think setdefault works great when used in loops and you don't know what you're going to get for keys, but repetitive usage becomes quite burdensome, and I don't think anyone would want to keep up the following:

d = dict()

d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})

Another criticism is that setdefault requires a new instance whether it is used or not. However, Python (or at least CPython) is rather smart about handling unused and unreferenced new instances, for example, it reuses the location in memory:

>>> id({}), id({}), id({})
(523575344, 523575344, 523575344)

An auto-vivified defaultdict

This is a neat looking implementation, and usage in a script that you're not inspecting the data on would be as useful as implementing __missing__:

from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)

But if you need to inspect your data, the results of an auto-vivified defaultdict populated with data in the same way looks like this:

>>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint; 
>>> pprint.pprint(d)
defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict 
at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar': 
defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function 
vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>, 
{'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
<function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at 
0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})

This output is quite inelegant, and the results are quite unreadable. The solution typically given is to recursively convert back to a dict for manual inspection. This non-trivial solution is left as an exercise for the reader.

Performance

Finally, let's look at performance. I'm subtracting the costs of instantiation.

>>> import timeit
>>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
0.13612580299377441
>>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
0.2936999797821045
>>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
0.5354437828063965
>>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
2.138362169265747

Based on performance, dict.setdefault works the best. I'd highly recommend it for production code, in cases where you care about execution speed.

If you need this for interactive use (in an IPython notebook, perhaps) then performance doesn't really matter - in which case, I'd go with Vividict for readability of the output. Compared to the AutoVivification object (which uses __getitem__ instead of __missing__, which was made for this purpose) it is far superior.

Conclusion

Implementing __missing__ on a subclassed dict to set and return a new instance is slightly more difficult than alternatives but has the benefits of

  • easy instantiation
  • easy data population
  • easy data viewing

and because it is less complicated and more performant than modifying __getitem__, it should be preferred to that method.

Nevertheless, it has drawbacks:

  • Bad lookups will fail silently.
  • The bad lookup will remain in the dictionary.

Thus I personally prefer setdefault to the other solutions, and have in every situation where I have needed this sort of behavior.

Community
  • 1
  • 1
Aaron Hall
  • 291,450
  • 75
  • 369
  • 312
  • Excellent answer! Is there any way to specify a finite depth and a leaf type for a `Vividict`? E.g. `3` and `list` for a dict of dict of dict of lists which could be populated with `d['primary']['secondary']['tertiary'].append(element)`. I could define 3 different classes for each depth but I'd love to find a cleaner solution. – Eric Duminil Dec 27 '17 at 17:22
  • @EricDuminil `d['primary']['secondary'].setdefault('tertiary', []).append('element')` - ?? Thanks for the compliment, but let me be honest - I never actually use `__missing__` - I always use `setdefault`. I should probably update my conclusion/intro... – Aaron Hall Dec 27 '17 at 18:57
  • @AaronHall The correct behavior is the code should create a dict if needed. In this case by overriding the previous assigned value. – nehem Mar 13 '19 at 01:18
  • @AaronHall Also can you help me to understand what is meant by `The bad lookup will remain in the dictionary.` as I'm considering using this solution?. Much appreciated. Thx – nehem Mar 13 '19 at 02:56
  • @AaronHall The problem with it would fail `setdefault` when it nested more than two levels of depth. Looks like no structure in Python can offer true vivification as described. I had to settle for two stating methods one for `get_nested` & one for `set_nested` which accept a reference for dict and list of nested attributes. – nehem Mar 13 '19 at 04:56
188
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Testing:

a = AutoVivification()

a[1][2][3] = 4
a[1][3][3] = 5
a[1][2]['test'] = 6

print a

Output:

{1: {2: {'test': 6, 3: 4}, 3: {3: 5}}}
nosklo
  • 193,422
  • 54
  • 273
  • 281
  • Anybody have this problem when they moved to python 3.x? https://stackoverflow.com/questions/54622935/problem-loading-autovivification-file-when-moving-from-python-2-7-to-3-6-keyerr – jason Feb 11 '19 at 04:38
  • @jason `pickle` is terrible between python versions. Avoid using it to store data you want to keep. Use it only for caches and stuff you can dump and regenerate at will. Not as a long term storage or serialization method. – nosklo Feb 11 '19 at 12:09
  • What do you use to store these objects? My autovivification object contains just pandas dataframes and string. – jason Feb 11 '19 at 14:39
  • @jason Depending on the data, I like to use JSON, csv files, or even a `sqlite` database to store it. – nosklo Feb 12 '19 at 06:37
32

Just because I haven't seen one this small, here's a dict that gets as nested as you like, no sweat:

# yo dawg, i heard you liked dicts                                                                      
def yodict():
    return defaultdict(yodict)
the Tin Man
  • 150,910
  • 39
  • 198
  • 279
paint can
  • 329
  • 3
  • 2
  • 4
    @wberry: Actually all you need is `yodict = lambda: defaultdict(yodict)`. – martineau Aug 14 '13 at 18:27
  • 1
    The accepted version is a subclass of `dict`, so to be fully equivalent we would need `x = Vdict(a=1, b=2)` to work. – wberry Aug 15 '13 at 04:46
  • @wberry: Irrespective of what's in the accepted answer, being a subclass of `dict` wasn't a requirement stated by the OP, who only asked for the "best way" to implement them -- and besides, it doesn't/shouldn't matter that much in Python anyway. – martineau Mar 21 '14 at 17:43
25

You could create a YAML file and read it in using PyYaml.

Step 1: Create a YAML file, "employment.yml":

new jersey:
  mercer county:
    pumbers: 3
    programmers: 81
  middlesex county:
    salesmen: 62
    programmers: 81
new york:
  queens county:
    plumbers: 9
    salesmen: 36

Step 2: Read it in Python

import yaml
file_handle = open("employment.yml")
my_shnazzy_dictionary = yaml.safe_load(file_handle)
file_handle.close()

and now my_shnazzy_dictionary has all your values. If you needed to do this on the fly, you can create the YAML as a string and feed that into yaml.safe_load(...).

the Tin Man
  • 150,910
  • 39
  • 198
  • 279
Pete
  • 9,450
  • 7
  • 50
  • 57
  • 4
    YAML is my definitely my choice for inputting lots of deeply nested data (and configuration files, databaes mockups, etc...). If the OP doesn't want extra files lying around, just use a regular Python string in some file and parse that with YAML. – kmelvn Mar 11 '09 at 20:49
  • Good point on creating YAML strings: This would be a much cleaner approach than using the "tempfile" module repeatedly. – Pete Mar 11 '09 at 20:52
19

Since you have a star-schema design, you might want to structure it more like a relational table and less like a dictionary.

import collections

class Jobs( object ):
    def __init__( self, state, county, title, count ):
        self.state= state
        self.count= county
        self.title= title
        self.count= count

facts = [
    Jobs( 'new jersey', 'mercer county', 'plumbers', 3 ),
    ...

def groupBy( facts, name ):
    total= collections.defaultdict( int )
    for f in facts:
        key= getattr( f, name )
        total[key] += f.count

That kind of thing can go a long way to creating a data warehouse-like design without the SQL overheads.

S.Lott
  • 359,791
  • 75
  • 487
  • 757
16

If the number of nesting levels is small, I use collections.defaultdict for this:

from collections import defaultdict

def nested_dict_factory(): 
  return defaultdict(int)
def nested_dict_factory2(): 
  return defaultdict(nested_dict_factory)
db = defaultdict(nested_dict_factory2)

db['new jersey']['mercer county']['plumbers'] = 3
db['new jersey']['mercer county']['programmers'] = 81

Using defaultdict like this avoids a lot of messy setdefault(), get(), etc.

the Tin Man
  • 150,910
  • 39
  • 198
  • 279
user26294
  • 4,737
  • 2
  • 20
  • 18
10

This is a function that returns a nested dictionary of arbitrary depth:

from collections import defaultdict
def make_dict():
    return defaultdict(make_dict)

Use it like this:

d=defaultdict(make_dict)
d["food"]["meat"]="beef"
d["food"]["veggie"]="corn"
d["food"]["sweets"]="ice cream"
d["animal"]["pet"]["dog"]="collie"
d["animal"]["pet"]["cat"]="tabby"
d["animal"]["farm animal"]="chicken"

Iterate through everything with something like this:

def iter_all(d,depth=1):
    for k,v in d.iteritems():
        print "-"*depth,k
        if type(v) is defaultdict:
            iter_all(v,depth+1)
        else:
            print "-"*(depth+1),v

iter_all(d)

This prints out:

- food
-- sweets
--- ice cream
-- meat
--- beef
-- veggie
--- corn
- animal
-- pet
--- dog
---- labrador
--- cat
---- tabby
-- farm animal
--- chicken

You might eventually want to make it so that new items can not be added to the dict. It's easy to recursively convert all these defaultdicts to normal dicts.

def dictify(d):
    for k,v in d.iteritems():
        if isinstance(v,defaultdict):
            d[k] = dictify(v)
    return dict(d)
JnBrymn
  • 21,527
  • 26
  • 95
  • 140
7

I find setdefault quite useful; It checks if a key is present and adds it if not:

d = {}
d.setdefault('new jersey', {}).setdefault('mercer county', {})['plumbers'] = 3

setdefault always returns the relevant key, so you are actually updating the values of 'd' in place.

When it comes to iterating, I'm sure you could write a generator easily enough if one doesn't already exist in Python:

def iterateStates(d):
    # Let's count up the total number of "plumbers" / "dentists" / etc.
    # across all counties and states
    job_totals = {}

    # I guess this is the annoying nested stuff you were talking about?
    for (state, counties) in d.iteritems():
        for (county, jobs) in counties.iteritems():
            for (job, num) in jobs.iteritems():
                # If job isn't already in job_totals, default it to zero
                job_totals[job] = job_totals.get(job, 0) + num

    # Now return an iterator of (job, number) tuples
    return job_totals.iteritems()

# Display all jobs
for (job, num) in iterateStates(d):
    print "There are %d %s in total" % (job, num)
the Tin Man
  • 150,910
  • 39
  • 198
  • 279
andygeers
  • 6,597
  • 9
  • 45
  • 63
  • I like this solution but when I try: count.setdefault(a, {}).setdefault(b, {}).setdefault(c, 0) += 1 I get "illegal expression for augmented assignment" – dfrankow Mar 08 '11 at 19:31
7

As others have suggested, a relational database could be more useful to you. You can use a in-memory sqlite3 database as a data structure to create tables and then query them.

import sqlite3

c = sqlite3.Connection(':memory:')
c.execute('CREATE TABLE jobs (state, county, title, count)')

c.executemany('insert into jobs values (?, ?, ?, ?)', [
    ('New Jersey', 'Mercer County',    'Programmers', 81),
    ('New Jersey', 'Mercer County',    'Plumbers',     3),
    ('New Jersey', 'Middlesex County', 'Programmers', 81),
    ('New Jersey', 'Middlesex County', 'Salesmen',    62),
    ('New York',   'Queens County',    'Salesmen',    36),
    ('New York',   'Queens County',    'Plumbers',     9),
])

# some example queries
print list(c.execute('SELECT * FROM jobs WHERE county = "Queens County"'))
print list(c.execute('SELECT SUM(count) FROM jobs WHERE title = "Programmers"'))

This is just a simple example. You could define separate tables for states, counties and job titles.

Roberto Bonvallet
  • 27,307
  • 5
  • 37
  • 57
5

collections.defaultdict can be sub-classed to make a nested dict. Then add any useful iteration methods to that class.

>>> from collections import defaultdict
>>> class nesteddict(defaultdict):
    def __init__(self):
        defaultdict.__init__(self, nesteddict)
    def walk(self):
        for key, value in self.iteritems():
            if isinstance(value, nesteddict):
                for tup in value.walk():
                    yield (key,) + tup
            else:
                yield key, value


>>> nd = nesteddict()
>>> nd['new jersey']['mercer county']['plumbers'] = 3
>>> nd['new jersey']['mercer county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['programmers'] = 81
>>> nd['new jersey']['middlesex county']['salesmen'] = 62
>>> nd['new york']['queens county']['plumbers'] = 9
>>> nd['new york']['queens county']['salesmen'] = 36
>>> for tup in nd.walk():
    print tup


('new jersey', 'mercer county', 'programmers', 81)
('new jersey', 'mercer county', 'plumbers', 3)
('new jersey', 'middlesex county', 'programmers', 81)
('new jersey', 'middlesex county', 'salesmen', 62)
('new york', 'queens county', 'salesmen', 36)
('new york', 'queens county', 'plumbers', 9)
the Tin Man
  • 150,910
  • 39
  • 198
  • 279
A. Coady
  • 44,841
  • 8
  • 28
  • 36
  • 1
    This is the answer that comes closest to what I was looking for. But ideally there would be all sorts of helper functions, e.g. walk_keys() or such. I'm surprised there's nothing in the standard libraries to do this. – YGA Mar 14 '09 at 05:58
4

You can use Addict: https://github.com/mewwts/addict

>>> from addict import Dict
>>> my_new_shiny_dict = Dict()
>>> my_new_shiny_dict.a.b.c.d.e = 2
>>> my_new_shiny_dict
{'a': {'b': {'c': {'d': {'e': 2}}}}}
JnBrymn
  • 21,527
  • 26
  • 95
  • 140
4

defaultdict() is your friend!

For a two dimensional dictionary you can do:

d = defaultdict(defaultdict)
d[1][2] = 3

For more dimensions you can:

d = defaultdict(lambda :defaultdict(defaultdict))
d[1][2][3] = 4
Acumenus
  • 41,481
  • 14
  • 116
  • 107
Paula
  • 82
  • 1
  • 1
  • This answer works for only three levels at best. For arbitrary levels, consider [this answer](https://stackoverflow.com/a/8702435/832230). – Acumenus Jun 03 '17 at 21:59
4

As for "obnoxious try/catch blocks":

d = {}
d.setdefault('key',{}).setdefault('inner key',{})['inner inner key'] = 'value'
print d

yields

{'key': {'inner key': {'inner inner key': 'value'}}}

You can use this to convert from your flat dictionary format to structured format:

fd = {('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

for (k1,k2,k3), v in fd.iteritems():
    d.setdefault(k1, {}).setdefault(k2, {})[k3] = v
vartec
  • 118,560
  • 34
  • 206
  • 238
3

For easy iterating over your nested dictionary, why not just write a simple generator?

def each_job(my_dict):
    for state, a in my_dict.items():
        for county, b in a.items():
            for job, value in b.items():
                yield {
                    'state'  : state,
                    'county' : county,
                    'job'    : job,
                    'value'  : value
                }

So then, if you have your compilicated nested dictionary, iterating over it becomes simple:

for r in each_job(my_dict):
    print "There are %d %s in %s, %s" % (r['value'], r['job'], r['county'], r['state'])

Obviously your generator can yield whatever format of data is useful to you.

Why are you using try catch blocks to read the tree? It's easy enough (and probably safer) to query whether a key exists in a dict before trying to retrieve it. A function using guard clauses might look like this:

if not my_dict.has_key('new jersey'):
    return False

nj_dict = my_dict['new jersey']
...

Or, a perhaps somewhat verbose method, is to use the get method:

value = my_dict.get('new jersey', {}).get('middlesex county', {}).get('salesmen', 0)

But for a somewhat more succinct way, you might want to look at using a collections.defaultdict, which is part of the standard library since python 2.5.

import collections

def state_struct(): return collections.defaultdict(county_struct)
def county_struct(): return collections.defaultdict(job_struct)
def job_struct(): return 0

my_dict = collections.defaultdict(state_struct)

print my_dict['new jersey']['middlesex county']['salesmen']

I'm making assumptions about the meaning of your data structure here, but it should be easy to adjust for what you actually want to do.

SpoonMeiser
  • 18,419
  • 8
  • 45
  • 63
2

I like the idea of wrapping this in a class and implementing __getitem__ and __setitem__ such that they implemented a simple query language:

>>> d['new jersey/mercer county/plumbers'] = 3
>>> d['new jersey/mercer county/programmers'] = 81
>>> d['new jersey/mercer county/programmers']
81
>>> d['new jersey/mercer country']
<view which implicitly adds 'new jersey/mercer county' to queries/mutations>

If you wanted to get fancy you could also implement something like:

>>> d['*/*/programmers']
<view which would contain 'programmers' entries>

but mostly I think such a thing would be really fun to implement :D

the Tin Man
  • 150,910
  • 39
  • 198
  • 279
Aaron Maenpaa
  • 105,677
  • 10
  • 91
  • 107
  • I think this is a bad idea -- you can never predict the syntax of keys. You would still override __getitem__ and __setitem__ but have them take tuples. – YGA Mar 11 '09 at 17:38
  • 3
    @YGA You're probably right, but it's fun to think about implementing mini languages like this. – Aaron Maenpaa Mar 11 '09 at 18:25
1
class JobDb(object):
    def __init__(self):
        self.data = []
        self.all = set()
        self.free = []
        self.index1 = {}
        self.index2 = {}
        self.index3 = {}

    def _indices(self,(key1,key2,key3)):
        indices = self.all.copy()
        wild = False
        for index,key in ((self.index1,key1),(self.index2,key2),
                                             (self.index3,key3)):
            if key is not None:
                indices &= index.setdefault(key,set())
            else:
                wild = True
        return indices, wild

    def __getitem__(self,key):
        indices, wild = self._indices(key)
        if wild:
            return dict(self.data[i] for i in indices)
        else:
            values = [self.data[i][-1] for i in indices]
            if values:
                return values[0]

    def __setitem__(self,key,value):
        indices, wild = self._indices(key)
        if indices:
            for i in indices:
                self.data[i] = key,value
        elif wild:
            raise KeyError(k)
        else:
            if self.free:
                index = self.free.pop(0)
                self.data[index] = key,value
            else:
                index = len(self.data)
                self.data.append((key,value))
                self.all.add(index)
            self.index1.setdefault(key[0],set()).add(index)
            self.index2.setdefault(key[1],set()).add(index)
            self.index3.setdefault(key[2],set()).add(index)

    def __delitem__(self,key):
        indices,wild = self._indices(key)
        if not indices:
            raise KeyError
        self.index1[key[0]] -= indices
        self.index2[key[1]] -= indices
        self.index3[key[2]] -= indices
        self.all -= indices
        for i in indices:
            self.data[i] = None
        self.free.extend(indices)

    def __len__(self):
        return len(self.all)

    def __iter__(self):
        for key,value in self.data:
            yield key

Example:

>>> db = JobDb()
>>> db['new jersey', 'mercer county', 'plumbers'] = 3
>>> db['new jersey', 'mercer county', 'programmers'] = 81
>>> db['new jersey', 'middlesex county', 'programmers'] = 81
>>> db['new jersey', 'middlesex county', 'salesmen'] = 62
>>> db['new york', 'queens county', 'plumbers'] = 9
>>> db['new york', 'queens county', 'salesmen'] = 36

>>> db['new york', None, None]
{('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}

>>> db[None, None, 'plumbers']
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new york', 'queens county', 'plumbers'): 9}

>>> db['new jersey', 'mercer county', None]
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81}

>>> db['new jersey', 'middlesex county', 'programmers']
81

>>>

Edit: Now returning dictionaries when querying with wild cards (None), and single values otherwise.

Markus Jarderot
  • 79,575
  • 18
  • 131
  • 135
  • Why return lists? Seems it should either return a dictionary (so you know what each number represents) or a sum (since that's all you can really do with the list). – Ben Blank Mar 11 '09 at 20:52
1

Unless your dataset is going to stay pretty small, you might want to consider using a relational database. It will do exactly what you want: make it easy to add counts, selecting subsets of counts, and even aggregate counts by state, county, occupation, or any combination of these.

allyourcode
  • 19,438
  • 17
  • 73
  • 100
0

I have a similar thing going. I have a lot of cases where I do:

thedict = {}
for item in ('foo', 'bar', 'baz'):
  mydict = thedict.get(item, {})
  mydict = get_value_for(item)
  thedict[item] = mydict

But going many levels deep. It's the ".get(item, {})" that's the key as it'll make another dictionary if there isn't one already. Meanwhile, I've been thinking of ways to deal with this better. Right now, there's a lot of

value = mydict.get('foo', {}).get('bar', {}).get('baz', 0)

So instead, I made:

def dictgetter(thedict, default, *args):
  totalargs = len(args)
  for i,arg in enumerate(args):
    if i+1 == totalargs:
      thedict = thedict.get(arg, default)
    else:
      thedict = thedict.get(arg, {})
  return thedict

Which has the same effect if you do:

value = dictgetter(mydict, 0, 'foo', 'bar', 'baz')

Better? I think so.

uzi
  • 4,549
  • 2
  • 14
  • 11
0

You can use recursion in lambdas and defaultdict, no need to define names:

a = defaultdict((lambda f: f(f))(lambda g: lambda:defaultdict(g(g))))

Here's an example:

>>> a['new jersey']['mercer county']['plumbers']=3
>>> a['new jersey']['middlesex county']['programmers']=81
>>> a['new jersey']['mercer county']['programmers']=81
>>> a['new jersey']['middlesex county']['salesmen']=62
>>> a
defaultdict(<function __main__.<lambda>>,
        {'new jersey': defaultdict(<function __main__.<lambda>>,
                     {'mercer county': defaultdict(<function __main__.<lambda>>,
                                  {'plumbers': 3, 'programmers': 81}),
                      'middlesex county': defaultdict(<function __main__.<lambda>>,
                                  {'programmers': 81, 'salesmen': 62})})})
topkara
  • 836
  • 9
  • 15
0

I used to use this function. its safe, quick, easily maintainable.

def deep_get(dictionary, keys, default=None):
    return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split("."), dictionary)

Example :

>>> from functools import reduce
>>> def deep_get(dictionary, keys, default=None):
...     return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split("."), dictionary)
...
>>> person = {'person':{'name':{'first':'John'}}}
>>> print (deep_get(person, "person.name.first"))
John
>>> print (deep_get(person, "person.name.lastname"))
None
>>> print (deep_get(person, "person.name.lastname", default="No lastname"))
No lastname
>>>
Yuda Prawira
  • 10,249
  • 7
  • 43
  • 51
0

For the following (copied from above) is there a way to implement the append function. I am trying to use a nested dictionary to store values as array.

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)() # retain local pointer to value
    return value  

My current implementation is as follows:

totalGeneHash=Vividict()
        
for keys in GenHash:
    for second in GenHash[keys]:
        if keys in sampleHash:
            total_val = GenHash[keys][second]
                totalGeneHash[gene][keys].append(total_val)
This is the error I get: AttributeError: 'Vividict' object has no attribute 'append'
Mdhale
  • 663
  • 2
  • 9
  • 20