
Lately I've been refactoring some bash scripts into Python 3.7, both as a learning exercise and for real use in a project. The resulting implementation uses a very large ordered dictionary, around 2 to 3 million entries. Storing the data this way has some significant advantages, reducing code complexity and processing time. However, there is one task that eludes me: how to step through the dictionary from a known starting point.

If I were doing this in C, I would make a pointer to the desired start point and walk it. If there is an analogous operation in Python, I don't know it and can't find it. All the techniques I've found seem to duplicate some or all of the information into a new list, which would be time-consuming and waste a lot of memory in my application. It also seems that you can't slice a dictionary, even though dictionaries are now ordered by default.
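For example, the kind of approach I keep finding looks like this, which copies every entry into a new list just to slice it:

items = list(dd.items())        # duplicates all 2-3 million entries
chunk = items[1000000:1000004]  # slicing now works, but memory roughly doubles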

Consider this contrived example dictionary of the Latin alphabet, whose strangely keyed entries are grouped by vowels and consonants, and the entries within each group are sorted alphabetically:

dd = { #   key:  (  phonetic, letter, ascii, ebcedic, baudot, morse,  hollerith, strokes,  kind     )
    4296433290:  ( 'Alfa',     'A',    65,     193,     3,    '.-',     (12,1),     3,    'vowl'    ),
    5046716526:  ( 'Echo',     'E',    69,     197,     1,    '.',      (12,5),     4,    'vowl'    ),
    5000200584:  ( 'India',    'I',    73,     201,     6,    '..',     (12,9),     3,    'vowl'    ),
    5000971262:  ( 'Oscar',    'O',    79,     214,     24,   '---',    (11,6),     1,    'vowl'    ),
    5000921625:  ( 'Uniform',  'U',    85,     228,     7,    '..-',    (0,4),      1,    'vowl'    ),
    4297147083:  ( 'Yankee',   'Y',    89,     232,     21,   '-.--',   (0,8),      3,    'vowl'    ),
    4297256046:  ( 'Bravo',    'B',    66,     194,     25,   '-...',   (12,2),     3,    'cons'    ),
    4298140290:  ( 'Charlie',  'C',    67,     195,     14,   '-.-.',   (12,3),     1,    'cons'    ),
    5036185622:  ( 'Delta',    'D',    68,     196,     9,    '-..',    (12,4),     2,    'cons'    ),
    5036854221:  ( 'Foxtrot',  'F',    70,     198,     13,   '..-.',   (12,6),     3,    'cons'    ),
    5037458768:  ( 'Golf',     'G',    71,     199,     26,   '--.',    (12,7),     2,    'cons'    ),
    5035556903:  ( 'Hotel',    'H',    72,     200,     20,   '....',   (12,8),     3,    'cons'    ),
    5037119814:  ( 'Juliett',  'J',    74,     209,     11,   '.---',   (11,1),     2,    'cons'    ),
    5035556831:  ( 'Kilo',     'K',    75,     210,     15,   '-.-',    (11,2),     3,    'cons'    ),
    4296755665:  ( 'Lima',     'L',    76,     211,     18,   '.-..',   (11,3),     2,    'cons'    ),
    5035557110:  ( 'Mike',     'M',    77,     212,     28,   '--',     (11,4),     4,    'cons'    ),
    5037118125:  ( 'November', 'N',    78,     213,     12,   '-.',     (11,5),     3,    'cons'    ),
    5000423356:  ( 'Papa',     'P',    80,     215,     22,   '.--.',   (11,7),     2,    'cons'    ),
    5000923300:  ( 'Quebec',   'Q',    81,     216,     23,   '--.-',   (11,8),     2,    'cons'    ),
    5000969482:  ( 'Romeo',    'R',    82,     217,     10,   '.-.',    (11,9),     3,    'cons'    ),
    5035943840:  ( 'Sierra',   'S',    83,     226,     5,    '...',    (0,2),      1,    'cons'    ),
    5045251209:  ( 'Tango',    'T',    84,     227,     16,   '-',      (0,3),      2,    'cons'    ),
    5000168680:  ( 'Victor',   'V',    86,     229,     30,   '...-',   (0,5),      2,    'cons'    ),
    4296684445:  ( 'Whiskey',  'W',    87,     230,     19,   '.--',    (0,6),      4,    'cons'    ),
    5000923277:  ( 'Xray',     'X',    88,     231,     29,   '-..-',   (0,7),      2,    'cons'    ),
    4296215569:  ( 'Zulu',     'Z',    90,     233,     17,   '--..',   (0,9),      3,    'cons'    ),
}

Let's say I want to perform some processing on the consonants. Since the processing takes a long time (think days), I would like to do it in chunks; in this case, say 4 consonants at a time. I know ahead of time the key for the beginning of each group, for example:

vowlbeg = 4296433290 # key of first vowel
consbeg = 4297256046 # key of first consonant

But I can't figure out how to take advantage of this foreknowledge. For example, to process the 8th through 11th consonant, the best I can do is:

beg = 8 # begin processing with 8th consonant
end = 12 # end processing with 11th consonant
kind = 'cons' # desired group
i = 0
for d in dd.items():
    if d[1][-1] != kind: continue  # compare by value, not identity ('is')
    i += 1
    if i < beg: continue
    if i >= end: break
    print('processing:', i, d)

This gives the desired results, albeit a bit slowly, because I walk through the whole dictionary from the beginning until I encounter the desired entries.

processing: 8 (5035556831, ('Kilo', 'K', 75, 210, 15, '-.-', (11, 2), 3, 'cons'))
processing: 9 (4296755665, ('Lima', 'L', 76, 211, 18, '.-..', (11, 3), 2, 'cons'))
processing: 10 (5035557110, ('Mike', 'M', 77, 212, 28, '--', (11, 4), 4, 'cons'))
processing: 11 (5037118125, ('November', 'N', 78, 213, 12, '-.', (11, 5), 3, 'cons'))

I think I can express this loop more compactly using a list or dictionary comprehension, but it seems that would create a huge duplicate in memory. Maybe the method above does that too; I'm not 100% sure.
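For example, the most compact version I can manage uses a generator expression plus itertools.islice for the positional windowing; it avoids building a list, but it still scans from the beginning:

from itertools import islice

cons = (d for d in dd.items() if d[1][-1] == kind)  # lazy: nothing is copied
for i, d in enumerate(islice(cons, beg - 1, end - 1), start=beg):
    print('processing:', i, d)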

Things I know about my ordered dictionary:

  • the groups, e.g., vowels and consonants, are indeed grouped and not scattered;
  • within each group, the entries are sorted in a known, desired order;
  • the beginning key of each group is known.

Q: Is there a better way to do this? My backup plan is to just bite the bullet and keep a duplicate set of tuples, one per group, in order to be able to slice it. But that will essentially double my memory, as best I understand it.
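For concreteness, the backup plan would be something like this sketch, one sliceable tuple per group:

cons = tuple((k, v) for k, v in dd.items() if v[-1] == 'cons')
chunk = cons[7:11]  # Kilo through November, finally sliceable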

Note: It's not evident from this silly example, but being able to access entries by keys in a single dictionary is a HUGE advantage in my application.

Netwave
  • Dictionaries are made for key lookups. You are trying to do a lookup by (part of) the value. You should implement an index dictionary that maps the values you look up to the sequence of keys where they are found. – Klaus D. May 03 '19 at 04:41
  • Do you ever access the data by key? If not, keep the “ordered” part and make it a list. Even if so, you can map the keys to their indices in order to have both modes of access with no duplication. – Davis Herring May 03 '19 at 06:43
  • I do access the data by key; in fact, that usage is the core of the algorithm. The need to walk through the list in sequence is a minor, administrative task. – TheStumbler May 03 '19 at 07:01
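A minimal sketch of the two-way-access idea from the comments above (my illustration; keys and pos are made-up names): keep a flat list of the keys plus a map from key to position, so both access modes work without duplicating the large value tuples:

keys = list(dd)                           # positional access: keys[i]
pos = {k: i for i, k in enumerate(keys)}  # key -> position
start = pos[5035556831]                   # position of 'Kilo'
for k in keys[start:start + 4]:           # walk 4 entries from a known key
    print(dd[k][0])                       # Kilo, Lima, Mike, November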

4 Answers


Rather than making a copy of the entire dictionary, there is an easier scheme in which you just need to keep a copy of all the keys in a separate linked list.

dd_list = LinkedList(4296433290, 5046716526, 5000200584, ... 4296215569)

Then, in each entry of the original dictionary, also keep a reference to the linked-list node corresponding to that key:

dd = { 
    4296433290:  ( <reference to the linked-list entry of 4296433290>, 'Alfa', ...),
    5046716526:  ( <reference to the linked-list entry of 5046716526>, 'Echo', ...),
    .....
    .....
    .....
    4296215569:  ( <reference to the linked-list entry of 4296215569>, 'Zulu', ...)
}

Now if you want to iterate over 3 entries starting 5 entries after 4297256046, you just need to do:

entry_iterator = dd[4297256046][0]
i = 0
while i < 5:
    # Skip 5 entries
    entry_iterator = entry_iterator.next()
    i += 1

num_iterations = 0
while num_iterations < 3:
    key = entry_iterator.value
    entry = dd[key]
    process_entry(entry)
    entry_iterator = entry_iterator.next()
    num_iterations += 1

The reason I mentioned a linked list is that if you ever want to delete entries from the map, you'll also be able to delete the corresponding entry from the linked list in O(1) time.
If there are no deletions, you can just use a regular array, and keep the integer array indices as the <reference to the linked-list entry of ...>.

Note that Python does not have a built-in linked-list data structure. However, you'll be able to find plenty of high-quality implementations online.
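For illustration, a minimal node class supporting the value/next() operations used above might look like this (my own sketch, not a library API):

class Node:
    """One linked-list entry; value holds a dictionary key."""
    def __init__(self, value):
        self.value = value  # the dd key stored at this position
        self.prev = None    # previous node, enabling O(1) deletion
        self._next = None   # following node

    def next(self):
        return self._next

# Chain one node per key, in the dictionary's insertion order:
nodes = [Node(k) for k in dd]
for a, b in zip(nodes, nodes[1:]):
    a._next, b.prev = b, a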

EDIT:

Sample code for the array case:

dd_list = [4296433290, 5046716526, 5000200584, ... 4296215569]

dd = { 
    4296433290:  ( 0, 'Alfa', ...),
    5046716526:  ( 1, 'Echo', ...),
    .....
    .....
    .....
    4296215569:  ( 25, 'Zulu', ...)
}

entry_index = dd[4297256046][0]
# Skip 5 entries
entry_index += 5

num_iterations = 0
while num_iterations < 3:
    key = dd_list[entry_index]
    entry = dd[key]
    process_entry(entry)
    entry_index += 1
    num_iterations += 1

Anmol Singh Jaggi
  • I'm not sure I grasp your example code, but your concept is fine. I can duplicate the keys in a separate list. I see no reason for a linked list in my application, as the dictionary is essentially a constant. – TheStumbler May 03 '19 at 07:10
  • That code is just pseudo-code as there is no built-in linked-list in Python. If it is constant, then you can simply use an array instead of linked-list. I'll edit the code for the array case. – Anmol Singh Jaggi May 03 '19 at 07:19
  • Check the edit. I have also improved the old code. There was one mistake. – Anmol Singh Jaggi May 03 '19 at 07:26

For a simple solution using the Python built-ins, you can create a list of keys and then start from any point in the list at the expense of some memory usage for materializing the list. See below for an interactive session demonstrating the point.

It should be easy to do a loop over any range of keys using this technique.
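For example, with the question's dd (a sketch; the slice bounds are arbitrary):

keys = list(dd.keys())  # one-time copy of the keys only, not the values
for k in keys[6:10]:    # any positional slice: here Bravo through Foxtrot
    print(dd[k][0])     # process the entry looked up by its key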

1> data = {id: (id, "a") for id in range(10)}

2> data
{0: (0, 'a'), 1: (1, 'a'), 2: (2, 'a'), 3: (3, 'a'), 4: (4, 'a'), 5: (5, 'a'), 6: (6, 'a'), 7: (7, 'a'), 8: (8, 'a'), 9: (9, 'a')}

3> data.keys()
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

4> data.keys()[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dict_keys' object does not support indexing

5> keys = list(data.keys())

6> keys[5]
5

7> data[keys[5]]
(5, 'a')
  • Step 1: Create some sample data similar to yours
  • Step 2: Demonstrate the structure
  • Step 3: Get the dict_keys view for the structure
  • Step 4: Demonstrate that you can't index to a specific point in the native dict_keys view
  • Step 5: Materialize the dict_keys view as an actual list
  • Step 6: Demonstrate getting a key from anywhere in the list
  • Step 7: Pull the data from the dict using an arbitrary key
Steve McCartney

From experience, working with a large volume of data like this by looping and building new objects is impractical: in my experience, it uses up about twice the dictionary's byte size in RAM.

A couple of suggestions:

  1. Look into storing this in a dataframe. There's a reason why the pandas package is widely adopted: it relies on optimized backends (someone correct me if I'm wrong: numpy, and by extension pandas, is largely compiled C under the hood) that trump whatever base Python can do. This would be a fairly easy task with either pandas or dask and would perform reasonably well.

    # file.py
    import pandas as pd

    cols = ['key', 'phonetic', 'letter', 'ascii', 'ebcedic', 'baudot', 'morse', 'hollerith', 'strokes', 'kind']

    # Rows become entries; the original dict keys land in the 'key' column
    test = pd.DataFrame(dd).transpose().reset_index()
    test.columns = cols

    def get_letters(begin, end, kind):
        # Filter to the group, renumber the rows, then take positions begin..end-1 (1-based)
        return test[test['kind'] == kind].reset_index(drop=True).iloc[begin-1:end-1]

    output = get_letters(8, 12, 'cons')

    # Convert back to a dict keyed by the original keys
    final = output.set_index('key').transpose().to_dict('list')

    # runtime >>> mean 6.82 ms, std: 93.9 us
    
  2. If you're intent on using base Python structures, comprehensions are definitely the way to go. When you're creating a new "grouped" Python object (a list, dict, or tuple) from another grouped object, comprehensions often scale much better than the standard "loop and append" tactic. Explicit if-else loops should be reserved for cases where you aren't actually creating a new grouped object. Even when there's complicated control flow and logic to work through before creating the new grouped object, I always elect to use comprehensions, and often just create "helper" functions for readability. I'd do it this way:

    def helper(dictionary, begin, end, kind):
        # Keep only entries of the requested group
        filtered = {k: v for k, v in dictionary.items() if v[-1] == kind}

        # Then keep group positions begin..end-1 (1-based)
        return [d for n, d in enumerate(filtered.values()) if begin - 1 <= n < end - 1]

    helper(dd, 8, 12, 'cons')
    
    # runtime >>> mean: 1.61ms, std: 58.5 us
    

Note: although these runtimes seem to show base Python as the faster mechanism, I'm confident that on bigger dictionaries, the pandas / dask method will outperform the base code.

zero

If you wanted to try this with dask, here are 2 possible approaches.

Imports

import numpy as np
import pandas as pd
import dask.dataframe as ddd
from dask import delayed, compute
from dask.diagnostics import ProgressBar
import time

Define a list of column names

h = [
    'phonetic',
    'letter',
    'ascii',
    'ebcedic',
    'baudot',
    'morse',
    'hollerith',
    'strokes',
    'kind'
    ]

Create a Dask DataFrame from the dictionary dd (per this SO post)

def make_df(d):
    return pd.DataFrame.from_dict(d, orient='index')

dpd = [delayed(make_df)(dd)]
ddf = ddd.from_delayed(dpd)
ddf.columns = h
ddf.head()
           phonetic letter  ascii  ebcedic  baudot morse hollerith  strokes  kind
4296433290     Alfa      A     65      193       3    .-   (12, 1)        3  vowl
5046716526     Echo      E     69      197       1     .   (12, 5)        4  vowl
5000200584    India      I     73      201       6    ..   (12, 9)        3  vowl
5000971262    Oscar      O     79      214      24   ---   (11, 6)        1  vowl
5000921625  Uniform      U     85      228       7   ..-    (0, 4)        1  vowl

Get number of partitions in DataFrame

print(ddf.npartitions)
1
  • The 2 Dask methods below only work with one partition for the DataFrame.

Dask approach 1 - using .map_partitions

  • here, you define a helper function that selects the rows matching the kind column:
%time
def slicer(df, kind):
    return df[df['kind']==kind]

ddf2 = ddf.map_partitions(slicer, 'cons', meta=ddf.head(1))
with ProgressBar():
    print(ddf2.reset_index().loc[slice(8-1,12-2)].compute().head())

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.82 µs
[########################################] | 100% Completed |  0.1s
         index  phonetic letter  ascii  ebcedic  baudot morse hollerith  strokes  kind
7   5035556831      Kilo      K     75      210      15   -.-   (11, 2)        3  cons
8   4296755665      Lima      L     76      211      18  .-..   (11, 3)        2  cons
9   5035557110      Mike      M     77      212      28    --   (11, 4)        4  cons
10  5037118125  November      N     78      213      12    -.   (11, 5)        3  cons

Dask approach 2 - using .loc

%time
with ProgressBar():
    print(ddf[ddf['kind'] == 'cons'].reset_index().loc[8-1:12-2].compute().head())

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.06 µs
[########################################] | 100% Completed |  0.1s
         index  phonetic letter  ascii  ebcedic  baudot morse hollerith  strokes  kind
7   5035556831      Kilo      K     75      210      15   -.-   (11, 2)        3  cons
8   4296755665      Lima      L     76      211      18  .-..   (11, 3)        2  cons
9   5035557110      Mike      M     77      212      28    --   (11, 4)        4  cons
10  5037118125  November      N     78      213      12    -.   (11, 5)        3  cons

Pandas

%time
df = pd.DataFrame.from_dict(dd, orient='index', columns=h)
print(df[df['kind']=='cons'].reset_index().loc[slice(8-1,12-2)].head())

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.82 µs
         index  phonetic letter  ascii  ebcedic  baudot morse hollerith  strokes  kind
7   5035556831      Kilo      K     75      210      15   -.-   (11, 2)        3  cons
8   4296755665      Lima      L     76      211      18  .-..   (11, 3)        2  cons
9   5035557110      Mike      M     77      212      28    --   (11, 4)        4  cons
10  5037118125  November      N     78      213      12    -.   (11, 5)        3  cons

EDIT

When I run @zero's approach from his answer, I get

%time
print(helper(dd,8,12,'cons'))
Wall time: 8.82 µs
edesz