
I'm extracting instances of three elements from an XML file: ComponentStr, keyID, and valueStr. Whenever I find a ComponentStr, I want to associate the keyID:valueStr pair with it. ComponentStr values are not unique. As multiple occurrences of a ComponentStr are read, I want to accumulate the keyID:valueStr pairs for that ComponentStr group. The resulting accumulated data structure after reading the XML file might look like this:

ComponentA: key1:value1, key2:value2, key3:value3

ComponentB: key4:value4

ComponentC: key5:value5, key6:value6

After I generate the final data structure, I want to sort the keyID:valueStr entries within each ComponentStr and also sort all the ComponentStrs.

I'm trying to structure this data in Python 2. The ComponentStr values seem to work well as a set. The keyID:valueStr pairs are clearly a dict. But how do I associate a ComponentStr entry in a set with its dict entries?

Alternatively, is there a better way to organize this data besides a set and associated dict entries? Each keyID is unique. Perhaps I could have one dict of keyID:some combo of ComponentStr and valueStr? After the data structure was built, I could sort it based on ComponentStr first, then perform some type of slice to group the keyID:valueStr and then sort again on the keyID? Seems complicated.

Aaron Hall
  • Steven, welcome to Stackoverflow, nice question by the way, remember to accept the answer that works the best for you by clicking the checkmark next to it, it will give you +2 to your rep. – Aaron Hall Jun 16 '14 at 04:47

2 Answers


How about a dict of dicts?

data = {
'ComponentA': {'key1':'value1', 'key2':'value2', 'key3':'value3'},
'ComponentB': {'key4':'value4'},
'ComponentC': {'key5':'value5', 'key6':'value6'},
}

It maintains your data structure and mapping. Interestingly enough, the underlying implementation of dicts is similar to the implementation of sets.

This would be easily constructed à la this pseudo-code:

data = {}
for file in files:
    data[get_component(file)] = {}
    for key, value in get_data(file):
        data[get_component(file)][key] = value

When components repeat, the code above would replace the earlier sub-dict with an empty one. You need an empty sub-dict as the default, but must add to the existing one if it's already there. I prefer setdefault to other solutions like a defaultdict or subclassing dict with a __missing__ method, as long as I only have to do it once or twice in my code:

data = {}
for file in files:
    for key, value in get_data(file):
        data.setdefault(get_component(file), {})[key] = value

It works like this:

>>> d = {}
>>> d.setdefault('foo', {})['bar'] = 'baz'
>>> d
{'foo': {'bar': 'baz'}}
>>> d.setdefault('foo', {})['ni'] = 'ichi'
>>> d
{'foo': {'ni': 'ichi', 'bar': 'baz'}}
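
For comparison, here is a minimal sketch of the two alternatives named above, a defaultdict and a dict subclass with `__missing__`. The `records` list and the `AutoDict` class name are made up for illustration and stand in for whatever the XML parsing actually yields.

```python
from collections import defaultdict

# Hypothetical (component, key, value) triples standing in for the parsed XML.
records = [
    ('ComponentA', 'key1', 'value1'),
    ('ComponentB', 'key4', 'value4'),
    ('ComponentA', 'key2', 'value2'),
]

# Alternative 1: defaultdict creates the inner dict on first access.
by_default = defaultdict(dict)
for component, key, value in records:
    by_default[component][key] = value


# Alternative 2: a dict subclass whose __missing__ hook does the same.
class AutoDict(dict):
    def __missing__(self, component):
        # Called by dict.__getitem__ only when the key is absent.
        inner = self[component] = {}
        return inner


by_missing = AutoDict()
for component, key, value in records:
    by_missing[component][key] = value
```

Both end up with the same nested mapping as the setdefault version. The trade-off is that merely reading a missing key from either of them silently creates an empty entry, which setdefault only does where you call it.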

Alternatively, since your comment on the other answer says you need simple code, you can keep it really simple with some more verbose and less optimized code:

data = {}
for file in files:
    for key, value in get_data(file):
        if get_component(file) not in data:
            data[get_component(file)] = {}
        data[get_component(file)][key] = value

You can then sort when you're done collecting the data.

for component in sorted(data):
    print(component)
    print('-----')
    for key in sorted(data[component]):
        print('{} {}'.format(key, data[component][key]))  # one string, so Python 2 doesn't print a tuple
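
If you'd rather keep the sorted result around than print it straight away, something like the following sketch might do. The sample `data` dict mirrors the example in the question; `ordered` and `lines` are names made up here. It is written to run unchanged on Python 2.7 and Python 3.

```python
# Sample data shaped like the dict of dicts above, deliberately out of order.
data = {
    'ComponentC': {'key6': 'value6', 'key5': 'value5'},
    'ComponentA': {'key3': 'value3', 'key1': 'value1', 'key2': 'value2'},
    'ComponentB': {'key4': 'value4'},
}

# sorted() over a dict iterates its keys; .items() yields (key, value) pairs,
# and tuples compare element by element, so the pairs come out sorted by key.
ordered = [(component, sorted(data[component].items()))
           for component in sorted(data)]

# Format each component the way the question lays it out.
lines = []
for component, pairs in ordered:
    joined = ', '.join('{}:{}'.format(k, v) for k, v in pairs)
    lines.append(component + ': ' + joined)

print('\n'.join(lines))
```

The dict itself stays unsorted; `ordered` is a separate list built from it, so lookups by component and key keep working on `data`.
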
Aaron Hall
  • Thanks for the help! After posting the question, I took a walk to clear my head and during that I also came up with the dict of dicts approach. But before adding a ComponentX to the dict, won't I have to check to see if it already exists in the dict? Or can I simply add the {ComponentX:{keyID:valueStr}} entry to the dict and Python will handle it appropriately? ("Appropriately" in this case is: If ComponentX not in dict, add it. Then add {ComponentX:{keyID:valueStr}}.) – Steven Calwas Jun 16 '14 at 05:25
  • No, in the case where you'll have more than one, you'll need to use something like setdefault, a defaultdict, or subclass a dict with `__missing__` (see http://stackoverflow.com/questions/635483/what-is-the-best-way-to-implement-nested-dictionaries-in-python/19829714#19829714) . I'll explain in the answer with setdefault. – Aaron Hall Jun 16 '14 at 05:28

I want to accumulate the keyID:valueStr for that ComponentStr group

In this case you want the ComponentStr values to be the keys of your dictionary. To me, "accumulating" immediately suggests a list, which is easily ordered.

Each keyID is unique. Perhaps I could have one dict of keyID:some combo of ComponentStr and valueStr?

You should store your data in a manner that is the most efficient when you want to retrieve it. Since you will be accessing your data by the component, even though your keys are unique there is no point in having a dictionary that is accessed by your key (since this is not how you are going to "retrieve" the data).
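
To make the retrieval argument concrete, here is a small sketch comparing the flat layout from the question with a per-component layout. The sample values and variable names are invented for illustration.

```python
# Flat layout from the question: keyID -> (ComponentStr, valueStr).
flat = {
    'key1': ('ComponentA', 'value1'),
    'key4': ('ComponentB', 'value4'),
    'key2': ('ComponentA', 'value2'),
}

# Getting one component's pairs means scanning every entry.
scanned = sorted((key, value)
                 for key, (component, value) in flat.items()
                 if component == 'ComponentA')

# Per-component layout: the pairs for a component are one lookup away.
nested = {
    'ComponentA': {'key1': 'value1', 'key2': 'value2'},
    'ComponentB': {'key4': 'value4'},
}
direct = sorted(nested['ComponentA'].items())
```

Both give the same pairs, but the flat version has to filter the whole dict each time you ask for a component.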

So, with that - how about using a defaultdict with a list, since you really want all items associated with the same component:

from collections import defaultdict

d = defaultdict(list)

with open('somefile.xml', 'r') as f:
   for component, key, value in parse_xml(f):
       d[component].append((key, value))

Now you have, for each component, a list of tuples holding its associated keys and values.
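
`parse_xml` is left undefined above. A rough sketch of what it could look like with the standard library's ElementTree follows. The `<entries>`/`<entry>` nesting is purely an assumption, since the question doesn't show the real file layout, and an in-memory buffer stands in for the file.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_xml(f):
    """Yield (component, key, value) triples; the <entry> layout is assumed."""
    for entry in ET.parse(f).getroot().iter('entry'):
        yield (entry.findtext('ComponentStr'),
               entry.findtext('keyID'),
               entry.findtext('valueStr'))


# Hypothetical sample document standing in for somefile.xml.
sample = io.BytesIO(
    b"<entries>"
    b"<entry><ComponentStr>ComponentA</ComponentStr>"
    b"<keyID>key2</keyID><valueStr>value2</valueStr></entry>"
    b"<entry><ComponentStr>ComponentB</ComponentStr>"
    b"<keyID>key4</keyID><valueStr>value4</valueStr></entry>"
    b"<entry><ComponentStr>ComponentA</ComponentStr>"
    b"<keyID>key1</keyID><valueStr>value1</valueStr></entry>"
    b"</entries>"
)

d = defaultdict(list)
for component, key, value in parse_xml(sample):
    d[component].append((key, value))
```

After the loop, `d['ComponentA']` holds its two pairs in the order they were read, and `d['ComponentB']` holds one.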

If you want to keep the components in the order that they are read from the file, you can use an OrderedDict (also from the collections module), but if you want to sort them in any arbitrary order, then stick with a normal dictionary.
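
A minimal sketch of the OrderedDict variant, assuming the triples have already been parsed into a hypothetical `records` list. OrderedDict has no default factory, so setdefault supplies the empty list here.

```python
from collections import OrderedDict

# Hypothetical triples in the order they appear in the file.
records = [
    ('ComponentB', 'key4', 'value4'),
    ('ComponentA', 'key2', 'value2'),
    ('ComponentA', 'key1', 'value1'),
]

d = OrderedDict()
for component, key, value in records:
    # setdefault inserts the empty list only the first time a component is seen.
    d.setdefault(component, []).append((key, value))

first_seen_order = list(d)
```

Iterating `d` gives the components in first-seen order, ComponentB before ComponentA, rather than alphabetical order.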

To get a list of sorted component names, just sort the keys of the dictionary:

component_sorted = sorted(d.keys())

For a use case of printing the sorted components with their associated key/value pairs, sorted by their keys:

for key in component_sorted:
   values = d[key]
   sorted_values = sorted(values, key=lambda x: x[0])  # Sort by the keys
   print('Pairs for {}'.format(key))
   for k,v in sorted_values:
       print('{} {}'.format(k,v)) 
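
If the lambda looks heavy, there are simpler spellings of the same sort. This sketch uses a made-up `values` list. Because tuples compare element by element and each keyID is unique, a plain `sorted(values)` gives the same order here.

```python
from operator import itemgetter

# A sample list of (key, value) tuples for one component.
values = [('key3', 'value3'), ('key1', 'value1'), ('key2', 'value2')]

by_lambda = sorted(values, key=lambda pair: pair[0])  # as in the loop above
by_itemgetter = sorted(values, key=itemgetter(0))     # same sort, no lambda
by_plain = sorted(values)                             # tuples sort by first item
```

All three produce the pairs in key1, key2, key3 order. The plain form only matches the others because no two tuples share a key.
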
Burhan Khalid
  • I don't think keeping a list of two-tuples of keys and values is as good as a dict. – Aaron Hall Jun 16 '14 at 04:27
  • It's better if the requirement is to sort, because dictionaries are unsorted. – Burhan Khalid Jun 16 '14 at 04:32
  • I would argue it's worse. I demonstrate the ease with which one may sort dicts, and the asker doesn't care about insertion order. – Aaron Hall Jun 16 '14 at 04:37
  • You are not sorting dicts, you are just sorting the keys, which is a list. – Burhan Khalid Jun 16 '14 at 04:38
  • That's right, because a dict is a mapping of keys to values. Sorted naturally returns a list, but accessing the values in the data structure by keys becomes much more difficult with your proposed method. – Aaron Hall Jun 16 '14 at 04:41
  • ...which is not a requirement of the OP in the first place :) The requirement is to be able to sort the keys, but not access them in a sorted manner. Perhaps for display purposes the OP wants to show values sorted by keys for each component. – Burhan Khalid Jun 16 '14 at 04:43
  • neither is maintaining insertion order, but let's agree to disagree. – Aaron Hall Jun 16 '14 at 04:45
  • Insertion order is not a requirement of the OP, so we may be arguing just because (well for me, its the first cup of tea). OrderedDict though would take care of that :P – Burhan Khalid Jun 16 '14 at 04:55
  • Thanks for the answer! I'm sure this would work. But the program I write will have to be understood by people who understand Python even less than I do, and that sort algorithm looks daunting. :-) Thanks again! – Steven Calwas Jun 16 '14 at 05:29