
Junior Python programmer here, and I've been beating my head against a brick wall over unexpected for-loop and dictionary behavior. I'm looping through a CSV file of log entries and parsing the data into a categories dict. When I initialize the categories dict each time through the loop, it works as expected.

Like so:

log_entries = AutoVivification()
# http://stackoverflow.com/questions/635483/what-is-the-best-way-to-implement-nested-dictionaries-in-python

def scrublooper(log_file):

    for ll in log_file:
        # Initialize categories dict every round through the loop
        categories = {'requests': {'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 0, 'Pages': 0, 'Content_Files': 0}, 'filter_action': {'re': 0, 'pl': 0, 'bs': 0}}
        lld = LogDomain(ll)
        domain, hostname, lan_host = lld.domain, lld.hostname, lld.lan_host

        mimetypes = url_searcher(Settings.mimetypes, lld.mime_type)

        if mimetypes:
            category = mimetypes[2]

            # First time this domain is seen for this host: attach the counters
            if not log_entries[lan_host].has_key(domain):
                log_entries[lan_host][domain] = categories

            log_entries[lan_host][domain]['requests'][category] += 1

print log_entries['192.168.5.210']['google.com']['requests']
print log_entries['192.168.5.210']['webtrendslive.com']['requests']
print log_entries['192.168.5.210']['osnews.com']['requests']
print log_entries['192.168.5.210']['question-defense.com']['requests']
print log_entries['192.168.5.210']['optimost.com']['requests']

Output from this loop is what I would expect:

{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 95, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 1, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 2, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 18, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 3, 'Pages': 0, 'Content_Files': 0}

HOWEVER! Here is my problem. I don't want to initialize the categories dict every time through the loop. In this simplified example case it doesn't matter, but down the road for this program, it'll cause significant performance degradation (30%).

I need to initialize the categories dict ONCE:

log_entries = AutoVivification()
categories = {'requests': {'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 0, 'Pages': 0, 'Content_Files': 0}, 'filter_action': {'re': 0, 'pl': 0, 'bs': 0}}

def scrublooper(log_file):

    for ll in log_file:
        lld = LogDomain(ll)
        # etc, etc, etc

However, when I initialize the categories dict ANYWHERE outside the for loop (whether in the scrublooper function or simply right after the log_entries variable), the output is:

{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 685, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 685, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 685, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 685, 'Pages': 0, 'Content_Files': 0}
{'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 685, 'Pages': 0, 'Content_Files': 0}

All 'Content_Text' values have incremented equally! What is happening here? I'm sure I'm violating some Python principle, but I don't know which one or how to find out. It took me hours simply to figure out that the problem was connected to the categories dict.

Much obliged for any explanation.

Alex Naspo
Thinkwell
  • So where is `categories` actually being manipulated? I only ever see it being initialized and being read from. Did you simplify this a little too far for SO? – Martijn Pieters Jun 05 '12 at 15:03
  • I don't think it's simplified too far. It's always assigned, never manipulated. Assigned to a `log_entries` dict key/value: `if not log_entries[lan_host].has_key(domain): ` `log_entries[lan_host][domain]= categories` – Thinkwell Jun 05 '12 at 15:09

1 Answer


I'm not familiar with the tools you're using, but when you create the dictionary outside of the loop, you're just creating one dictionary.

if not log_entries[lan_host].has_key(domain):
    log_entries[lan_host][domain] = categories

This code just makes log_entries[lan_host][domain] point to that single dictionary. Assignment in Python binds a key to an existing object; it doesn't copy the values or anything like that. So both of these lines refer to the same dictionary:

log_entries['192.168.5.210']['google.com']
log_entries['192.168.5.210']['webtrendslive.com']
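
The aliasing can be reproduced in a few lines. This is only a minimal sketch, written for Python 3 (the question's code is Python 2): plain dicts stand in for the AutoVivification class, and the trimmed-down `categories` dict is an assumption made to keep the example short.

```python
# Simplified stand-in for the question's data; plain dicts replace AutoVivification.
categories = {'requests': {'Content_Text': 0}, 'filter_action': {'re': 0}}

log_entries = {'192.168.5.210': {}}
host = log_entries['192.168.5.210']

# Both keys are bound to the one dictionary object; nothing is copied.
host['google.com'] = categories
host['webtrendslive.com'] = categories

# Incrementing through one key...
host['google.com']['requests']['Content_Text'] += 1

# ...is visible through the other, because they are the same object.
same_object = host['google.com'] is host['webtrendslive.com']
shared_count = host['webtrendslive.com']['requests']['Content_Text']
print(same_object, shared_count)  # True 1
```

Every domain ends up sharing one set of counters, which would explain why all of the printed totals in the question are identical.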

P.S. I can't say for sure, but my gut says that not wanting to initialize a new dictionary for performance is probably excessive.
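
On the performance worry, one possible approach (a sketch, not a measured result) is to build the dictionary only when a domain is seen for the first time, rather than on every pass through the loop. The helper name `new_categories` and the sample data are hypothetical; the code is Python 3 and uses a plain dict in place of AutoVivification.

```python
def new_categories():
    """Hypothetical helper: return a fresh, independent counters dict on every call."""
    return {
        'requests': {'Content_Visual': 0, 'Content_ProgramsUpdates': 0,
                     'Content_Text': 0, 'Pages': 0, 'Content_Files': 0},
        'filter_action': {'re': 0, 'pl': 0, 'bs': 0},
    }

log_entries = {}

# Made-up sample data: (domain, category) pairs standing in for parsed log lines.
sample = [('google.com', 'Content_Text'),
          ('osnews.com', 'Content_Text'),
          ('google.com', 'Content_Text')]

for domain, category in sample:
    # The dict is built only for a domain that has not been seen before.
    if domain not in log_entries:
        log_entries[domain] = new_categories()
    log_entries[domain]['requests'][category] += 1

print(log_entries['google.com']['requests']['Content_Text'])  # 2
print(log_entries['osnews.com']['requests']['Content_Text'])  # 1
```

With this shape, the cost of creating the dict is paid once per distinct domain instead of once per log line, and each domain still gets its own counters.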

Justin Blank
  • Hmm, Um, Okay... I'm wanting to extend the `log_entries[lan_host][domain]` dict to include the structure of the `categories` dictionary, and then manipulate the `log_entries` instance. How would I change the code to create that? (The Autovivification function simply automatically adds keys if they didn't exist. I thought it was adding the `categories` structure instead of referencing the variable). – Thinkwell Jun 05 '12 at 15:18
  • Ok, when I change: `log_entries[lan_host][domain]= categories` To: `log_entries[lan_host][domain]= {'requests': {'Content_Visual': 0, 'Content_ProgramsUpdates': 0, 'Content_Text': 0, 'Pages': 0, 'Content_Files': 0}, 'filter_action': {'re': 0, 'pl': 0, 'bs': 0}}` Then it works. So apparently my code was referencing the `categories` variable instead of extending the dict. I need to use the variable syntax instead of the dictionary, however. I guess I need to experiment with syntax to do that. – Thinkwell Jun 05 '12 at 15:24
  • I should have been clearer. If you create the dictionary inside the loop, it will create a new dictionary each time through. Each log_entries[lan_host][domain] then is unique. That's why your code works correctly when you do it that way. – Justin Blank Jun 05 '12 at 15:27
  • Yes, I'm following you. Thanks for your help so far; the problem becomes clearer. What I need to do is extend the log_entries[lan_host][domain] to include the dict structure that `categories` represents - not reference `categories` itself. Can I do that without typing out the complete dict syntax? – Thinkwell Jun 05 '12 at 15:43
  • If you want to copy the keys and values, you could use [update](http://docs.python.org/library/stdtypes.html#dict.update). – Justin Blank Jun 05 '12 at 15:50
  • Thanks a million. I'm off to the races. Did end up using [fromkeys](http://docs.python.org/library/stdtypes.html#dict.fromkeys) instead of update so I can force default values! – Thinkwell Jun 05 '12 at 16:53
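
One caveat about the options mentioned in these comments, shown as a hedged Python 3 sketch: `dict.update` copies only the top level, so nested dicts such as `'requests'` would still be shared, whereas `copy.deepcopy` duplicates the nested levels too. `dict.fromkeys` builds a flat dict with one default value per key. The `template` name and its trimmed contents are assumptions for illustration, not the asker's final code.

```python
import copy

# Trimmed-down template standing in for the question's categories dict.
template = {'requests': {'Content_Text': 0, 'Pages': 0},
            'filter_action': {'re': 0, 'pl': 0, 'bs': 0}}

# dict.update copies only the top-level keys: the inner dicts are still shared.
shallow = {}
shallow.update(template)
shallow_shares_inner = shallow['requests'] is template['requests']

# copy.deepcopy duplicates the nested dicts as well, so the template stays untouched.
deep = copy.deepcopy(template)
deep['requests']['Content_Text'] += 1

# dict.fromkeys gives every key the same default; safe with an immutable default such as 0.
flat = dict.fromkeys(['Content_Text', 'Pages'], 0)
flat['Pages'] += 1

print(shallow_shares_inner)                   # True
print(template['requests']['Content_Text'])   # 0
print(flat)                                   # {'Content_Text': 0, 'Pages': 1}
```

For a nested structure like the one in the question, a deep copy (or building each inner dict separately with `fromkeys`) is what keeps the per-domain counters independent.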