
I have 2 lists of dictionaries (dictreaders) that look something like this:

Name1

[{'City': 'San Francisco', 'Name': 'Suzan', 'id_number': '1567', 'Street': 'Pearl'},
 {'City': 'Boston', 'Name': 'Fred', 'id_number': '1568', 'Street': 'Pine'},
 {'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
 {'City': 'Denver', 'Name': 'Bob', 'id_number': '1570', 'Street': 'Spruce'},
 {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'},
 {'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'},
 {'City': 'Boulder', 'Name': 'Diana', 'id_number': '1573', 'Street': 'Violet'},
 {'City': 'Detroit', 'Name': 'Bill', 'id_number': '1574', 'Street': 'Grape'}]

and

Name2

[{'City': 'San Francisco', 'Name': 'Szn', 'id_number': '1567', 'Street': 'Pearl'},
 {'City': 'Boston', 'Name': 'Frd', 'id_number': '1578', 'Street': 'Pine'},
 {'City': 'Chicago', 'Name': 'Lizy', 'id_number': '1579', 'Street': 'Spruce'},
 {'City': 'Denver', 'Name': 'Bobby', 'id_number': '1580', 'Street': 'Spruce'},
 {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Spruce'},
 {'City': 'Boston', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Walnut'}]

If you notice, the names in the second chunk are spelled differently than in the first chunk, but a few are nearly the same. I'd like to use fuzzy string matching to match these up. I'd also like to narrow the comparison so I'm only comparing names that are in the same city and on the same street. Currently I'm running a for loop that looks like this:

from fuzzywuzzy import fuzz
import csv

name1_file = 'name1_file.csv'
name2_file = 'name2_file.csv'

name1 = csv.DictReader(open(name1_file, 'rb'), delimiter=',', quotechar='"')

score_75_plus = []
name1_name = []
name2_name = []
name1_city = []
name2_city = []
name1_street = []
name2_street = []
name1_id = []
name2_id = []

for line in name1:
    # re-opens (and re-reads) the second file for every row of the first
    name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
    for line2 in name2:
        if line['City'] == line2['City'] and line['Street'] == line2['Street']:
            partial_ratio = fuzz.partial_ratio(line['Name'], line2['Name'])
            if partial_ratio > 75:
                name1_name.append(line['Name'])
                name1_city.append(line['City'])
                name1_street.append(line['Street'])
                name2_name.append(line2['Name'])
                name2_city.append(line2['City'])
                name2_street.append(line2['Street'])
                score_75_plus.append(partial_ratio)
                name1_id.append(line['id_number'])
                name2_id.append(line2['id_number'])

big_test = zip(name1_name, name1_city, name1_street, name1_id,
               name2_name, name2_city, name2_street, name2_id, score_75_plus)
writer = csv.writer(open('big_test.csv', 'wb'))
writer.writerows(big_test)

However, since my files are quite large, I think it's going to take quite some time... days perhaps. I'd like to make it more efficient but haven't figured out how. So far my thinking is to restructure the dictionaries into nested dictionaries to lessen the amount of data it has to loop through when checking whether the city and street are the same. I'm envisioning something like this:

{'San Francisco':
    {'Pearl':
        [{'City': 'San Francisco', 'Name': 'Szn', 'id_number': '1567', 'Street': 'Pearl'}]},
 'Boston':
    {'Pine':
        [{'City': 'Boston', 'Name': 'Frd', 'id_number': '1578', 'Street': 'Pine'}],
     'Canyon':
        [{'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'}]},
 'Chicago':
    {'Spruce':
        [{'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
         {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'}]},
 'Denver':
    {'Spruce':
        [{'City': 'Denver', 'Name': 'Bob', 'id_number': '1570', 'Street': 'Spruce'}]},
 'Boulder':
    {'Violet':
        [{'City': 'Boulder', 'Name': 'Diana', 'id_number': '1573', 'Street': 'Violet'}]},
 'Detroit':
    {'Grape':
        [{'City': 'Detroit', 'Name': 'Bill', 'id_number': '1574', 'Street': 'Grape'}]}}

This way it would only have to look through the distinct cities, and the distinct streets within each city, to decide whether to apply fuzz.partial_ratio. I used defaultdict to split it up by city but haven't been able to apply it again for streets:

from collections import defaultdict

city_dictionary = defaultdict(list)

for line in name1:
    city_dictionary[line['City']].append(line)
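
For example, I think keying a single defaultdict on a (City, Street) tuple would give the same effect without true nesting. A sketch of what I mean (not fully tested):

from collections import defaultdict
from fuzzywuzzy import fuzz
import csv

# index the second file by (City, Street) so each row of the first file
# is only compared against rows sharing its city and street
name2_index = defaultdict(list)
for row in csv.DictReader(open('name2_file.csv', 'rb'), delimiter=',', quotechar='"'):
    name2_index[(row['City'], row['Street'])].append(row)

matches = []
for line in csv.DictReader(open('name1_file.csv', 'rb'), delimiter=',', quotechar='"'):
    for line2 in name2_index[(line['City'], line['Street'])]:
        partial_ratio = fuzz.partial_ratio(line['Name'], line2['Name'])
        if partial_ratio > 75:
            matches.append((line['Name'], line2['Name'], partial_ratio))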

I've looked at this answer but didn't understand how to implement it.

Sorry for so much detail; I'm not totally sure nested dictionaries are the way to go, so I thought I would present the big picture.


1 Answer


Several things you can do:

  1. Don't re-open the second file for every line in the first CSV file. Use a data structure (probably a list) to store all the information and use it in-memory.
  2. Profile your code to see where it spends most of its time (see the profiling sketch after this list).
  3. If most of the time is spent on CPU, use the multiprocessing module to use all cores on your machine, since the tasks here seem context-free.
  4. If most of the time is spent reading files (I/O), use the threading module to do some processing while reading files. You may want to break your files into smaller chunks.
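
A profiling run might look like this (a sketch; it assumes the matching code is wrapped in a hypothetical main() function):

import cProfile
import pstats

# run the matching code under the profiler and dump the stats to a file
cProfile.run('main()', 'match_stats')

# show the 20 most expensive calls, sorted by cumulative time
pstats.Stats('match_stats').sort_stats('cumulative').print_stats(20)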

If this does not help, I can try to add more suggestions based on your code.

Edit:

Example of reading the second file once, instead of re-reading it for each line of the first file:

# read file once before the loop
file_2_dicts = list(csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"'))
for line in name1:
    # remove old read and use in-memory dicts from first file
    # name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
    name2 = file_2_dicts
    for line2 in name2:
        ...
        ...

cProfile results from a test run, for functions called more than once (see the comments below):

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4550    0.055    0.000    0.066    0.000 csv.py:100(next)
9098    0.006    0.000    0.006    0.000 csv.py:86(fieldnames)
4497409    3.845    0.000   54.221    0.000 difflib.py:154(__init__)
4497409    3.678    0.000   50.377    0.000 difflib.py:223(set_seqs)
4497409    3.471    0.000    3.471    0.000 difflib.py:235(set_seq1)
4497409    3.695    0.000   43.228    0.000 difflib.py:261(set_seq2)
4497409   29.130    0.000   39.533    0.000 difflib.py:306(__chain_b)
13356323   78.759    0.000  100.599    0.000 difflib.py:350(find_longest_match)
3123327    1.398    0.000    1.398    0.000 difflib.py:41(_calculate_ratio)
4497409   36.080    0.000  164.628    0.000 difflib.py:460(get_matching_blocks)
3123327    7.450    0.000  128.167    0.000 difflib.py:636(ratio)
7500936    1.673    0.000    1.673    0.000 difflib.py:658(<lambda>)
1374082   16.978    0.000  252.893    0.000 fuzz.py:57(partial_ratio)
1374082    1.172    0.000    1.647    0.000 utils.py:42(make_type_consistent)
3123327    2.587    0.000    4.260    0.000 {_functools.reduce}
23980904    7.633    0.000    7.633    0.000 {built-in method __new__ of type object at 0x100185f40}
4497410    6.525    0.000   16.009    0.000 {map}
1373764    0.496    0.000    0.496    0.000 {max}
32176130    3.231    0.000    3.231    0.000 {method '__contains__' of 'set' objects}
61813598    9.676    0.000    9.676    0.000 {method 'append' of 'list' objects}
72656176    7.728    0.000    7.728    0.000 {method 'get' of 'dict' objects}
13356323    5.311    0.000    5.311    0.000 {method 'pop' of 'list' objects}
33073067    4.927    0.000    4.927    0.000 {method 'setdefault' of 'dict' objects}
4497409    1.568    0.000    1.568    0.000 {method 'sort' of 'list' objects}
  • So the reason the DictReader for the second file is within the first for loop is [link](http://stackoverflow.com/questions/27773403/looping-through-and-comparing-lines-of-two-unequal-length-dictionaries). The files are unequal in length and this was the only solution I found to compare all the names in each file. Currently the second file is the shorter one; would it be quicker to reverse that order? I've never used 2, 3, and 4 so I'll look into those now. Thank you! – Allison Madigan Jan 09 '15 at 16:13
  • Also with regard to saving it in a list, I need to output a file that also has the object id for use later. – Allison Madigan Jan 09 '15 at 16:15
  • @AllisonMadigan I assume you mean `name2` instead of `pnodes` in your code. If that's true, what's wrong with saving a list containing all dicts in `pnodes` and then using the list instead of re-opening the file on each iteration? If you fix the naming in your code I can post an example – Reut Sharabani Jan 09 '15 at 18:05
  • I think I fixed all the naming, sorry about that, I thought I had gotten everything before! An example would be great; I think I know what you mean but I'm not positive. – Allison Madigan Jan 09 '15 at 19:34
  • Thanks! I just started it, hopefully there will be results soon! – Allison Madigan Jan 09 '15 at 20:24
  • @AllisonMadigan , how big are your files? – Reut Sharabani Jan 09 '15 at 20:26
  • file_1 is 5.9 MB for a total of 146,682 rows and 4 columns. file_2 is 2 MB for a total of 42,259 rows and 6 columns. – Allison Madigan Jan 12 '15 at 15:46
  • I've been running tests on smaller sets before I start the big one, to make sure the rest of my code is good. But it has been taking a few minutes to run on file_1_test at 55 KB and file_2_test at 138 KB. In the end it doesn't need to go super fast because I really only need to run it once, but I'm testing on smaller data sets to avoid waiting a day to find out I forgot some silly little thing. – Allison Madigan Jan 12 '15 at 16:56
  • @AllisonMadigan Good idea. On such small sets it should take no more than a few seconds. Can you do some profiling to see what takes most of the time? (For example, time each iteration.) – Reut Sharabani Jan 12 '15 at 17:00
  • I ran cProfile and posted the results as an edit, for functions that were called more than once. Unfortunately, I think the slowness is related to the calls to difflib, which fuzzywuzzy, the module I'm using, relies on heavily... – Allison Madigan Jan 12 '15 at 18:13
  • @AllisonMadigan do you have a lot of repeated calls? Are the ratios symmetric, i.e. `ratio("Alice", "Bob") == ratio("Bob", "Alice")`? If so, you can memoize and get a speed boost. – Reut Sharabani Jan 12 '15 at 22:42
  • The ratios are symmetric, but looking at the results it doesn't look like it's checking both ways. Is it easy to memoize? – Allison Madigan Jan 13 '15 at 17:39
  • Yes. You need a dictionary mapping from the arguments to the value. I'm on my mobile but I can add it to the answer later – Reut Sharabani Jan 13 '15 at 17:56
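
For reference, a minimal sketch of that memoization idea (a hypothetical cached_partial_ratio wrapper; a dictionary maps each argument pair to its score, with the key order normalized since, as discussed above, the score is symmetric):

from fuzzywuzzy import fuzz

_score_cache = {}

def cached_partial_ratio(a, b):
    # normalize key order so (a, b) and (b, a) share one cache entry
    key = (a, b) if a <= b else (b, a)
    if key not in _score_cache:
        _score_cache[key] = fuzz.partial_ratio(a, b)
    return _score_cache[key]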