I have 2 lists of dictionaries (dictreaders) that look something like this:
Name1
[{'City' :'San Francisco', 'Name':'Suzan', 'id_number' : '1567', 'Street': 'Pearl'},
{'City' :'Boston', 'Name':'Fred', 'id_number' : '1568', 'Street': 'Pine'},
{'City' :'Chicago', 'Name':'Lizzy', 'id_number' : '1569', 'Street': 'Spruce'},
{'City' :'Denver', 'Name':'Bob', 'id_number' : '1570', 'Street': 'Spruce'},
{'City' :'Chicago', 'Name':'Bob', 'id_number' : '1571', 'Street': 'Spruce'},
{'City' :'Boston', 'Name':'Bob', 'id_number' : '1572', 'Street': 'Canyon'},
{'City' :'Boulder', 'Name':'Diana', 'id_number' : '1573', 'Street': 'Violet'},
{'City' :'Detroit', 'Name':'Bill', 'id_number' : '1574', 'Street': 'Grape'}]
and
Name2
[{'City' :'San Francisco', 'Name':'Szn', 'id_number' : '1567', 'Street': 'Pearl'},
{'City' :'Boston', 'Name':'Frd', 'id_number' : '1578', 'Street': 'Pine'},
{'City' :'Chicago', 'Name':'Lizy', 'id_number' : '1579', 'Street': 'Spruce'},
{'City' :'Denver', 'Name':'Bobby', 'id_number' : '1580', 'Street': 'Spruce'},
{'City' :'Chicago', 'Name':'Bob', 'id_number' : '1580', 'Street': 'Spruce'},
{'City' :'Boston', 'Name':'Bob', 'id_number' : '1580', 'Street': 'Walnut'}]
As you can see, the names in the second list are spelled differently from those in the first, but a few are nearly the same. I'd like to use fuzzy string matching to match these up, and to narrow it down so that I only compare names that are in the same city and on the same street. Currently I'm running a for loop that looks like this:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import izip_longest
import csv
name1_file = 'name1_file.csv'
name2_file = 'name2_file.csv'
name1 = csv.DictReader(open(name1_file, 'rb'), delimiter=',', quotechar='"')
score_75_plus = []
name1_name =[]
name2_name =[]
name1_city = []
name2_city = []
name1_street = []
name2_street = []
name1_id = []
name2_id = []
for line in name1:
    # Re-open the second file on every pass, because a DictReader can only be iterated once
    name2 = csv.DictReader(open(name2_file, 'rb'), delimiter=',', quotechar='"')
    for line2 in name2:
        # Only score names that share both city and street
        if line['City'] == line2['City'] and line['Street'] == line2['Street']:
            partial_ratio = fuzz.partial_ratio(line['Name'], line2['Name'])
            if partial_ratio > 75:
                name1_name.append(line['Name'])
                name1_city.append(line['City'])
                name1_street.append(line['Street'])
                name2_name.append(line2['Name'])
                name2_city.append(line2['City'])
                name2_street.append(line2['Street'])
                score_75_plus.append(partial_ratio)
                name1_id.append(line['id_number'])
                name2_id.append(line2['id_number'])
big_test= zip(name1_name, name1_city, name1_street, name1_id, name2_name, name2_city, name2_street, name2_id, score_75_plus)
writer=csv.writer(open('big_test.csv', 'wb'))
writer.writerows(big_test)
However, since my files are quite large, I think it's going to take quite some time, days perhaps. I'd like to make it more efficient but haven't figured out how. So far my thinking is to restructure the lists into nested dictionaries, to reduce the amount of data it has to loop through when checking whether the city and street are the same. I'm envisioning something like this:
{'San Francisco' :
    {'Pearl': [
        {'City' :'San Francisco', 'Name':'Szn', 'id_number' : '1567', 'Street': 'Pearl'}]},
 'Boston' :
    {'Pine': [
        {'City' :'Boston', 'Name':'Frd', 'id_number' : '1578', 'Street': 'Pine'}],
     'Canyon': [
        {'City' :'Boston', 'Name':'Bob', 'id_number' : '1572', 'Street': 'Canyon'}]},
 'Chicago' :
    {'Spruce': [
        {'City' :'Chicago', 'Name':'Lizzy', 'id_number' : '1569', 'Street': 'Spruce'},
        {'City' :'Chicago', 'Name':'Bob', 'id_number' : '1571', 'Street': 'Spruce'}]},
 'Denver' :
    {'Spruce': [
        {'City' :'Denver', 'Name':'Bob', 'id_number' : '1570', 'Street': 'Spruce'}]},
 'Boulder':
    {'Violet': [
        {'City' :'Boulder', 'Name':'Diana', 'id_number' : '1573', 'Street': 'Violet'}]},
 'Detroit':
    {'Grape': [
        {'City' :'Detroit', 'Name':'Bill', 'id_number' : '1574', 'Street': 'Grape'}]}}
This way it would only have to look through the distinct cities, and the distinct streets within each city, to decide whether to apply fuzz.partial_ratio. I used defaultdict to split it up by city but haven't been able to apply it again for streets.
from collections import defaultdict
city_dictionary = defaultdict(list)
for line in name1:
    city_dictionary[line['City']].append(line)
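A minimal sketch of applying defaultdict a second time, assuming each row is a plain dict with 'City' and 'Street' keys as in the samples above. The names `rows`, `nested` and `chicago_spruce` and the three inline rows are made up for illustration; this is one possible layout, not a tested solution for the real files.

```python
from collections import defaultdict

# Illustrative rows standing in for what csv.DictReader would yield.
rows = [
    {'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
    {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'},
    {'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'},
]

# Outer key is the city, inner key is the street, each leaf is a list of rows.
nested = defaultdict(lambda: defaultdict(list))
for row in rows:
    nested[row['City']][row['Street']].append(row)

# All rows on Spruce in Chicago, found without scanning the other cities.
chicago_spruce = nested['Chicago']['Spruce']
```

Indexing a missing city or street with [] creates an empty entry as a side effect; nested.get(city, {}).get(street, []) looks things up without doing that.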
I've looked at this answer but didn't understand how to implement it.
Sorry for so much detail. I'm not totally sure nested dictionaries are the way to go, so I thought I would present the big picture.
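For the big picture, here is a rough sketch of how the comparison might look once the second list is grouped. It uses a flat dictionary keyed by a (City, Street) tuple instead of two nested levels, which is an alternative to the layout above rather than the same thing. The function names `match_names` and `stand_in_score` are hypothetical, and difflib.SequenceMatcher is only a standard-library stand-in so the sketch runs without fuzzywuzzy; its scores will not be identical to fuzz.partial_ratio.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def stand_in_score(a, b):
    # Stand-in scorer (assumption): a 0-100 similarity from the standard library.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def match_names(rows1, rows2, scorer=stand_in_score, threshold=75):
    # Index the second list once, keyed by (City, Street).
    index = defaultdict(list)
    for row2 in rows2:
        index[(row2['City'], row2['Street'])].append(row2)

    matches = []
    for row1 in rows1:
        # Only rows sharing this city and street are scored.
        for row2 in index.get((row1['City'], row1['Street']), []):
            score = scorer(row1['Name'], row2['Name'])
            if score > threshold:
                matches.append((row1['Name'], row1['id_number'],
                                row2['Name'], row2['id_number'], score))
    return matches


rows1 = [
    {'City': 'Chicago', 'Name': 'Lizzy', 'id_number': '1569', 'Street': 'Spruce'},
    {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1571', 'Street': 'Spruce'},
    {'City': 'Boston', 'Name': 'Bob', 'id_number': '1572', 'Street': 'Canyon'},
]
rows2 = [
    {'City': 'Chicago', 'Name': 'Lizy', 'id_number': '1579', 'Street': 'Spruce'},
    {'City': 'Chicago', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Spruce'},
    {'City': 'Boston', 'Name': 'Bob', 'id_number': '1580', 'Street': 'Walnut'},
]
matches = match_names(rows1, rows2)
```

Passing scorer=fuzz.partial_ratio should keep the original scoring. The second list is walked once to build the index, and each row of the first list is then compared only with rows in its own (City, Street) bucket, so the Boston rows on different streets are never scored against each other.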