I have a question that is somewhat high level, so I'll try to be as specific as possible.

I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.
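For concreteness, the kind of simple baseline I've been using looks roughly like this (just a sketch; the cleaning rules, example names, and similarity cutoff are all illustrative):

    import re
    from difflib import SequenceMatcher  # stdlib stand-in for a Levenshtein-style ratio

    def clean(name):
        """Crude normalization: lowercase, strip punctuation and a few common suffixes."""
        name = re.sub(r"[^\w\s]", " ", name.lower())
        name = re.sub(r"\b(inc|incorporated|corp|corporation|co|ltd|llc)\b", " ", name)
        return " ".join(name.split())

    def name_similarity(a, b):
        """Similarity of the cleaned names, in [0, 1]."""
        return SequenceMatcher(None, clean(a), clean(b)).ratio()

    left = ["Acme Corp.", "Globex, Inc."]
    right = ["ACME Corporation", "Globex Incorporated", "Initech LLC"]
    for a in left:
        for b in right:
            print(a, "|", b, "->", round(name_similarity(a, b), 2))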

My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.

My question is the following:

  • Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?

  • I'd prefer something that had practical examples and a well defined approach.

  • The approach could be iterative, with human checks for improvement at intermediate stages.

  • (edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".

  • If they were in Python that would be fantastic, but not necessary.

One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.

WildGunman
  • How important is it to you to avoid erroneous matches/false positives? – Tim Mar 08 '11 at 20:02
  • Good question. I'd say _fairly_ important. I do the matching to perform statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches". Ideally, I could adjust the aggressiveness. – WildGunman Mar 08 '11 at 20:19

2 Answers


One common method that shouldn't be terribly expensive for "a few thousand records" would be cosine similarity. Although it's most often used for comparing text documents, you can easily modify it to work with any kind of data.

The linked Wikipedia article is pretty sparse on details, but following links and doing a few searches will get you some good information, and potentially an implementation that you can modify to fit your purposes. In fact, take a look at Simple implementation of N-Gram, tf-idf and Cosine similarity in Python.
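To make that concrete, here is a minimal sketch of cosine similarity over character trigram counts of two name strings (the trigram size and the example names are just illustrative choices; tf-idf weighting over the whole corpus would be the natural refinement):

    from collections import Counter
    from math import sqrt

    def trigrams(s):
        """Counts of overlapping 3-character substrings of a lowercased string."""
        s = s.lower()
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    def cosine_similarity(a, b):
        """Cosine of the angle between the trigram-count vectors of two strings."""
        va, vb = trigrams(a), trigrams(b)
        dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
        norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    print(cosine_similarity("Acme Widget Company", "ACME Widgets Co."))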

A simpler calculation, and one that might be "good enough" for your purposes, would be the Jaccard index. The primary difference is that cosine similarity typically takes into account the number of times a word is used in a document and in the entire set of documents, whereas the Jaccard index only cares whether a particular word appears in the document at all. There are other differences, but that one strikes me as the most important.
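For comparison, a Jaccard index over word tokens needs almost no machinery (the word-level tokenization is the illustrative choice here):

    def jaccard(a, b):
        """Jaccard index over word tokens: |A intersect B| / |A union B|."""
        sa, sb = set(a.lower().split()), set(b.lower().split())
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    print(jaccard("Acme Widget Company", "acme widget co"))  # 2 shared of 4 distinct tokens -> 0.5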

Jim Mischel

The problem is that you have an array of distances, at least one for each column, and you want to combine those distances in an optimal way to indicate whether a pair of records refers to the same entity or not.

This is a classification problem; there are many ways to approach it, but logistic regression is one of the simpler methods. To train a classifier, you will need to label some pairs of records as matches or non-matches.
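A rough sketch of that idea with scikit-learn, where each candidate pair is represented by a few per-field distances and the labels come from hand review (the field choices and toy numbers are purely illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # One row per candidate pair: [name distance, city distance, relative size difference],
    # each scaled to [0, 1].
    X_train = np.array([
        [0.05, 0.0, 0.1],   # close on every field
        [0.10, 0.0, 0.3],
        [0.70, 0.5, 0.9],   # far apart on every field
        [0.60, 1.0, 0.8],
    ])
    y_train = np.array([1, 1, 0, 0])  # 1 = hand-labeled match, 0 = non-match

    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Score new candidate pairs; raising the cutoff trades missed matches for fewer false matches.
    X_new = np.array([[0.15, 0.0, 0.2], [0.55, 0.5, 0.7]])
    print(clf.predict_proba(X_new)[:, 1])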

The dedupe Python library helps you do this and the other parts of the difficult task of record linkage. Its documentation has a pretty good overview of how to approach the problem of record linkage comprehensively.
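A rough outline of what linking two data sets with it looks like, based on my understanding of the dedupe 2.x API (the field definitions and method names may differ between versions, so treat this as a sketch and check the documentation; real use needs far more records and labels than the toy data here):

    import dedupe

    # Each data set is a dict mapping a record id to a dict of field values.
    data_1 = {"a1": {"name": "Acme Corp", "city": "Chicago"},
              "a2": {"name": "Globex Inc", "city": "Springfield"}}
    data_2 = {"b1": {"name": "ACME Corporation", "city": "Chicago"},
              "b2": {"name": "Initech", "city": None}}

    fields = [
        {"field": "name", "type": "String"},
        {"field": "city", "type": "String", "has missing": True},
    ]

    linker = dedupe.RecordLink(fields)
    linker.prepare_training(data_1, data_2)
    dedupe.console_label(linker)   # interactive yes/no labeling of candidate pairs
    linker.train()

    # Pairs scored above the threshold; raise it for precision, lower it for recall.
    # (Older versions call this method match rather than join.)
    linked = linker.join(data_1, data_2, threshold=0.5)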

fgregg