Questions tagged [fuzzy-comparison]

Fuzzy comparison is the colloquial name for approximate string matching, the technique of finding strings that match a pattern approximately (rather than exactly). The problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately.
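
As a rough illustration of the two sub-problems, here is a minimal sketch in plain Python. It assumes Levenshtein (edit) distance as the similarity measure, which is only one of several possible choices, and the function names `edit_distance`, `dictionary_matches` and `substring_match_ends` are hypothetical helpers made up for this example, not part of any library.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def dictionary_matches(pattern: str, words: list[str], max_dist: int) -> list[str]:
    """Sub-problem 2: dictionary words within max_dist edits of the pattern."""
    return [w for w in words if edit_distance(pattern, w) <= max_dist]


def substring_match_ends(pattern: str, text: str, max_dist: int) -> list[int]:
    """Sub-problem 1: end offsets (exclusive) in text where some substring
    ending there is within max_dist edits of the pattern.

    Same dynamic programme as above, except that the cost of the empty
    pattern prefix stays 0 so that a match may start anywhere in the text.
    """
    prev = list(range(len(pattern) + 1))  # column for the empty text
    ends = []
    for j, ct in enumerate(text, start=1):
        curr = [0]  # a match can begin at any text position for free
        for i, cp in enumerate(pattern, start=1):
            curr.append(min(prev[i] + 1,
                            curr[i - 1] + 1,
                            prev[i - 1] + (cp != ct)))
        if curr[-1] <= max_dist:
            ends.append(j)
        prev = curr
    return ends
```

For example, `dictionary_matches("chcago", ["chicago", "boston"], 1)` keeps only `"chicago"`. Comparing the pattern against every dictionary word this way is linear in the dictionary size, so large dictionaries usually call for an index on top of it.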


307 questions
9
votes
4 answers

SQL and fuzzy comparison

Let's assume we have a table People (name, surname, address, SSN, etc.). We want to find all rows that are "very similar" to a specified person A. I would like to implement some kind of fuzzy-logic comparison of A against all rows of the People table.…
running.t
8
votes
2 answers

fuzzy join with stringdist_join() in R, Error: NAs are not allowed in subscripted assignments

First of all, I am sorry if my formatting is bad; this is my first time posting (I am also new to programming and R). I am trying to merge two data frames on string variables. I am merging university names, which might not match up perfectly, so I…
Brian
8
votes
3 answers

How to group / compare similar news articles

In an app that I'm creating, I want to add functionality that groups news stories together. I want to group news stories about the same topic from different sources into the same group. For example, an article on XYZ from CNN and MSNBC would be in…
Randy
8
votes
4 answers

Canonical URL compare in Python?

Are there any tools for comparing URLs in Python? For example, if I have http://google.com and google.com/, I'd like to know that they are likely to be the same site. If I were to construct a rule manually, I might uppercase it, then strip off the…
Colin Davis
8
votes
0 answers

Fuzzy merging in R - seeking help to improve my code

Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (string-distance) matching. The merging job I have to do is quite big (resulting in multiple string distance…
8
votes
2 answers

Comparing (similar) images with Python/PIL

I'm trying to calculate the similarity (read: Levenshtein distance) of two images, using Python 2.6 and PIL. I plan to use the python-levenshtein library for fast comparison. Main question: What is a good strategy for comparing images? My idea is…
Attila O.
7
votes
3 answers

How can I find the best fit subsequences of a large string?

Say I have one large string and an array of substrings that, when joined, equal the large string (with small differences). For example (note the subtle differences between the strings): large_str = "hello, this is a long string, that may be made up of…
Josh Voigts
7
votes
1 answer

How to perform a fuzzy join with fuzzyjoin::difference_* in R

I'm working with two different datasets that I want to merge based on a threshold. Let's say the two dataframes look like this: library(dplyr) library(fuzzyjoin) library(lubridate) df1 = data_frame(Item=1:5, DateTime=c("2015-01-01…
brittenb
7
votes
3 answers

How to merge two pandas DataFrames based on a similarity function?

Given dataset 1 (name,x,y): "st. peter,1,2" and "big university portland,3,4", and dataset 2 (name,x,y): "saint peter,3,4" and "uni portland,5,6". The goal is to merge with d1.merge(d2, on="name", how="left"). There are no exact matches on name, though. So I'm looking to do…
PascalVKooten
7
votes
3 answers

Fast way to match strings with typo

I have a huge list of strings (city names) and I want to find the name of a city even if the user makes a typo. Example: the user types "chcago" and the system finds "Chicago". Of course I could calculate the Levenshtein distance of the query for all…
user2033412
7
votes
1 answer

Generate "fuzzy" difference of two files in Python, with approximate comparison of floats

I have an issue comparing two files. Basically, what I want to do is a UNIX-like diff between two files, for example: $ diff -u left-file right-file. However, my two files contain floats, and because these files were generated on distinct…
6
votes
1 answer

Merge dataframes on multiple columns with fuzzy match in Python

I have two example dataframes as follows: df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'}, 'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'}, 'Age': {0: 27, 1: 23, 2: 21}}) df2 =…
ah bon
6
votes
2 answers

Fuzzy record matching with multiple columns of information

I have a question that is somewhat high level, so I'll try to be as specific as possible. I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a…
6
votes
4 answers

SQL Fuzzy Join - MSSQL

I have two sets of data: existing customers and potential customers. My main objective is to figure out if any of the potential customers are already existing customers. However, the naming conventions of customers across data sets are…
hansolo
6
votes
1 answer

The best way to search millions of fuzzy hashes

I have the spamsum composite hashes for about ten million files in a database table and I would like to find the files that are reasonably similar to each other. Spamsum hashes are composed of two CTPH hashes of maximum 64 bytes and they look like…
retrography