Questions tagged [fuzzy-comparison]

Fuzzy comparison is the colloquial name for approximate string matching, the technique of finding strings that match a pattern approximately rather than exactly. The problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately.
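
As a rough illustration of the two sub-problems, the sketch below uses plain Levenshtein (edit) distance as the similarity measure. That choice is an assumption made for the example; other measures such as Damerau–Levenshtein or phonetic codes are equally valid. The function names `edit_distance`, `dictionary_matches`, and `best_substring_distance` are hypothetical helpers, not part of any library.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def dictionary_matches(pattern: str, words: list[str], max_dist: int) -> list[str]:
    """Sub-problem 2: dictionary strings within max_dist edits of pattern."""
    return [w for w in words if edit_distance(pattern, w) <= max_dist]


def best_substring_distance(pattern: str, text: str) -> int:
    """Sub-problem 1: smallest edit distance between pattern and any
    substring of text. Same recurrence as above, except the first row is
    all zeros so that a match may start anywhere in text at no cost."""
    prev = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, start=1):
        curr = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cp == ct else 1
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + cost))
        prev = curr
    return min(prev)  # a match may also end anywhere in text


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(dictionary_matches("aple", ["apple", "maple", "grape"], max_dist=1))
    print(best_substring_distance("yark", "new york city"))
```

Both functions run in time proportional to the product of the two string lengths, so for large dictionaries or long texts an index (n-grams, BK-trees) or a dedicated library is usually the more practical choice.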


307 questions
71
votes
4 answers

Fuzzy String Comparison

What I am striving to complete is a program which reads in a file and compares each sentence against the original sentence. The sentence which is a perfect match to the original will receive a score of 1 and a sentence which is the total…
jacksonstephenc
  • 711
  • 1
  • 6
  • 3
49
votes
4 answers

Techniques for finding near duplicate records

I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!". My plan was to…
Richie Cotton
  • 107,354
  • 40
  • 225
  • 343
47
votes
6 answers

Fuzzy Regular Expressions

In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance, with great results, to make my code less vulnerable to spelling mistakes. Now I need to match strings against simple regular expressions such TV…
Thomas Ahle
  • 28,005
  • 19
  • 77
  • 105
34
votes
7 answers

How can I match fuzzy match strings from two datasets?

I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had…
A L
  • 433
  • 1
  • 6
  • 7
29
votes
10 answers

Fuzzy regular expressions

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend any way to do this that would be helpful. As an example, I want to match a string on the words "New York" preceded by a 2-digit…
itzy
  • 9,217
  • 13
  • 48
  • 88
21
votes
5 answers

How can I recognize slightly modified images?

I have a very large database of jpeg images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Duplicate images are two images that have many (around half) of their pixels with identical values and the rest are…
Eyal
  • 5,196
  • 7
  • 37
  • 66
18
votes
1 answer

elasticsearch fuzzy matching max_expansions & min_similarity

I'm using fuzzy matching in my project mainly to find misspellings and different spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title. As I…
15
votes
2 answers

How to apply machine learning to fuzzy matching

Let's say that I have an MDM system (Master Data Management), whose primary application is to detect and prevent duplication of records. Every time a sales rep enters a new customer in the system, my MDM platform performs a check on existing…
blackgreen
  • 4,019
  • 8
  • 23
  • 41
13
votes
2 answers

Partitioning data on a variable to speed up "fuzzy match" using stringdist

I am building on an answer provided to a previous question about fuzzy matching using stringdist. I have two large datasets (~30k rows) with long strings (consumer product names) that I want to fuzzy match by generating a distance score. There is…
roody
  • 2,411
  • 5
  • 31
  • 47
12
votes
7 answers

Using pen strokes with fuzzy tolerance algorithm as encryption key

How can I encrypt/decrypt with fuzzy tolerance? I want to be able to use a Stroke on an InkCanvas as key for my encryption but when decrypting again the user should not have to draw the exact same symbol, only similar. Can this be done in .NET…
Andreas Zita
  • 6,150
  • 4
  • 39
  • 104
11
votes
1 answer

Joining two datasets using fuzzy logic

I’m trying to do a fuzzy logic join in R between two datasets: the first data set has the name of a location and a column called config; the second data set has the name of a location and two additional attributes that need to be summarized before they are…
steppermotor
  • 661
  • 6
  • 19
10
votes
4 answers

Easiest way to compare two files with lists of song titles

I have two lists of song titles, each in a plain text file, which are the filenames of licensed lyric files - I want to check if the shorter list titles (needle) are in the longer list (haystack). The script/app should return the list of titles in…
pbhj
  • 266
  • 2
  • 12
10
votes
2 answers

Using MinHash to find similarities between 2 images

I am using the MinHash algorithm to find similar images. I came across the post "How can I recognize slightly modified images?", which pointed me to the MinHash algorithm. I was using a C# implementation from this blog post, Set Similarity…
dance2die
  • 31,758
  • 34
  • 122
  • 177
9
votes
1 answer

Fuzzy Matching Numbers

I've been working with Double Metaphone and Caverphone2 for string comparisons and they work well on things like names, addresses, etc. (Caverphone2 is working best for me). However, they produce way too many false positives when you get to numeric…
JeffG
  • 597
  • 1
  • 7
  • 18
9
votes
1 answer

fuzzy matching in R

I am trying to detect matches between an open text field (read: messy!) and a vector of names. I created a silly fruit example that highlights my main challenges. df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6), entry = c("Apple", …
Eric Green
  • 6,401
  • 11
  • 41
  • 82