Questions tagged [string-matching]

String matching is the problem of finding occurrences of one string (“pattern”, “needle”) in another (“text”, “haystack”).

There are two types of string matching:

  • Exact
  • Approximate

Exact string matching is the problem of finding the occurrence(s) of a pattern string within another string or body of text (NIST). For example, finding CGATCGATTA in CTAGATCCTGCGATCGATTAAGCCTGA.
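
As a minimal sketch of the DNA example above, the following Python snippet reports every position at which a pattern occurs in a text, including overlapping occurrences. The helper name `find_all` is hypothetical and chosen only for illustration. It relies on the built-in `str.find` rather than a specialised algorithm such as Knuth-Morris-Pratt or Boyer-Moore.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text.

    Overlapping occurrences are included. An empty pattern yields no matches
    (an assumption made here to keep the loop well defined).
    """
    if not pattern:
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = text.find(pattern, start + 1)
    return positions


print(find_all("CGATCGATTA", "CTAGATCCTGCGATCGATTAAGCCTGA"))  # [10]
```

For a single short pattern this is usually adequate. Searching for many patterns at once tends to call for a dedicated multi-pattern method instead.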

A comprehensive online reference of string matching algorithms is Exact String Matching Algorithms by Christian Charras and Thierry Lecroq.

Approximate string matching, also called fuzzy string matching, searches for matches based on the edit distance between the pattern and the text.
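
To make the role of edit distance concrete, here is a hedged Python sketch with two hypothetical helpers. `levenshtein` computes the classic edit distance between two whole strings. `fuzzy_find` uses the same dynamic programme with a zero-cost first row, so a match may begin anywhere in the text, and returns the end positions of substrings within `max_dist` edits of the pattern. Both names and the unit cost per insertion, deletion, or substitution are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (unit cost per insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def fuzzy_find(pattern: str, text: str, max_dist: int) -> list[int]:
    """Return 1-based end positions in text where some substring ending there
    is within max_dist edits of pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # pattern prefixes versus the empty text
    ends = []
    for j, ct in enumerate(text, start=1):
        new = [0]  # zero cost: a match may start at any text position
        for i, cp in enumerate(pattern, start=1):
            cost = 0 if cp == ct else 1
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + cost))
        col = new
        if col[m] <= max_dist:
            ends.append(j)
    return ends


print(levenshtein("kitten", "sitting"))  # 3
print(fuzzy_find("abc", "xxabcxx", 0))   # [5]
```

Both functions run in time proportional to the product of the two string lengths, so they suit short patterns better than genome-scale searches.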

1969 questions
23 votes
13 answers

Search for string allowing for one mismatch in any location of the string

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 and need to look for each sequence in the entire genome (Toxoplasma gondii parasite). I am not sure how large the genome is, but much longer than 230,000…
Vincent
22 votes
11 answers

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database. For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"…
Ash
21 votes
2 answers

XPath partial of attribute known

I know the partial value of an attribute in a document, but not the whole thing. Is there a character I can use to represent any value? For example, a value of a label for an input is "A. Choice 1". I know it says "Choice 1", but not whether it…
avaleske
21 votes
14 answers

Delete duplicate strings in string array

I am writing a string-processing program in Java in which I need to remove duplicate strings from a string array. In this program, all strings are the same size. The 'array', which is a string array, contains a number of strings in which…
user1339752
19 votes
6 answers

c# string comparison method returning index of first non match

Is there an existing string comparison method that will return a value based on the first occurrence of a non-matching character between two strings? i.e. string A = "1234567890" string B = "1234567880" I would like to get a value back that would…
Andy
19 votes
5 answers

Python: optimal search for substring in list of strings

I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do: listStrings = [ACDE, CDDE, BPLL, ... ] listSubstrings = [ACD, BPI, KLJ, ...] The above entries…
Alopex
19 votes
2 answers

Regex for existence of some words whose order doesn't matter

I would like to write a regex for searching for the existence of some words, but their order of appearance doesn't matter. For example, search for "Tim" and "stupid". My regex is Tim.*stupid|stupid.*Tim. But is it possible to write a simpler regex…
Tim
17 votes
4 answers

strstr faster than algorithms?

I have a file that's 21056 bytes. I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars. I've used all the implementations of the algorithms from…
Josh
17 votes
2 answers

Normalizing the edit distance

Can we normalize the Levenshtein edit distance by dividing the edit distance value by the length of the two strings? I am asking this because, if we compare two strings of unequal length, the difference between the lengths of the two…
17 votes
5 answers

One of strings in array to match an expression

The Problem: I have an array of promises that resolves to an array of strings. Now the test should pass if at least one of the strings matches a regular expression. Currently, I solve it using simple string…
alecxe
17 votes
4 answers

Remove ends of string entries in pandas DataFrame column

I have a pandas DataFrame in which one column contains a list of files: import pandas as pd df = pd.read_csv('fname.csv') df.head() filename A B C fn1.txt 2 4 5 fn2.txt 1 2 1 fn3.txt .... .... I would like to delete the file…
ShanZhengYang
17 votes
2 answers

Fast partial string matching in R

Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text. For small datasets, this can be easily done in R with grepl: patterns = c("some","pattern","a","horse") texts = c("this is a text…
Mulone
17 votes
2 answers

Using Rabin-Karp to search for multiple patterns in a string

According to the Wikipedia entry on the Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the…
MAK
15 votes
1 answer

tsql last "occurrence of" inside a string

I have a field containing comma-separated values. I need to extract the last element in the list. I have tried with this: select list_field, LTRIM(RTRIM(right(list_field, len(list_field) - CHARINDEX(',',list_field)))) But it returns the last part…
Alberto De Caro
15 votes
2 answers

python - regex search and findall

I need to find all matches in a string for a given regex. I've been using findall() to do that until I came across a case where it wasn't doing what I expected. For example: regex = re.compile('(\d+,?)+') s = 'There are 9,000,000 bicycles in…
armandino