Let's say I have two strings that contain similar (but not identical) substrings:

A = """Here is a test of a sentence with a few words in it. 
The rest of this sentence is different, though."""

B = """And here is a test of a sent;ence with a few wordz in it, 
as well. The quick brown fox jumped over the lazy dogs."""

How can I get the similar text between them, i.e. "Here is a test of a sentence with a few words in it" in A and "here is a test of a sent;ence with a few wordz in it" in B?

Edit: as far as I know, this isn't the same thing as calculating an edit distance. Sure, I can calculate an edit distance between "sentence" and "sent;ence", but that doesn't help me to identify the matching substrings.

Jonathan
  • What ideas do you have on this, and what have you tried? – Rory Daulton Oct 22 '16 at 16:23
  • Have you checked out [difflib](https://docs.python.org/3/library/difflib.html) to see if it meets your use-case? – idjaw Oct 22 '16 at 16:24
  • Start from [here](http://stackoverflow.com/questions/18715688/find-common-substring-between-two-strings) and end up [here](http://stackoverflow.com/q/17388213/4099593). – Bhargav Rao Oct 22 '16 at 16:25
  • There is no easy way to achieve this. You need to write your own algorithm. Split each string into a list of words. Check for common substrings that differ by only one intermediate word (or whatever your condition is). – Anonymous Oct 22 '16 at 16:26
  • Is it supposed to be `sent;ence` (in B and solution) and `wordz` (in B)? Or just a typo? – jatinderjit Oct 22 '16 at 16:30
  • @jatinderjit That is the point of the analysis, I believe. If the sentences end up in a state where they are *almost* similar, you still want to find similarities to an "as-close-as-possible" match. – idjaw Oct 22 '16 at 16:31
  • The general approach is to [tokenize](https://en.wikipedia.org/wiki/Lexical_analysis), then compare token streams. But this is a very broad problem. – Tore Eschliman Oct 22 '16 at 16:38
  • You could also look at edit distance: https://en.wikipedia.org/wiki/Edit_distance – Kenny Ostrom Oct 22 '16 at 16:40
  • I don't see how calculating edit distance is the same thing as finding similar substrings. Is the implication that I compare the edit distance between all the words in both strings? And when I have those distances, how can I find the matching substrings? – Jonathan Oct 22 '16 at 17:28
  • When you say "similar substrings", do you mean that words or characters are the smallest unit of similarity? – intrepidhero Oct 22 '16 at 17:36
  • What criteria make two substrings count as equal? Also, are you looking for the longest match, or what exactly? – Padraic Cunningham Oct 22 '16 at 17:37
  • @intrepidhero, I guess words? – Jonathan Oct 22 '16 at 17:45
  • You need to add the criteria for what you consider a close enough match. If it is sequences of words that are the same apart from one different word, it is easy to simplify the problem: take the indexes of the common words and find the longest sequence with a break of at most one index. You could also apply a Hamming distance to the words that break a sequence to get a closer match. – Padraic Cunningham Oct 22 '16 at 17:52
  • Do the words have to be in the same order to match? – intrepidhero Oct 22 '16 at 18:06
  • If your question is about handling some wildcard cases on top of an existing algorithm, I think this topic is a duplicate, because the problem scope is still the same. Also, the description of this question is not clear. – Kir Chou Oct 22 '16 at 18:14
  • @intrepidhero, yes, the words have to be in the same order, since the idea is to match basically the same strings. – Jonathan Oct 22 '16 at 19:07
  • @KirChou, my problem is not how to calculate a particular edit distance (there are plenty of guides for that), but how to find matching substrings in strings. – Jonathan Oct 22 '16 at 19:19
  • The linked answer does not relate to the question asked. This is much closer: http://stackoverflow.com/questions/18715688/find-common-substring-between-two-strings – intrepidhero Oct 22 '16 at 20:36
  • Edit distance is necessary, but not sufficient to answer the question. Use the edit distance to convert each word into a `set` of words within a distance of 1. You'll probably still have to do the brute-force comparison outlined in the sequence comparison answers above, but instead of checking equality, check for set intersection. – Tore Eschliman Oct 22 '16 at 21:00
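
Several of the comments above point the same way: tokenize into words, treat two words as "equal" when they are close enough, and then look for the longest run of consecutive matching words in the same order. Below is a minimal sketch of that idea, not a definitive solution. The function names `similar` and `longest_fuzzy_run` and the `0.75` threshold are made up for illustration, and the word comparison uses `difflib.SequenceMatcher(...).ratio()` as a stand-in for the edit-distance check suggested in the comments.

```python
import difflib

A = """Here is a test of a sentence with a few words in it. 
The rest of this sentence is different, though."""

B = """And here is a test of a sent;ence with a few wordz in it, 
as well. The quick brown fox jumped over the lazy dogs."""


def similar(word_a, word_b, threshold=0.75):
    """Hypothetical fuzzy equality: ignore case and edge punctuation,
    then compare with difflib's similarity ratio (assumed threshold)."""
    a = word_a.lower().strip(".,;:!?")
    b = word_b.lower().strip(".,;:!?")
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


def longest_fuzzy_run(text_a, text_b, threshold=0.75):
    """Longest run of consecutive, pairwise-similar words (same order).

    Classic longest-common-substring dynamic programming over word lists,
    with `similar` replacing strict equality. Runs in O(len(a) * len(b))
    word comparisons.
    """
    a_words, b_words = text_a.split(), text_b.split()
    best_len = best_i = best_j = 0
    prev = [0] * (len(b_words) + 1)
    for i, wa in enumerate(a_words, 1):
        cur = [0] * (len(b_words) + 1)
        for j, wb in enumerate(b_words, 1):
            if similar(wa, wb, threshold):
                # Extend the run that ended at the previous word pair.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_i, best_j = cur[j], i, j
        prev = cur
    return (
        " ".join(a_words[best_i - best_len:best_i]),
        " ".join(b_words[best_j - best_len:best_j]),
    )


match_a, match_b = longest_fuzzy_run(A, B)
print(match_a)
print(match_b)
```

Because the run must be strictly consecutive, a single inserted or dropped word breaks the match. Allowing a gap of one word, as one comment suggests, would need an extra case in the inner loop. The threshold is also a judgment call: short words such as "is" and "it" only stay distinct because their ratio falls below it.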
