Questions tagged [difflib]

A Python module that provides tools for computing and working with differences between sequences, especially useful for comparing text. It includes functions that produce reports in several common difference formats.

A Python module that provides classes and functions for comparing sequences. It can be used, for example, to compare files, and can produce difference information in various formats, including HTML and context and unified diffs.
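As a minimal sketch of what the description above covers, the following compares two small lists of lines with `difflib.unified_diff` and scores two strings with `SequenceMatcher.ratio()`. The file names `old.txt` and `new.txt` and the sample data are illustrative assumptions, not taken from any question on this page.

```python
import difflib

# Hypothetical sample data: two versions of a three-line file.
old = ["one\n", "two\n", "three\n"]
new = ["one\n", "2\n", "three\n"]

# unified_diff yields the diff line by line; the file names are labels only.
diff = list(difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"))
print("".join(diff), end="")

# SequenceMatcher.ratio() returns a similarity score between 0.0 and 1.0.
ratio = difflib.SequenceMatcher(None, "hellboy", "hell-boy").ratio()
print(round(ratio, 3))
```

The first argument to `SequenceMatcher` is the optional `isjunk` callable; passing `None` means no element is treated as junk.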

271 questions
7
votes
1 answer

Is it possible that the SequenceMatcher in Python's difflib could provide a more efficient way to calculate Levenshtein distance?

Here's the textbook example of the general algorithm to calculate Levenshtein Distance (I've pulled from Magnus Hetland's website): def levenshtein(a,b): "Calculates the Levenshtein distance between a and b." n, m = len(a), len(b) if n >…
damzam
  • 1,809
  • 14
  • 17
6
votes
2 answers

making difflib's SequenceMatcher ignore "junk" characters

I have a lot of strings that I want to match for similarity (each string is 30 characters on average). I found difflib's SequenceMatcher great for this task, as it was simple and the results were good. But if I compare hellboy and hell-boy like…
lovesh
  • 4,813
  • 7
  • 55
  • 89
6
votes
0 answers

Ignoring whitespace in a python diff

Is there an elegant way to ignore whitespace in a diff in Python (using difflib, or any other module)? Maybe I missed something, but I've scoured the documentation and was unable to find any explicit support for this in difflib. My current solution…
Max Wallace
  • 3,185
  • 26
  • 41
6
votes
2 answers

Approximate string matching of author names - modules and strategies

I've created a small program that checks if authors are present in a database of authors. I haven't been able to find any specific modules for this problem, so I'm writing it from scratch using modules for approximate string matching. The database…
Misconstruction
  • 1,399
  • 4
  • 14
  • 22
5
votes
1 answer

In python, produce HTML highlighting the differences of two simple strings

I need to highlight the differences between two simple strings with Python, enclosing the differing substrings in an HTML span attribute. So I'm looking for a simple way to implement the function illustrated by the following…
user1069609
  • 821
  • 3
  • 15
  • 30
5
votes
2 answers

SequenceMatcher - finding the two most similar elements of two or more lists of data

I was trying to compare a set of strings to an already defined set of strings. For example, you want to find the addressee of a letter whose text has been digitized via OCR. There is an array of addresses, which has dictionaries as elements. Each…
valerius21
  • 301
  • 1
  • 12
5
votes
2 answers

difflib.SequenceMatcher isjunk argument not considered?

In the python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the supposed behavior is? Why does the isjunk argument seem to not make any difference in this case? difflib.SequenceMatcher(None, "AA", "A…
bluelogic
  • 51
  • 3
5
votes
2 answers

how to get multiple matches with difflib.SequenceMatcher?

I am using difflib to identify all the matches of a short string in a longer sequence. However it seems that when there are multiple matches, difflib only returns one: > sm = difflib.SequenceMatcher(None, a='ACT', b='ACTGACT') >…
dalloliogm
  • 7,737
  • 5
  • 40
  • 55
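The behaviour described in this question can be reproduced with a short sketch. `SequenceMatcher` aligns the two sequences rather than searching for every occurrence, so `get_matching_blocks()` reports the short string only once. The helper `find_all` below is a hypothetical plain substring scan, not part of difflib, shown only as one possible way to list every position.

```python
import difflib

a, b = "ACT", "ACTGACT"

# get_matching_blocks() describes one alignment of a against b;
# its final entry is always a zero-length sentinel.
sm = difflib.SequenceMatcher(None, a=a, b=b)
blocks = sm.get_matching_blocks()
print(blocks)

def find_all(needle, haystack):
    """Hypothetical helper: start index of every (possibly overlapping) occurrence."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions

positions = find_all(a, b)
print(positions)
```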
5
votes
3 answers

Get close string matches considering deletion - python

Is there a way to let difflib consider deletion in string matching? I've tried the difflib.get_close_matches() but it doesn't consider strings with lower length in the close matches output. E.g. from difflib import get_close_matches as gcm x =…
alvas
  • 94,813
  • 90
  • 365
  • 641
5
votes
0 answers

Can difflib's charjunk be used to ignore whitespace?

I'd like to compare differences between two lists of strings. For my purposes, whitespace is noise and these differences do not need to be shown. Reading into difflib's documentation, "the default [for charjunk] is module-level function…
Mike T
  • 34,456
  • 15
  • 128
  • 169
4
votes
2 answers

Python difflib gnu patch compatibility

Is it possible to create a patch with the Python module difflib that is compatible with GNU patch? I tried using unified_diff and context_diff, and also tried specifying lineterm as "\n", but I'm still getting this error: [intense@Singularity Desktop]$…
intense
  • 197
  • 1
  • 8
4
votes
3 answers

Python Difflib Deltas and Compare Ndiff

I was looking to do something like what I believe change control systems do: they compare two files and save a small diff each time the file changes. I've been reading this page: http://docs.python.org/library/difflib.html and it's not sinking in…
NealWalters
  • 14,090
  • 34
  • 109
  • 199
4
votes
1 answer

What is the standard way to represent subsequent changes in a text and to work with this representation using Python?

Assume that I have some text (for example given as a string). Later I am going to "edit" this text, which means that I want to add something somewhere or remove something. In this way I will get another version of the text. However, I do not want to…
Roman
  • 97,757
  • 149
  • 317
  • 426
4
votes
3 answers

Python's difflib SequenceMatcher speed up

I'm using difflib SequenceMatcher (ratio() method) to define similarity between text files. While difflib is relatively fast to compare a small set of text files e.g. 10 files of 70 kb on average comparing to each other (46 comparisons) takes about…
user734094
4
votes
3 answers

Determine where documents differ with Python

I have been using the Python difflib library to find where 2 documents differ. The Differ().compare() method does this, but it is very slow: at least 100x slower for large HTML documents compared to the diff command. How can I efficiently determine…
hoju
  • 24,959
  • 33
  • 122
  • 169