Remove near-duplicate strings from a list of strings using difflib

Question

I am using Python and MySQL. Here is my code:

cur.execute("SELECT distinct product_type FROM cloth_table")
product_type_list = [row[0] for row in cur.fetchall()]  # fetchall() returns one tuple per row

Now product_type_list is a list of strings describing the product type, like this:

product_type_list = ['T_shirts', 'T_shirt', 'T-shirt', 'Jeans', 'Jean', 'Formal Shirt', 'Shirt']

Here product_type_list has three near-duplicate entries for T-shirt and two each for jeans and shirt.

Now I want product_type_list to look like this:

product_type_list = ['T_shirt', 'Jeans', 'Shirt']

I think we can use difflib.SequenceMatcher's quick_ratio(), but how do I do that?

score 0 · Answer 1 · answered Jul 30 '13 at 06:31

I don't know much about difflib.SequenceMatcher, but a fuzzy match like this can be done with MySQL full-text search (FTS).

Look into the FTS matching logic to solve this issue. There is also Soundex, which is available in the database as well as in Python.

With FTS we get a comparison score, like a rank, and we filter the list based on that rank. I did a similar task using SQL Server FTS.

Deiveegaraja Andaver


Thanks for your answer, but I don't have any knowledge of FTS matching logic. Can you provide some links where I can learn about it? – Binit Singh Jul 30 '13 at 06:58
Yes, as I said, I worked on a similar task in SQL Server. These links may help with FTS rank and Soundex logic: http://msdn.microsoft.com/en-us/library/cc879245.aspx , http://msdn.microsoft.com/en-us/library/ms187384.aspx – Deiveegaraja Andaver Jul 30 '13 at 07:22

score 0 · Answer 2 · answered Jul 30 '13 at 06:34

I think you could define your own algorithm to solve this, since most of this is domain-dependent and your list of product types is not that big, I presume. For instance, "Formal" in "Formal Shirt" should be ignored as per your requirement, which may not be true in other domains. So first define your own stop words (words that can be ignored in a product name), then remove a trailing 's', trim whitespace and non-letter characters such as '-' and '_', and convert to upper case. Given this, you could build your own matching algorithm to solve the problem. I had a similar problem and solved it with my own implementation after trying several existing libraries.

You should also keep improving your algorithm, as it is based on heuristics and assumptions.
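
A minimal sketch of the steps above, combined with the difflib.SequenceMatcher idea from the question. The stop-word set, the 0.9 threshold, and the helper names normalize and dedupe are assumptions chosen for this sample list, not a tested production setup:

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop words for this domain; extend to suit your own data.
STOP_WORDS = {"FORMAL"}


def normalize(name):
    """Upper-case, treat non-letters as spaces, drop stop words and a trailing 'S'."""
    words = re.sub(r"[^A-Za-z]+", " ", name).upper().split()
    words = [w for w in words if w not in STOP_WORDS]
    # Strip a plural 'S' only from words longer than three letters.
    words = [w[:-1] if len(w) > 3 and w.endswith("S") else w for w in words]
    return " ".join(words)


def dedupe(names, threshold=0.9):
    """Keep the first name of each group whose normalized keys are near-identical."""
    kept = []  # (original name, normalized key) pairs
    for name in names:
        key = normalize(name)
        is_duplicate = any(
            SequenceMatcher(None, key, kept_key).ratio() >= threshold
            for _, kept_key in kept
        )
        if not is_duplicate:
            kept.append((name, key))
    return [name for name, _ in kept]


product_type_list = ['T_shirts', 'T_shirt', 'T-shirt', 'Jeans', 'Jean',
                     'Formal Shirt', 'Shirt']
print(dedupe(product_type_list))
```

This keeps the first spelling seen in each group rather than picking a canonical one, and a lower threshold would start merging 'Shirt' into 'T-shirt', so the cutoff needs tuning on real data. It also compares every name against every kept key, which is quadratic in the worst case.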

Karthikeyan


Thanks for the answer @karthikeyan, but trimming 's', '_', '-' or spaces is not a permanent solution. I gave a sample list of product_type only to explain my problem better. In production there are lakhs of records, and I might not know in advance how two different strings with similar meanings in English will differ. – Binit Singh Jul 30 '13 at 06:56
Given that, I would suggest a text processing engine like Lucene, where you could fit in the ideas I suggested. You would also have to use NLP to identify words with similar or the same meaning, and that is not simple either :) – Karthikeyan Jul 30 '13 at 07:10
