Python: Semantic similarity score for Strings

Question

Are there any libraries for computing semantic similarity scores for a pair of sentences?

I'm aware of WordNet's semantic database and how to generate a score for two words, but I'm looking for libraries that handle all the pre-processing tasks (Porter stemming, stop-word removal, etc.) on whole sentences and output a score for how related the two sentences are.

I found a work in progress written for the .NET framework that computes the score using an array of pre-processing steps. Is there any project that does this in Python?

I'm not looking for the sequence of operations that would help me find the score (as is asked for here).
I'd love to implement each stage on my own, or glue functions from different libraries so that it works for sentence pairs, but I need this mostly as a tool to test inferences on data.

EDIT: I was considering using NLTK to compute the score for every pair of words across the two sentences and then drawing inferences from the standard deviation of the results, but I don't know whether that's a legitimate estimate of similarity. Plus, that will take a LOT of time for long strings.
Again, I'm looking for projects/libraries that already implement this intelligently. Something that lets me do this:

import amazing_semsim_package
str1='Birthday party ruined as cake explodes'
str2='Grandma mistakenly bakes cake using gunpowder'

>>similarity(str1,str2)
>>0.889

Consider vector-based semantic models or matrix-decomposition models to compare sentence similarity. If not, you can fall back on a Lesk-like cosine approach that first vectorizes each sentence and then calculates the cosine between the two vectors. – alvas, Jun 13 '13 at 13:17
If you are looking to weight something as a cutoff, or desperately need the score, consider NLTK's Wu-Palmer (wup) similarity. You would need something like CLiPS Pattern to get the word type (verb, noun, adjective, etc.). You can use that to find the best number of categories for LSA/LDA as found in gensim, or for a fuzzy/cosine implementation of k-means. – Andrew Scott Evans, Jul 09 '15 at 04:50
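
The cosine fallback mentioned in the first comment can be sketched with the standard library alone. This is a minimal illustration, not a semantic model: it only measures word overlap between bag-of-words vectors, and the tiny stop-word set and the helper names (`vectorize`, `cosine_similarity`) are assumptions made up for this example.

```python
import math
import re
from collections import Counter

# Assumption: a deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "as", "at", "in", "of", "on", "the", "to"}


def vectorize(sentence):
    """Turn a sentence into a bag-of-words count vector (a Counter)."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between the two bag-of-words vectors."""
    va, vb = vectorize(sentence_a), vectorize(sentence_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


str1 = 'Birthday party ruined as cake explodes'
str2 = 'Grandma mistakenly bakes cake using gunpowder'
score = cosine_similarity(str1, str2)
```

Because the two example sentences share only the word "cake", the score comes out low even though a human would call them closely related, which is exactly why the comment recommends vector-space or matrix-decomposition models first.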

score 49 · Accepted Answer · edited Nov 11 '16 at 23:08

The best package I've seen for this is Gensim, found at the Gensim homepage. I've used it many times and have been very happy with its ease of use. It is written in Python and has an easy-to-follow tutorial to get you started, which compares 9 strings. It can be installed via pip, so getting it set up shouldn't be much of a hassle.

Which scoring algorithm you use depends heavily on the context of your problem, but I'd suggest starting off with the LSI functionality if you want something basic. (That's what the tutorial walks you through.)

If you go through the gensim tutorial, it will walk you through comparing two strings using the similarities module. This lets you see how your strings compare to each other, or to some other string, on the basis of the text they contain.

If you're interested in the science behind how it works, check out this paper.

This looks very promising. Thank you for pointing this out Justin. — user8472, Jun 25 '13 at 10:21

3xCh1_23 · Answer 2 · 2014-12-08T19:05:55.803

5

Unfortunately, I cannot help you with Python, but you may take a look at my old project, which uses dictionaries to perform semantic comparisons between sentences (and which could later be coded in Python with vector-space analysis). Translating it from Java to Python should take just a few hours of coding. https://sourceforge.net/projects/semantics/

edited Dec 08 '14 at 19:05

answered Sep 26 '14 at 15:44

3xCh1_23

score -4 · Answer 3 · answered Jun 10 '13 at 12:04

AFAIK the most powerful NLP library for Python is http://nltk.org/

pypat

NLTK has some 6 scores for semantic similarity between a pair of word concepts, but I'm looking to compare two strings (of several, maybe hundreds of, words) – user8472 Jun 11 '13 at 05:40
not relevant to question – Kukesh Mar 27 '21 at 09:27
