I have used material from here and a previous forum page to write a program that automatically calculates the semantic similarity between consecutive sentences across a whole text. Here it is:

The first part of the code is copied and pasted from the first link. I removed everything after line 245 of that code and added the part below in its place.

with open ("File_Name", "r") as sentence_file:
    while x and y:
        x = sentence_file.readline()
        y = sentence_file.readline()
        similarity(x, y, true)           
#boolean set to false or true 
        x = y
        y = sentence_file.readline() 

My text file is formatted like this:

Red alcoholic drink. Fresh orange juice. An English dictionary. The Yellow Wallpaper.

In the end I want to display all the pairs of consecutive sentences with their similarity next to them, like this:

["Red alcoholic drink.", "Fresh orange juice.", 0.611],

["Fresh orange juice.", "An English dictionary.", 0.0]

["An English dictionary.", "The Yellow Wallpaper.",  0.5]

if norm(vec_1) > 0 and if norm(vec_2) > 0:
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1)* np.linalg.norm(vec_2))
 elif norm(vec_1) < 0 and if norm(vec_2) < 0:
    ???Move On???
Sigmund Reed
  • `dict.has_key` has been deprecated for nearly a decade, now: https://docs.python.org/3.0/whatsnew/3.0.html#builtins –  Jan 11 '17 at 17:01
  • Sorry, so is that the only problem, and if so, how can I fix it? Probably a stupid q. but I'm really new to Python. – Sigmund Reed Jan 11 '17 at 17:05
  • My previous comment contained a link. Click on the link. Look at the page contained therein. Read the bullet point about `dict.has_key()`. –  Jan 11 '17 at 17:06
  • Hint: what is meant by "`dict.has_key()` has been deprecated" is that you can no longer call the `has_key` method on a dictionary. Instead, use the `in` membership operator. https://docs.python.org/3/reference/expressions.html#membership-test-operations –  Jan 11 '17 at 17:13
  • Hi, I apologize but Python is still very new for me. I swapped hypernyms_2.has_key(lcs_candidate): for hypernyms_2.in(lcs_candidate): and it said invalid syntax. – Sigmund Reed Jan 11 '17 at 17:17
  • That's because `in` is an operator, not a method. Try `lcs_candidate in hypernyms_2` –  Jan 11 '17 at 17:19
  • Sorry again, I fixed that stuff (thank you so much) but then I get this. Look in the comments please. – Sigmund Reed Jan 11 '17 at 17:27
  • I suspect that's caused by dividing by zero somewhere... Also, cosine similarity is built in to SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine –  Jan 11 '17 at 17:30
  • What would you suggest to fix that mess? Preferably without using scipy and sticking to the code I have already. – Sigmund Reed Jan 11 '17 at 17:32
  • Check to make sure that neither `vec_1` nor `vec_2` is the zero vector (ie has length zero) before calculating the cosine similarity. Just use `if`/`else`...ie if the norms of the vectors are both positive, then you're good to go, otherwise...well, skip that pair or throw an exception or...do what you want to do. –  Jan 11 '17 at 17:35
  • If you don't want to use SciPy to calculate the cosine similarity, then that's fine, too...calculating the dot product and dividing by the product of the norms works as well. Just make sure that both of the norms are positive. –  Jan 11 '17 at 17:37
  • Also, it's worth pointing out that you only got a warning, not an exception (ie your code kept going). Testing on my end indicates that `np.nan` (ie NumPy's `nan` value--`nan` meaning "not a number") would be returned when `vec_1` or `vec_2` have a norm of zero. –  Jan 11 '17 at 17:39
  • This is going to be really annoying but I'm a linguistics professor with minimum to no Python experience, how would this be done? I realize how sickening I am but I can't find any other help on short notice. Also nothing was returned not even nan. – Sigmund Reed Jan 11 '17 at 17:39
  • Well, what do you want to do if you encounter a vector with norm zero when computing the cosine similarities? Throw an error and quit? Silently continue with the next pair (assuming that you're computing these inside some `for` loop, which may or may not be the case)? That's not a question that I can answer. You have to decide the flow of logic for your code. –  Jan 11 '17 at 17:41
  • You can also just let the warnings be thrown and deal with the `nan` values in the output afterwards. –  Jan 11 '17 at 17:42
  • I tried something in the comments, it is obviously erroneous. Also don't know how to implement. – Sigmund Reed Jan 11 '17 at 17:59
  • Norms of vectors are never negative.... So, your `elif norm(vec_1) < 0 and if norm(vec_2) < 0:` can just be an `else:` –  Jan 11 '17 at 18:00
  • Also, `if norm(vec_1) > 0 and if norm(vec_2) > 0:` is invalid syntax. http://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/ifstatements.html –  Jan 11 '17 at 18:05
  • Incidentally, I don't know what you're using to write your code, but you might want to use an IDE (integrated development environment) or text editor with the ability to point out simple syntax errors. I'd recommend PyCharm: https://www.jetbrains.com/pycharm/ (there's a free and not-free edition...the free edition will be more than adequate for what you're trying to do). –  Jan 11 '17 at 18:09
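
The zero-norm check discussed in the comments above can be sketched as follows. This is only an illustration, not the code from the linked script: it uses plain Python lists and the `math` module so that it runs without NumPy, the function name `cosine_similarity` is made up for this example, and returning `None` for a zero vector is an assumption (raising an exception would be just as reasonable).

```python
import math


def cosine_similarity(vec_1, vec_2):
    """Return the cosine similarity of two equal-length vectors,
    or None if either vector has a norm of zero."""
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))

    # A norm is never negative, so a plain else covers the zero case.
    if norm_1 > 0 and norm_2 > 0:
        dot = sum(a * b for a, b in zip(vec_1, vec_2))
        return dot / (norm_1 * norm_2)
    else:
        # Assumption: signal "no similarity available" with None
        # instead of dividing by zero.
        return None


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 0.0]))  # second vector is zero
```

The same `if`/`else` shape applies to the `np.dot(...) / (np.linalg.norm(...) * np.linalg.norm(...))` expression in the question: compute both norms first, and only divide when both are positive.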

1 Answer

This should work. There are a few things to note in the comments. Basically, you can loop through the lines in the file and store the results as you go. One way to process two lines at a time is to set up an "infinite loop" and check the last line we've read to see if we've hit the end (readline() returns an empty string at the end of a file).

# You'll probably need the file extension (.txt or whatever) in open as well
with open("File_Name.txt", "r") as sentence_file:
    # Initialize a list to hold the results
    results = []

    # Read the first line; .rstrip('\n') removes the newline character
    x = sentence_file.readline().rstrip('\n')

    # Loop until we hit the end of the file
    while True:
        # Read the next line, so that x and y are consecutive lines
        y = sentence_file.readline()

        # readline() returns an empty string at the end of the file
        if not y:
            # Break out of the infinite loop
            break

        y = y.rstrip('\n')

        try:
            # Calculate your similarity value
            similarity_value = similarity(x, y, True)

            # Add the two lines and similarity value to the results list
            results.append([x, y, similarity_value])
        except Exception as error:
            # Report the failing pair and the reason, then keep going
            print("Error when parsing lines:\n{}\n{}\n{}\n".format(x, y, error))

        # The second line becomes the first line of the next pair
        x = y

# Loop through the pairs in the results list and print them
for pair in results:
    print(pair)

Edit: Regarding the issues you're getting from similarity(): if you simply want to ignore the line pairs that cause these errors (without looking at the source in depth I really have no idea what's going on), you can wrap the call to similarity() in a try/except block.
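
If the sentences all sit on a single line, as in the sample in the question, reading line by line will not separate them. A minimal sketch of one alternative, under stated assumptions: split the text into sentences with a simple regular expression, then pair each sentence with the next one using `zip`. The names `split_sentences`, `consecutive_pairs` and `word_overlap` are made up for this illustration, and `word_overlap` is only a toy stand-in for the real `similarity()` function, so its scores will not match the numbers in the question.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def consecutive_pairs(sentences, score):
    """Return [first, second, score(first, second)] for each adjacent pair."""
    return [[a, b, score(a, b)] for a, b in zip(sentences, sentences[1:])]


def word_overlap(a, b):
    """Hypothetical stand-in for similarity(): share of words in common."""
    words_a = set(a.lower().rstrip(".!?").split())
    words_b = set(b.lower().rstrip(".!?").split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


text = ("Red alcoholic drink. Fresh orange juice. "
        "An English dictionary. The Yellow Wallpaper.")

for row in consecutive_pairs(split_sentences(text), word_overlap):
    print(row)
```

To use the real function, something like `lambda a, b: similarity(a, b, True)` could be passed in place of `word_overlap`. The regular expression is deliberately naive and will split on abbreviations such as "Dr. Smith".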

Avantol13