
I'm fetching continuous updates from a site. Whenever I run my script I get an old_string, which is the string currently stored in my database. I also get a new_string which contains the current text body fetched from the site.

Is there a smart way to check which sentences of new_string are not in old_string, so that I can find the newest updates/changes and store them in newest_updates?

An example where I use --> x <-- to indicate new/modified string:

old_string = 
"Inbound restrictions:
The country’s airports closed to international flights on 18 March and will remain closed until 1 
April. The land and sea borders at this time remain open.
Travellers coming from Brazil, China, Dominican Republic, French Guiana, Italy, Iran, Jamaica, Japan, 
Malaysia, Panama, Singapore, South Korea, St Vincent, Thailand and the US should anticipate increased 
screenings upon arrival. There is also a possibility that these individuals would be denied entry 
into the country, according to government officials.
There are currently no known restrictions on individuals seeking to depart the country."

new_string = 
"Inbound restrictions:
The country’s airports closed to international flights on 18 March and will remain closed until -->5 
April<--. The land and sea borders at this time remain open.
Travellers coming from Brazil, China, Dominican Republic, French Guiana, Italy, Iran,-->Sweden<--, Jamaica, Japan, 
Malaysia, Panama, Singapore, South Korea, St Vincent, Thailand and the US should anticipate increased 
screenings upon arrival. There is also a possibility that these individuals would be denied entry 
into the country, according to government officials.
There are currently no known restrictions on individuals seeking to depart the country.-->

Outbound restrictions:
There are currently no known restrictions on individuals seeking to depart the country.<--"

From this the output would be:

 newest_updates = "The country’s airports closed to international flights on 18 March and will remain 
 closed until 5 April. 

 Travellers coming from Brazil, China, Dominican Republic, French Guiana, Italy, Iran,Sweden, 
 Jamaica, Japan, Malaysia, Panama, Singapore, South Korea, St Vincent, Thailand and the US should 
 anticipate increased screenings upon arrival

 Outbound restrictions:
 There are currently no known restrictions on individuals seeking to depart the country."

What would be the best way to do this? One suggestion is to use difflib, but with difflib I also catch every sentence that the two strings have in common, even if it has not changed.

user4157124
kspr
    Does this answer your question? [Comparing two .txt files using difflib in Python](https://stackoverflow.com/questions/977491/comparing-two-txt-files-using-difflib-in-python) – Jongware Mar 27 '20 at 10:37
    No, it does not work, since difflib also captures shared sentences that have not been modified. – kspr Mar 27 '20 at 11:52

1 Answer

I would use the "in" operator:

First you should split your string at the end of each sentence:

new_strings = new_string.split(".")
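Note that split(".") drops the periods and leaves the line breaks inside each fragment. A minimal sketch of a slightly more careful split, assuming sentences end in ".", "!" or "?" followed by whitespace (the helper name split_sentences is hypothetical, not part of any library):

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    # Collapse line wraps and repeated spaces so wrapped sentences compare equal.
    normalized = " ".join(text.split())
    # Split on whitespace that follows ., ! or ? (the lookbehind keeps the mark).
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    # Drop empty fragments, e.g. when the input is empty.
    return [part for part in parts if part]
```

With this, new_strings = split_sentences(new_string) could stand in for the plain split. A heading that ends in a colon, such as "Inbound restrictions:", stays attached to the sentence that follows it.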

From that point on I would search for sentences that do not match:

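newest_updates = ""
for sentence in new_strings:
    sentence = sentence.strip()
    # Skip empty fragments; keep only sentences that are absent from the old text.
    if sentence and sentence not in old_string:
        # split(".") removed the period, so put it back along with a separator.
        newest_updates += sentence + ". "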

newest_updates now holds every sentence of new_string that does not appear verbatim in old_string.
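Putting the steps together, here is a rough sketch rather than a definitive implementation. It assumes that sentences end in ".", "!" or "?" and that whitespace differences (such as line wraps) should be ignored. The function name find_new_sentences and the short sample strings are made up for illustration. It compares whole sentences through a set instead of doing a substring test:

```python
import re

def find_new_sentences(old_text, new_text):
    """Return the sentences of new_text that do not occur in old_text."""
    def sentences(text):
        # Normalize whitespace, then split after sentence-ending punctuation.
        normalized = " ".join(text.split())
        return [s for s in re.split(r"(?<=[.!?])\s+", normalized) if s]

    old_sentences = set(sentences(old_text))
    # Keep the order of new_text; a sentence counts as new if it is not in the old set.
    return [s for s in sentences(new_text) if s not in old_sentences]

# Shortened, made-up sample data in the spirit of the question.
old_string = "Airports remain closed until 1 April. Borders remain open."
new_string = "Airports remain closed until 5\nApril. Borders remain open. Ferries are suspended."

newest_updates = " ".join(find_new_sentences(old_string, new_string))
print(newest_updates)
# Airports remain closed until 5 April. Ferries are suspended.
```

A modified sentence is reported in full, because it no longer matches any old sentence exactly. A sentence that already exists somewhere in old_string is never reported, even if it reappears in a new place.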

leon52