
Something weird happens in this code:

fh = open('romeo.txt', 'r')
lst = list()

for line in fh:
    line = line.split()
    for word in line:
        lst.append(word)

for word in lst:
    numberofwords = lst.count(word)
    if numberofwords > 1:
        lst.remove(word)

lst.sort()

print len(lst)
print lst

romeo.txt is taken from http://www.pythonlearn.com/code/romeo.txt

Result:

27
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']

As you can see, 'the' appears twice. Why is that? I can run this part of the code again:

for word in lst:
    numberofwords = lst.count(word)
    if numberofwords > 1:
        lst.remove(word)

After running this code a second time it deletes the remaining 'the', but why doesn't it work the first time?

Correct output:

26
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
Gunnm
  • Not sure why it was edited; you now need to scroll to see the result. – Gunnm Jul 11 '15 at 11:11
  • My guess is that after a `.remove()`, the `for` loop is not looping correctly (since it probably cannot index the elements correctly anymore). – Jay Bosamiya Jul 11 '15 at 11:12
  • Modifying a list while iterating can lead to undefined behaviour. – Peter Wood Jul 11 '15 at 11:18
  • @PeterWood since Python doesn't have 'undefined behaviour' in the C/C++ sense, it might be better to say "modifying a list while iterating over it can lead to *strange results*". What actually happens is completely defined and deterministic: list iteration works by advancing an internal index, and removing an element at or before the current position shifts everything after it one index to the left, so you end up effectively skipping the value that was at the next index (since it is now at the current index instead). – lvc Jul 11 '15 at 11:32
  • @lvc thanks for the answer! Now I understand what happened. It's exactly as you said. If you check the line, 'is' came just before 'the', and 'is' was removed in both instances. Sentence: It is the east and Juliet is the sun – Gunnm Jul 11 '15 at 11:35
  • http://stackoverflow.com/questions/2896752/removing-item-from-list-during-iteration-whats-wrong-with-this-idiom – Padraic Cunningham Jul 11 '15 at 12:36
  • @lvc you're absolutely correct, and I thought exactly what you said as I typed my comment and hit return. It was sloppy of me, but I was in a hurry. I was going to say 'unexpected', but didn't have the time to expand. Apologies, and thanks for the clear correction. – Peter Wood Jul 11 '15 at 21:32
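
The skipping described in the comments above can be reproduced on a much smaller list. The sketch below is only an illustration: the five-word list is made up (it is not the contents of romeo.txt), and the `visited` list is a hypothetical helper added to record which items the loop actually sees.

```python
# Made-up sample list, chosen so that removing 'is' causes 'the' to be skipped.
words = ['is', 'the', 'is', 'the', 'the']
visited = []  # hypothetical helper: records every item the loop actually sees

for word in words:
    visited.append(word)
    if words.count(word) > 1:
        # remove() deletes the FIRST occurrence, so every later item
        # moves one position to the left while the loop index keeps advancing.
        words.remove(word)

print(visited)  # ['is', 'is', 'the']
print(words)    # ['is', 'the', 'the']
```

The loop only ever looks at three of the five items, and 'the' is still duplicated at the end, which mirrors the leftover 'the' in the question.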

2 Answers


In this loop:

for word in lst:
    numberofwords = lst.count(word)
    if numberofwords > 1:
        lst.remove(word)

lst is modified while iterating over it. Don't do that. A simple fix is to iterate over a copy of it:

for word in lst[:]:
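
For completeness, here is a minimal sketch of the whole loop with that one change applied. To keep it self-contained it assumes a sample sentence (the one quoted in the comments on the question) instead of reading romeo.txt.

```python
# Sample input standing in for the file contents (an assumption for illustration).
lst = "It is the east and Juliet is the sun".split()

# lst[:] is a shallow copy, so the loop walks a list that never changes
# while items are removed from the original.
for word in lst[:]:
    if lst.count(word) > 1:
        lst.remove(word)  # drops the first remaining occurrence

lst.sort()
print(lst)  # ['It', 'Juliet', 'and', 'east', 'is', 'sun', 'the']
```

Because every occurrence in the copy is visited, a word that appears k times is removed k - 1 times and exactly one occurrence survives.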
Yu Hao

Python makes delicious tools available for making these kinds of tasks very easy. By using what is built in, you can usually avoid the kinds of problems you're seeing with explicit loops and modifying the list you are looping over in place:

with open('romeo.txt', 'r') as fh:
    words = sorted(set(fh.read().replace('\n', ' ').split(' ')))

print(len(words))
print(words)
Caleb Hattingh
  • Thanks for sharing the code! I'm still new to Python, so I'm struggling even with the most basic methods, but it's nice to see how much you can improve simple code. – Gunnm Jul 11 '15 at 11:29
  • Nothing wrong with being a beginner! Note how easy it is to read what is happening in the code above. `.read()` slurps the file contents into one chunk of text. `.replace()` changes the newline characters into spaces. `.split()` breaks everything up into words (on spaces). `set()` culls the list of words down to the unique ones. `sorted()` sorts the set and returns an (ordered) list. Hope that helps. – Caleb Hattingh Jul 11 '15 at 11:34
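
The steps listed in that last comment can be followed one at a time on an in-memory string. This is a sketch under stated assumptions: the `text` value is a made-up stand-in for the file contents, and the intermediate names (`flattened`, `pieces`, `unique`, `words_clean`) are introduced here only for illustration. It also shows one thing to watch for: if the text ends with a newline, splitting on a single space leaves an empty string in the result, which `split()` with no argument avoids.

```python
# Sample text standing in for the file contents (an assumption for illustration).
# Like most text files, it ends with a newline.
text = "It is the east and\nJuliet is the sun\n"

flattened = text.replace('\n', ' ')  # newlines become spaces
pieces = flattened.split(' ')        # split on each single space
unique = set(pieces)                 # duplicates dropped, order not kept
words = sorted(unique)               # back to a list, now sorted

print(words)  # ['', 'It', 'Juliet', 'and', 'east', 'is', 'sun', 'the']

# split() with no argument splits on any run of whitespace (spaces, newlines,
# tabs) and never yields empty strings, so the replace() step is not needed.
words_clean = sorted(set(text.split()))
print(words_clean)  # ['It', 'Juliet', 'and', 'east', 'is', 'sun', 'the']
```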