Why does 'the' survive after .remove?

Question

Something weird happens in this code:

fh = open('romeo.txt', 'r')
lst = list()

for line in fh:
    line = line.split()
    for word in line:
        lst.append(word)

for word in lst:
    numberofwords = lst.count(word)
    if numberofwords > 1:
        lst.remove(word)

lst.sort()

print(len(lst))
print(lst)

romeo.txt is taken from http://www.pythonlearn.com/code/romeo.txt

Result:

27
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']

As you can see, there are two 'the'. Why is that? I can run this part of code again:

for word in lst:
    numberofwords = lst.count(word)
    if numberofwords > 1:
        lst.remove(word)

After running this code a second time it deletes the remaining 'the', but why doesn't it work the first time?

Correct output:

26
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']

I don't know why it was edited; you now need to scroll to see the result. — Gunnm, Jul 11 '15 at 11:11
My guess is that after a `.remove()`, the `for` loop is not looping correctly (since it probably cannot index the elements correctly anymore). — Jay Bosamiya, Jul 11 '15 at 11:12
Modifying a list while iterating can lead to undefined behaviour. — Peter Wood, Jul 11 '15 at 11:18
@PeterWood since Python doesn't have 'undefined behaviour' in the C/C++ sense, it might be better to say "modifying a list while iterating over it can lead to *strange results*". What actually happens is completely defined and deterministic - list iteration works by advancing an internal index, and removing an element before the current one causes everything to shift left one index, so you end up effectively skipping the value that was at the next index (since it is now at the current index instead). — lvc, Jul 11 '15 at 11:32
@lvc thanks for the answer! Now I understand what happened. It's exactly as you said. If you check the line, 'is' came just before 'the' and was removed in both instances. Sentence: It is the east and Juliet is the sun — Gunnm, Jul 11 '15 at 11:35
http://stackoverflow.com/questions/2896752/removing-item-from-list-during-iteration-whats-wrong-with-this-idiom — Padraic Cunningham, Jul 11 '15 at 12:36
@lvc you're absolutely correct and I thought exactly what you have said as I typed my answer and hit return. It was sloppy of me, but I was in a hurry. I was going to say 'unexpected', but didn't have the time to expand. Apologies, and thanks for the clear correction. — Peter Wood, Jul 11 '15 at 21:32
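
The index-shifting behaviour lvc describes can be reproduced on a small scale. This is only a sketch: the function name `dedupe_while_iterating` and the toy word list are made up for illustration and are not the contents of romeo.txt. It runs the same loop as the question and records which positions the iterator actually visits.

```python
def dedupe_while_iterating(words):
    """Run the loop from the question and record each (index, word) visited."""
    visited = []
    for index, word in enumerate(words):
        visited.append((index, word))
        if words.count(word) > 1:
            # remove() deletes the first occurrence, so every later
            # element shifts one position to the left
            words.remove(word)
    return visited


# Toy input (an assumption for illustration, not the real file)
words = ['is', 'the', 'is', 'the', 'the']
visited = dedupe_while_iterating(words)

print(visited)  # [(0, 'is'), (1, 'is'), (2, 'the')]
print(words)    # ['is', 'the', 'the']
```

Removing the first 'is' moves the first 'the' to index 0, which the loop has already passed, so that 'the' is never examined. The loop then ends early because the list has become shorter, leaving two copies of 'the' behind.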

score 14 · Accepted Answer · answered Jul 11 '15 at 11:13

In this loop:

for word in lst:
    numberofwords = lst.count(word)
    if numberofwords > 1:
        lst.remove(word)

lst is modified while iterating over it. Don't do that. A simple fix is to iterate over a copy of it:

for word in lst[:]:
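
Spelled out in full, the fix might look like the sketch below. The helper name `remove_duplicates` and the toy list are assumptions for illustration; in the question, `lst` would hold the words read from romeo.txt.

```python
def remove_duplicates(lst):
    """Remove repeated words from lst in place, keeping one of each."""
    # lst[:] is a shallow copy, so the loop walks a list that never changes
    for word in lst[:]:
        if lst.count(word) > 1:
            lst.remove(word)  # mutate the original, not the copy
    return lst


# Toy input standing in for the words from the file
print(remove_duplicates(['is', 'the', 'is', 'the', 'the']))  # ['is', 'the']
```

Because the copy is not affected by `remove()`, every word is examined exactly once per occurrence and none is skipped.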

answered Jul 11 '15 at 11:13 by Yu Hao

It works. I still don't know exactly what's going on but thanks for quick response. – Gunnm Jul 11 '15 at 11:23
@Gunnm Remember not to modify a list while iterating over it for now. The reason behind it probably becomes clear when you learn more. – Yu Hao Jul 11 '15 at 11:35
Yeah, @lvc already cleared this up in a comment below my question – Gunnm Jul 11 '15 at 11:42
`for word in reversed(lst):` would avoid creating a new list – Padraic Cunningham Jul 11 '15 at 12:27
This would be a great answer if it explained *why* you shouldn't modify a list you're iterating. – Paul Phillips Jul 11 '15 at 15:41
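
The `reversed(lst)` suggestion in the comments above can be sketched as follows. The helper name `remove_duplicates_reversed` and the toy list are assumptions for illustration. It appears to work for this particular loop because `remove()` always deletes the first occurrence, which sits at or before the position being visited, so no unvisited element is skipped; at worst the current word is seen again on the next step.

```python
def remove_duplicates_reversed(lst):
    """Remove repeated words in place while walking the list back to front."""
    for word in reversed(lst):  # no copy of the list is made
        if lst.count(word) > 1:
            # The first occurrence is at or before the current position,
            # so removing it cannot skip an element not yet visited
            lst.remove(word)
    return lst


# Toy input standing in for the words from the file
print(remove_duplicates_reversed(['is', 'the', 'is', 'the', 'the']))  # ['is', 'the']
```

This depends on the details of `remove()` and of list iteration, so iterating over a copy remains the easier approach to reason about.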

score 6 · Answer 2 · edited Jul 11 '15 at 20:42

Python makes delicious tools available for making these kinds of tasks very easy. By using what is built in, you can usually avoid the kinds of problems you're seeing with explicit loops and with modifying a list in place while looping over it:

with open('romeo.txt', 'r') as fh:
    # split() with no argument splits on any run of whitespace and never yields empty strings
    words = sorted(set(fh.read().replace('\n', ' ').split()))

print(len(words))
print(words)

edited Jul 11 '15 at 20:42 by Peter Mortensen
answered Jul 11 '15 at 11:25 by Caleb Hattingh

Thanks for sharing the code! I'm still new to Python, so I'm struggling even with the most basic methods, but it's nice to see how much you can improve simple code. – Gunnm Jul 11 '15 at 11:29
Nothing wrong with being a beginner! Note how easy it is to read what is happening in the code above. `.read()` slurps the file contents into one chunk of text. `.replace()` changes the newline characters into spaces. `.split()` breaks everything up into words (spaces). `set()` culls the list of words into uniques. `sorted()` sorts the set and returns a list (ordered). Hope that helps. – Caleb Hattingh Jul 11 '15 at 11:34
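
The steps listed in that comment can be traced one at a time. This is a sketch only: the function name `unique_sorted_words` and the sample string are assumptions for illustration, and the text is held in memory so it runs without romeo.txt.

```python
def unique_sorted_words(text):
    """Return the distinct words in text as a sorted list."""
    flat = text.replace('\n', ' ')  # newlines become spaces
    tokens = flat.split()           # split on runs of whitespace
    unique = set(tokens)            # a set keeps one copy of each word
    return sorted(unique)           # sorted() returns a new, ordered list


# Sample text (an assumption for illustration), ending in a newline like a file would
sample = "It is the east\nand Juliet is the sun\n"
print(unique_sorted_words(sample))
# ['It', 'Juliet', 'and', 'east', 'is', 'sun', 'the']
```

The sketch calls `split()` with no argument. With `split(' ')`, the space that replaces the trailing newline would produce an empty string in the result. Capitalised words sort first because uppercase letters precede lowercase ones in the default string ordering.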
