I'm using Python 3.8.3 and got the unexpected output below when checking the id of strings.

>>> a="d"
>>> id(a)
1984988052656
>>> a+="e"
>>> id(a)
1985027888368
>>> a+="h"
>>> id(a)
1985027888368
>>> a+="i"
>>> id(a)
1985027888368
>>> 

After the line that appends "h" to a, id(a) didn't change. How is that possible when strings are immutable? I got the same output when I used a=a+"h" instead of a+="h", and also when I ran this code from a .py file (I mention that because there are some situations where running code in the shell and running the same code from a saved file give different output).

Chamod
  • There's a weird string concatenation optimization that does this. It violates the rules of how ID values and `+=` are supposed to work - the ID values produced with the optimization in place would be not only impossible, but prohibited, with the unoptimized semantics - but the developers care more about people who would see bad concatenation performance and assume Python sucks. – user2357112 supports Monica Jun 28 '20 at 10:47
  • @user2357112supportsMonica Why does it violate the rules? Could you elaborate? – Holt Jun 28 '20 at 10:49
  • __Two objects with non-overlapping lifetimes may have the same `id()` value.__ referenced from [here](https://www.geeksforgeeks.org/id-function-python/#:~:text=id()%20is%20an%20inbuilt%20function%20in%20Python.&text=As%20we%20can%20see%20the,the%20same%20id()%20value.) Hope that helps. – Xinthral Jun 28 '20 at 10:49
  • @Jesse: Yes, but with `+=` on immutable objects, the old and new values are supposed to have overlapping lifetime. The new value is supposed to be created before the assignment happens, and the old value's lifetime only ends once the new value is assigned to the variable. – user2357112 supports Monica Jun 28 '20 at 10:51
  • Also: [Python strings are not immutable?](https://stackoverflow.com/q/32911533/7851470) – Georgy Jun 30 '20 at 12:28

2 Answers

This is only possible due to a weird, slightly-sketchy optimization for string concatenation in the bytecode evaluation loop. The INPLACE_ADD implementation special-cases two string objects:

case TARGET(INPLACE_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(tstate, left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    else {
        ...

and calls a unicode_concatenate helper that delegates to PyUnicode_Append, which tries to mutate the original string in-place:

void
PyUnicode_Append(PyObject **p_left, PyObject *right)
{
    ...
    if (unicode_modifiable(left)
        && PyUnicode_CheckExact(right)
        && PyUnicode_KIND(right) <= PyUnicode_KIND(left)
        /* Don't resize for ascii += latin1. Convert ascii to latin1 requires
           to change the structure size, but characters are stored just after
           the structure, and so it requires to move all characters which is
           not so different than duplicating the string. */
        && !(PyUnicode_IS_ASCII(left) && !PyUnicode_IS_ASCII(right)))
    {
        /* append inplace */
        if (unicode_resize(p_left, new_len) != 0)
            goto error;

        /* copy 'right' into the newly allocated area of 'left' */
        _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
    }
    ...

The optimization only happens if unicode_concatenate can guarantee there are no other references to the LHS. Your initial a="d" had other references, since Python uses a cache of 1-character strings in the Latin-1 range, so the optimization didn't trigger. The optimization can also fail to trigger in a few other cases, such as if the LHS has a cached hash, or if realloc needs to move the string (in which case most of the optimization's code path executes, but it doesn't succeed in performing the operation in-place).
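As a rough illustration of the "no other references" condition, the sketch below appends to a local string in a loop and counts how often `id()` stays the same. The helper name `count_in_place_appends` and its `keep_alias` flag are made up for this example. Without an alias, the count depends on the CPython version and on what realloc does, so no particular number should be expected. With an alias holding the old string, the old object is still alive when the new one is created, so the count has to be zero.

```python
def count_in_place_appends(n, keep_alias=False):
    """Append n characters to a local string and count unchanged ids.

    Hypothetical helper for illustration only; not part of CPython.
    """
    # Build the string at runtime so it is not a shared constant
    # or a cached 1-character string.
    s = "".join(["d", "e"])
    alias = None
    unchanged = 0
    for _ in range(n):
        if keep_alias:
            alias = s  # second reference to the left-hand side
        before = id(s)
        s += "h"
        if id(s) == before:
            unchanged += 1
    return unchanged, s


# Implementation-dependent: may be 0, n, or anything in between.
print("sole reference:", count_in_place_appends(50)[0])
# Guaranteed 0: the alias keeps the old string alive across the +=.
print("with alias:    ", count_in_place_appends(50, keep_alias=True)[0])
```

The resulting string is the same in both cases; only the identity of the intermediate objects differs.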


This optimization violates the normal rules for id and +=. Normally, += on immutable objects is supposed to create a new object before clearing the reference to the old object, so the new and old objects should have overlapping lifetimes, forbidding equal id values. With the optimization in place, the string after the += has the same ID as the string before the +=.
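For comparison, here is a small sketch of the normal behaviour using a tuple, another immutable type, assuming it gets no comparable in-place special case. The function name `tuple_iadd_ids` is hypothetical. Because the old tuple is still referenced while the new one is being built, the two ids cannot be equal.

```python
def tuple_iadd_ids():
    """Return the id of a tuple before and after += (illustrative helper)."""
    t = tuple([1, 2])  # built at runtime, referenced only by `t`
    before = id(t)
    t += (3,)          # new tuple is created while the old one is still alive
    return before, id(t)


before, after = tuple_iadd_ids()
print(before != after)
```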

The language developers decided they cared more about people who would put string concatenation in a loop, see bad performance, and assume Python sucks, than they cared about this obscure technical point.

user2357112 supports Monica

This is somewhat of a guess: when the GC runs, it's allowed to compact or reorganize memory. In doing so, it's well within its rights to reuse old addresses as long as they are now free. By calling a+="h" you've created a new immutable string, but lost the reference to the string a previously pointed to. That string becomes eligible for garbage collection, meaning the address it used to occupy can be reused.

Mureinik
  • That's not what's going on. With `+=` on immutable objects, the old and new values are supposed to have a briefly overlapping lifetime while the new value has been created, but not yet assigned to the variable, so the new value can't be created in the old object's memory. – user2357112 supports Monica Jun 28 '20 at 10:50