Friday, December 2, 2011

String equality, identity and interning in Python

On the list of things I should have already known: the difference between using 'is' and == on strings in Python.

Let's look at two strings: one unicode (u"unicode string") and one not ("not unicode string").

Python 2.7.2+ (default, Oct  4 2011, 20:03:08) 
>>> type("foo")
<type 'str'>
>>> type(u"foo")
<type 'unicode'>
>>> u"foo" == "foo"
True
>>> u"foo" is "foo"
False

So using == shows the two strings as equal, and 'is' doesn't. What's going on here?

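Python interns some of its strings, which means only one copy of each such string is stored. In CPython, short string literals like "foo" are interned automatically, though this is an implementation detail rather than a language guarantee. You can see this by using the built-in function id() to see the identity of our strings.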

>>> a = "foo"
>>> b = "foo"
>>> c = u"foo"
>>> print id(a)
3074129864
>>> print id(b)
3074129864
>>> print id(c)
3074128400

You can see our normal strings have the same id because they are the same object. Our unicode string has a different id to our two 'normal' strings: str and unicode are different types, so the two can never be the same object.

Using the == operator asks Python to compare the values of our two strings. Using 'is' compares their identity. As our unicode and normal strings are different objects, comparing them with 'is' returns False.
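
The same gap between equality and identity can show up with two 'normal' strings when one of them is built at runtime. The following is a minimal sketch, not a definitive demonstration: the variable names are made up for illustration, and the identity result depends on CPython's behaviour rather than on anything the language promises. The try/except is only there so the snippet runs on both Python 2 (built-in intern()) and Python 3 (sys.intern()).

```python
# Sketch only: the identity results here are CPython implementation details.
try:
    intern_string = intern  # Python 2: intern() is a built-in
except NameError:
    from sys import intern as intern_string  # Python 3: moved to sys.intern()

a = "foo"
b = "".join(["f", "o", "o"])  # same characters, but assembled at runtime

same_value = (a == b)   # equality compares contents
same_object = (a is b)  # identity compares objects; typically False in CPython

# Interning both strings explicitly maps them to one shared object.
shared = intern_string(a) is intern_string(b)

print(same_value)
print(same_object)
print(shared)
```

Only the == result and the explicitly interned comparison are dependable here; whether 'a is b' is True or False is up to the interpreter.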

I wonder how many of us are guilty of misusing 'is' on strings?
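
As a rough rule of thumb, == is the right tool for comparing string contents, while 'is' fits singletons such as None, where identity is the whole point. A small sketch, with hypothetical helper names invented for this example:

```python
def is_quit_command(text):
    # Hypothetical helper: compare string contents with ==, not with 'is'.
    return text == "quit"


def describe(value=None):
    # 'is' suits singletons such as None, where identity is what matters.
    if value is None:
        return "nothing given"
    return "got " + str(value)


typed = "".join(["qu", "it"])  # stands in for text that arrives at runtime
print(is_quit_command(typed))
print(describe())
```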