Dealt with an annoying sorting bug in Python this week. Just another pitfall of a dynamically typed language I suppose.
The problem surfaced when trying to combine the values of a dictionary into a single string. A simplified example of how this turned up looked something like this:
a_dictionary = {
'thing_one': ...,
'thing_two': ...,
'thing_three': ...,
...
}
def make_a_string_of_it(**kwargs):
return '-'.join(sorted(kwargs.values()))
The result of calling make_a_string_of_it
with the exact same data
would sometimes return a different string. More confusingly, the
number of elements in the dictionary and the values themselves seemed
to matter.
My co-workers and I are all well aware of how, under Python 2,
dictionaries are not ordered. We understand that the values can come
out in whatever order they ended up hashing into when the dictionary
was created. We resolved to work around this in the first place by
calling sorted
on the values before creating the result string.
After spending a while interrogating different dictionaries and
digging into CPython's
dictobject.c
implementation to convince myself that the .values()
method was
really doing nothing very interesting besides walking the internal
table in a for-loop, we came up with a minimal example that didn't
even require a dictionary.
>>> bar = ['foo', u'bar', ()]
>>> sorted(bar)
[(), u'bar', 'foo']
>>> foo = ['foo', (), u'bar']
>>> sorted(foo)
['foo', (), u'bar']
Two lists, of the same elements, when sorted, producing different
results. This fit the bug we were seeing, the first would turn into a
string ()-bar-foo
while the second would become foo-()-bar
.
Taken pairwise, both lists are actually "correct":
from itertools import tee
# https://docs.python.org/2/library/itertools.html#recipes
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = tee(iterable)
try:
next(b)
except StopIteration:
pass
return zip(a, b)
def compare_order(some_list):
for left, right in [i for i in pairwise(sorted(some_list))]:
print('{!r} < {!r}: {}'.format(left, right, (left < right)))
compare_order(['foo', u'bar', ()])
() < u'bar': True
u'bar' < 'foo': True
compare_order(['foo', (), u'bar'])
'foo' < (): True
() < u'bar': True
()
is "less than" u'bar'
'foo'
is "less than" ()
u'bar'
is "less than" 'foo'
Obviously, there is a lack of something like total_ordering across the different types here. I haven't entirely sussed it out, but I think the problem may lie in how Python's timsort guarantees the sorting to be stable, meaning list elements will retain their original order (sort-of, see link for more detail). Since the comparisons aren't totally unambiguous, it tries to retain the original ordering where it can and reaches a sort that suffices but is not unique.
A confounding issue in all of this was how dictionary ordering could sometimes mask the issue. If the keys are "large enough", they'll hash into different "buckets" of the hash table and have the potential to come out of the values method in a different order. Because of the opaque nature of the hashing, it isn't clear when "sort" is doing something asinine as compared to the dictionary treating the data differently.
Thankfully, this isn't an issue in Python 3. Attempting as much
results in a TypeError due to unorderable types. While it is nice to
know that this is fixed in later versions, I'm sure I'll be dealing
with Python 2 for a while longer
still. Until then, a solution for the above was to call str
over the
list of values before sorting.
I must say, I'm glad I've taken the time before to understand Python internals1 a bit better. Coming at some of these things cold can be absolutely baffling.