Wednesday, March 31, 2010

The perils of side effects, an example

I've just spent two hours of my life debugging a problem that demonstrates perfectly the problem of side effects in functions. The code in question was fairly unremarkable:

print convert_to_customer_format(data, default_item)
customer_interface.send(convert_to_customer_format(data, default_item))

the print line had been innocently added for debug, but it caused an error in the customer interface component, as the line had been added to assist with debug on the customer interface component it was a while before the error was traced back to the code above. We log all calls and returns from functions, so it didn't take long to realise that what we were printing was different to what we were trying to send, so suspicion quickly fell on the conversion function:

def convert_to_customer_format(data, default_item):
#turn into list
for key in data.keys():
data[key].insert(0, key)
data = data.values()
#pad with empty items
default_item.insert(0, 'Empty')
number_of_pads = range(len(data), 16)
for x in number_of_pads:
return data

It's not a very pretty function, but you if you run it like so:

default_item = [0, 0, 0, 0]
data = defaultdict(lambda:default_item)
data['foo'] = [1, 2, 3, 4]
data['bar'] = [5, 6, 7, 8]
x = side_effect(data, default_item)
print x
y = side_effect(data, default_item)
print y

You'll see that it changes the data each time. You wouldn't have this kind of trouble in Haskell! It got me thinking though, could I make a decorator that would force a function to not have side effects, or at least raise an exception if it did. I think I've done it, i'd appreciate comments:

def no_side_effects(func):
def inner_func(*args, **kwargs):
pre_call_args = copy.deepcopy(args)
pre_call_kwargs = copy.deepcopy(kwargs)
print 'pre_call: %s| %s' % (pre_call_args, pre_call_kwargs)
result = func(*args, **kwargs)
print 'post call: %s| %s' % (args, kwargs)
if args == pre_call_args and kwargs == pre_call_kwargs:
return result
raise Exception('Side effect found: Function altered the arguments')
return inner_func

Tuesday, March 2, 2010

The danger of assumptions

When it comes to error messages, and error recovery, Our team is often asked - usually unknowingly, to make an assumption about the current state of the system, and how we can recover from that assumed state. From the everyday world, most of us know to avoid making assumptions, as the popular saying goes - they make asses of us. The same is true in the software world, and the cost of making a bad assumption can be just as big.
For one of our customers, unplugging a hard drive in the middle of a test can be destructive. As a result, any drive that is pulled out mid test needs to go through an extra stringent failure analysis which takes time, personnel, and money. But in addition, investigations are launched to determine how a drive could be pulled out mid test. How do we know this has happened? Our customer logs a helpful error message to tell us:

Drive unplugged while test in progress. Stopping all tests...

Fairly good error message yes? Not only does it tell us what went wrong, but it tells us what action we will be taking - "if only all error messages looked like that!" I hear you cry.

What isn't clear is that the error "Drive unplugged while test in progress" is making an assumption, you can't see it because it's in the code.
The assumption in question seems, at first glance, to be perfectly acceptable. The assumption is this:

"If we have made a call to start the test, the test will be running" - Again, this assumption isn't very clear in the code itself, as if often the case in imperative programming, it's all about changing the state. Booleans are sent in certain places and read in others and all of this makes it quite difficult to spot these assumptions. So why can't we be making this assumption? Because it's based on a very simplistic view of what it means to request (or call) something. In concurrent and distributed systems (of which this is an example) we can't guarantee this. Some calls are added to queues and have to wait their turn before being executed, some messages get lost as they travel around the network.

This means that if a drive hasn't started a test - because either it hasn't got the message yet, or because the message is lost and our drive is idle, then we might, as part of our recovery decide to move it, and BANG. That's when our assumption bites us. There isn't a test running, due arguably to an error on our part that as good vendors, we want to recover from, but we can't. Because an incorrect assumption is being made. From this stems a lot of work to test these drives, an investigate how they could be unplugged mid test, all because of a bad assumption in the code.

So what is the solution? We took the approach of telling the user exactly what has gone wrong, if a sensor is in the wrong state this is all we report, we don't use it to make a guess at the failure mode. You won't see an unexpected sensor value being reported as "Failed to actuate gripper" because the problem might just be a faulty sensor. There is obviously a lot of debate about what a good or bad error message is, I will save this topic for another post.