When it comes to error messages and error recovery, our team is often asked, usually unknowingly, to make an assumption about the current state of the system and how we can recover from that assumed state. In the everyday world, most of us know to avoid making assumptions; as the popular saying goes, they make asses of us. The same is true in the software world, and the cost of a bad assumption can be just as big.
For one of our customers, unplugging a hard drive in the middle of a test can be destructive. As a result, any drive that is pulled out mid test needs to go through an extra-stringent failure analysis, which takes time, personnel, and money. On top of that, investigations are launched to determine how a drive could be pulled out mid test in the first place. How do we know this has happened? Our customer logs a helpful error message to tell us:
Drive unplugged while test in progress. Stopping all tests...
A fairly good error message, yes? Not only does it tell us what went wrong, but it also tells us what action we will be taking. "If only all error messages looked like that!" I hear you cry.
What isn't clear is that the error "Drive unplugged while test in progress" is making an assumption. You can't see it, because it's buried in the code.
The assumption in question seems, at first glance, to be perfectly acceptable. The assumption is this:
"If we have made a call to start the test, the test will be running" - Again, this assumption isn't very clear in the code itself, as if often the case in imperative programming, it's all about changing the state. Booleans are sent in certain places and read in others and all of this makes it quite difficult to spot these assumptions. So why can't we be making this assumption? Because it's based on a very simplistic view of what it means to request (or call) something. In concurrent and distributed systems (of which this is an example) we can't guarantee this. Some calls are added to queues and have to wait their turn before being executed, some messages get lost as they travel around the network.
This means that if a drive hasn't actually started a test (either because it hasn't received the message yet, or because the message was lost and the drive is sitting idle), then we might, as part of our recovery, decide to move it, and BANG. That's when our assumption bites us. There isn't a test running; the unplug is arguably an error on our part that, as good vendors, we want to recover from, but we can't, because an incorrect assumption is being made. From this stems a lot of work to test these drives and to investigate how they could have been unplugged mid test, all because of a bad assumption in the code.
So what is the solution? We took the approach of telling the user exactly what has gone wrong: if a sensor is in the wrong state, that is all we report; we don't use it to make a guess at the failure mode. You won't see an unexpected sensor value reported as "Failed to actuate gripper", because the problem might just be a faulty sensor. There is obviously a lot of debate about what makes a good or bad error message; I will save that topic for another post.
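As an illustration of that principle, here is a hedged sketch, again with invented names rather than our real reporting code, of reporting the observation itself instead of an inferred failure mode:

```python
EXPECTED_GRIPPER_SENSOR = "closed"

def check_gripper(sensor_value):
    # Report only what was observed. Translating this into "Failed to actuate
    # gripper" would be a guess: the gripper may be fine and the sensor faulty.
    if sensor_value != EXPECTED_GRIPPER_SENSOR:
        raise RuntimeError(
            f"Gripper sensor reported '{sensor_value}', "
            f"expected '{EXPECTED_GRIPPER_SENSOR}'"
        )
```

The caller, or the person reading the log, can then decide what the unexpected reading means, instead of being handed a conclusion the code can't actually justify.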