Workarounds vs workthroughs

Today I had dental implant surgery. The procedure typically takes an hour. I don’t want to go into great, gory detail, but an implant is a titanium tooth-root substitute that is inserted into the jawbone after a hole is drilled for it. The first part of the procedure is the drilling: a narrow pilot hole is drilled, then widened through a succession of six passes with successively larger drill bits. Screwing in the implant then completes the procedure.

When the drill machine was powered on in a pre-surgery test, it would work for a couple of seconds, then halt with an ERR 04 code (drill overheat fault) on the LED display. The nurse informed me that the machine had just started acting up, but they needed it to fail more frequently so they could give the repair technicians enough information. Well, today was their lucky (and my unlucky) day. After some experimentation and repeated faults, the staff figured out that if they carefully cycled the power and waited long enough, chances were the drill would restart and work for a while; waiting long enough seemed to clear the fault most of the time. Keeping a foot on the foot pedal and operating the drill smoothly seemed to prevent it from faulting with an ERR 09 (foot pedal fault). They informed the surgeon, and he and the staff experimented with the operation of the drill for several minutes before starting the procedure.

Even though I might have preferred to reschedule my implant, the team went ahead (without conferring with me). What was I thinking??? What would’ve happened if, after the third drilling, the machine had stopped functioning? Oh, I shouldn’t forget to mention that a technician was charged with power-cycling the machine whenever it failed, cuing the surgeon when to restart drilling.

OK, admit it. I’m sure you’ve operated some machinery which occasionally fails. We’re all familiar with rebooting computers to clean things up. And I’ve been driving my 11-year-old Volvo around for several months now, trying to diagnose why it occasionally won’t start. I’ve finally figured out that if I switch on the ignition while jiggling the shift lever, I can always get it to start; now that I know how to reliably correct the problem, my mechanic says he can easily isolate what’s broken and needs fixing.

I started out my software career as an evaluation engineer. From experience, I know that until you find a way to reliably cause a fault, it is difficult to report a bug that anyone is willing to listen to. Intermittent, apparently random failures are the worst kind. Only when you can reliably produce a failure can you even attempt to isolate the problem. Long-term garbage collection bugs or slow memory leaks are really nasty. But golly! When end users encounter intermittent software failures they typically plunge ahead looking for workarounds. Rarely do users want to isolate a problem if they can find a workaround. They’re on task, and not particularly interested in troubleshooting software. When a physical device acts up, people typically act the same way. In hindsight, I probably should’ve halted the procedure before it started and scheduled my implant for another day. But they (and I) didn’t want to. I was goal oriented. I’ll be damned if I wanted to go in twice!! And they seemed confident that they could finish the procedure and seemed unconcerned about the intermittent drill malfunction. (I’m wondering what their backup plan was.) Maybe today I really was lucky, because in spite of faults, there weren’t any catastrophic failures.

But back to considering device faults. I’ve always wanted the ability to manually override a device’s fault response behavior when I suspect a faulty sensor. Or at least have a way of running self-diagnostics or something instead of being forced to “jigger a solution”. Cycling power seems like such a hack. What if the faulty device doesn’t restart and I’m in the middle of an important task? What if I am willing to take the risk of continuing to operate the device because the consequences of it not restarting are worse than continuing on with a suspected faulty sensor? Shouldn’t a person be allowed to be in the decision loop in this case? Devices shouldn’t just shut off with an ERR code. I’d much prefer a user interface where I’m allowed to initiate a workthrough (e.g. ignoring a suspected fault) instead of being forced to initiate a potentially problematic workaround (cycling power). The fault lights on my car’s dashboard work this way (I can proceed and ignore them at my own peril). Perhaps if the drill had really been overheated, a workthrough should’ve been prevented. But then the determined surgeon would’ve just cycled power anyway. I’m probably not going to change how people design devices by raising these issues. But I’d be interested in reactions to the idea of designing to allow for workthroughs instead of forcing workarounds.
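To make the design idea concrete, here is a minimal sketch of such a fault policy. Everything in it is hypothetical (the class, the fault codes are modeled loosely on the ERR 04 and ERR 09 codes from the story): faults whose triggering sensor is suspect can be worked through with explicit operator consent, while genuinely dangerous conditions still force a halt.

```python
from enum import Enum

class FaultAction(Enum):
    HALT = "halt"            # forced shutdown: operator must power-cycle (workaround)
    WORKTHROUGH = "ignore"   # operator accepts the risk and keeps operating

class DeviceController:
    """Hypothetical fault policy: keep the operator in the decision loop
    for faults that may just be a flaky sensor."""

    # Faults where the sensor itself is suspect may be overridden;
    # clearly dangerous conditions may not.
    OVERRIDABLE = {"ERR_09_FOOT_PEDAL"}

    def __init__(self, confirm):
        # confirm: callback asking the operator to explicitly accept the risk
        self.confirm = confirm

    def on_fault(self, code):
        if code in self.OVERRIDABLE and self.confirm(code):
            return FaultAction.WORKTHROUGH  # operator chose to work through it
        return FaultAction.HALT             # e.g. a real overheat still halts

# An operator who accepts the risk works through a suspected pedal fault...
ctrl = DeviceController(confirm=lambda code: True)
print(ctrl.on_fault("ERR_09_FOOT_PEDAL"))  # FaultAction.WORKTHROUGH
# ...but an overheat fault forces a halt no matter what the operator says.
print(ctrl.on_fault("ERR_04_OVERHEAT"))    # FaultAction.HALT
```

The design choice worth noting is that the override set is a whitelist: the safe default is still to halt, and a workthrough is only offered where the fault signal itself is plausibly wrong.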

Can you really estimate complexity with use cases?

I visited with some folks last week who failed to get as much leverage from writing use cases as they’d hoped. In the spirit of being more agile, when they adopted use cases they also streamlined their other traditional development practices. So they didn’t gather and analyze other requirements as thoroughly as they had in the past. Their use cases were high level (sometimes these are called essential use cases) and lacked technical details, detailed descriptions of process variations, or descriptions of the complex information that needed to be managed by the system. But their problem domain is complex and varied, prickly, and downright difficult to implement in a straightforward way (and use cases written at this level of detail failed to reveal this complexity). Because of this lack of detail, they found it difficult to use these use cases to estimate the work involved in implementing them. In short, these use cases didn’t live up to their expectations.

Were these folks hoodwinked by use case zealots with an agile bent? In Writing Effective Use Cases, Alistair Cockburn illustrates a “hub-and-spoke” model of requirements. A figure in his book puts use cases in the center of a “requirements wheel” with other requirements being spokes. Cockburn states that, “people seem to consider use cases to be the central element of the requirements or even the central element of the project’s development process.”

Putting use cases in the center of all requirements can lull folks into believing that if they have limited time (or if they are trying to “go agile”) they’ll get a bigger payoff by focusing only on the center. And indeed, if you adopt this view of “use cases as center”, it’s easy to discount other requirements perspectives as being less important. If you only have so much time, why not focus on the center and hope the rest will somehow fall into place? If you’re adopting agile practices, why not rely upon open communication between customers (or product owners or analysts) and the development team to fill in the details? Isn’t this enough? Maybe, maybe not. Don’t expect to get early, accurate estimates by looking only at essential use cases. You’d be just as well off reading tea leaves.

Cockburn proposes that, “use cases create value when they are named as user goals and collected into a list that announces what the system will do, revealing the scope of a system and its purpose.” He goes on to state that, “an initial list of goals will be examined by user representatives, executives, expert developers, and project managers, who will estimate the cost and complexity of the system starting from it.” But if the real complexities aren’t revealed by essential use cases, naive estimates based on them are bound to be inaccurate. The fault isn’t with use cases. It’s in the hidden complexity (or perhaps naive optimism or dismissal of suspected complexity). A lot of special-case handling and a deep, complex information model make high-level use case descriptions a deceptive tool for estimation, unless everyone on the project team is brutally honest about them being just a touchpoint for further discussion and investigation. If the devil is in the details, the only way to make reasonable estimates is to figure out some of those details and then extrapolate estimates based on what is found. So domain experts who know those details had better be involved in estimating complexity. And if technical details are going to introduce complexity, estimates that don’t take those into account will also be flawed. Realistically, better estimates can be had if you implement a few core use cases (those that are mutually agreed upon as being representative and that prove out the complexities of the system) and extrapolate from there. But if details aren’t explored, or if you don’t perform some prototyping in order to make better estimates, you won’t discover the real complexities until you are further along in development.

I’m sure there are other reasons for their disappointment with use cases, but one big reason was a misguided belief that high-level use cases provide answers instead of just being a good vehicle for exploring and integrating other requirements. In my view, use cases can certainly link to other requirements, but they represent just a usage view of a system: an important view for many systems, but not the only one. If they are a center, they are just one of many “centers” and sources of requirements.