Today I had dental implant surgery. The procedure typically takes an hour. I don’t want to go into great, gory detail, but an implant is a titanium tooth root substitute that is inserted into a hole drilled in the jawbone. The first part of the procedure involves drilling that hole. More precisely, a narrow hole is drilled and then widened through a succession of six drillings with successively larger drill bits. Screwing in the implant then completes the procedure.
When the drill machine was powered on in a pre-surgery test, it would work for a couple of seconds and then halt with an ERR 04 code (drill overheat fault) on the LED display. The nurse informed me that the machine had just started acting up, but they needed it to fail more frequently so they could give enough information to the repair technicians. Well, today was their lucky (and my unlucky) day. After some experimentation and repeated faults, the staff figured out that if they carefully cycled the power and waited long enough, chances were the drill would restart and work for a while. Waiting long enough seemed to clear the fault most of the time. Keeping a foot on the foot pedal and smoothly operating the drill seemed to prevent it from faulting with an ERR 09 (foot pedal fault). They informed the surgeon, and together they experimented with the operation of the drill for several minutes before starting the procedure.
Even though I might have preferred to reschedule my implant, the team went ahead (without conferring with me). What was I thinking??? What would’ve happened if, after the third drilling, the machine had stopped functioning? Oh, I shouldn’t forget to mention that a technician was charged with cycling the machine’s power whenever it failed and cueing the surgeon when to restart drilling.
OK, admit it. I’m sure you’ve operated some machinery that occasionally fails. We are all familiar with rebooting computers to clean things up. And I’ve been driving around my 11-year-old Volvo for several months now, trying to diagnose why it occasionally won’t start. (I’ve finally figured out that if I switch on the ignition while jiggling the shift lever, I can always get it to restart. Now that I know how to reliably correct the problem, my mechanic says he can easily isolate what’s broken and needs fixing.)
I started out my software career as an evaluation engineer. From experience, I know that until you find a way to reliably cause a fault, it is difficult to report a bug that anyone is willing to listen to. Intermittent, apparently random failures are the worst kind. Only when you can reliably produce a failure can you even attempt to isolate the problem. Long-term garbage collection bugs or slow memory leaks are really nasty. But golly! When end users encounter intermittent software failures, they typically plunge ahead looking for workarounds. Rarely do users want to isolate a problem if they can find a workaround. They’re on task, and not particularly interested in troubleshooting software. When a physical device acts up, people typically act the same way. In hindsight, I probably should’ve halted the procedure before it started and scheduled my implant for another day. But they (and I) didn’t want to. I was goal-oriented. I’ll be damned if I wanted to go in twice!! And they seemed confident that they could finish the procedure and seemed unconcerned about the intermittent drill malfunction. (I’m wondering what their backup plan was.) Maybe today I really was lucky because, in spite of the faults, there weren’t any catastrophic failures.
But back to considering device faults. I’ve always wanted the ability to manually override a device’s fault response behavior when I suspect a faulty sensor. Or at least have a way of running self-diagnostics or something instead of being forced to “jigger a solution”. Cycling power seems like such a hack. What if the faulty device doesn’t restart and I’m in the middle of an important task? What if I am willing to take the risk to keep operating the device because the consequences of it not restarting are worse than continuing on with a suspected false fault? Shouldn’t a person be allowed to be in the decision loop in this case? Devices shouldn’t just shut off with an ERR code. I’d much prefer a user interface where I’m allowed to initiate a workthrough (e.g. ignoring a suspected fault) instead of being forced to initiate a potentially problematic workaround (cycling power). The faults and fault lights on my car’s dashboard work this way (I can decide to ignore them at my own peril). Perhaps if the drill had really been overheated, a workthrough should’ve been prevented. But then the determined surgeon would’ve just cycled power anyway. I’m probably not going to change how people design devices by raising these issues. But I’d be interested in reactions to the idea of designing to allow for workthroughs instead of forcing workarounds.
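To make the workthrough idea a little more concrete, here is a minimal sketch in Python of a fault handler that keeps a person in the decision loop. Everything in it is hypothetical and for illustration only: the `Fault`, `Device`, and `Decision` names, the `overridable` flag, and the override log are my assumptions, not how the drill (or any real device) works. The sketch halts by default and lets the operator work through a fault only when the designer has marked that fault as safe to override.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    HALT = "halt"          # device stops; the fault must be resolved first
    CONTINUE = "continue"  # operator accepted the risk and works through


@dataclass(frozen=True)
class Fault:
    code: str          # e.g. "ERR 04"
    description: str
    overridable: bool  # hypothetical flag: may an operator work through it?


@dataclass
class Device:
    running: bool = True
    override_log: list = field(default_factory=list)

    def handle_fault(self, fault: Fault, operator_accepts_risk: bool) -> Decision:
        """Halt by default; continue only if the fault is marked overridable
        and the operator explicitly accepts the risk."""
        if fault.overridable and operator_accepts_risk:
            # Record the override so there is a trail for the repair technicians.
            self.override_log.append(fault.code)
            return Decision.CONTINUE
        self.running = False
        return Decision.HALT
```

In this sketch, a fault blamed on a suspect sensor could be created with `overridable=True`, so calling `handle_fault(fault, operator_accepts_risk=True)` returns `Decision.CONTINUE` and logs the fault code. A fault the designer considers genuinely dangerous, such as a confirmed overheat, would be created with `overridable=False` and always halts the device, no matter what the operator wants. Which faults deserve which setting is exactly the design question I’m raising.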