Error Handling Done Right, Part 1

So I had this idea that error handling has been done badly for years (maybe forever (cf. awk’s “bailing out at line 1”)).

There are a couple camps here.

The first camp worships at the church of error codes and says, well, just as with BASIC line numbers, you define some error codes with some gaps in them—100 should be enough—and then you shoehorn every possible error condition into what amounts to a really poor compression scheme.

“I couldn’t find the X because the Y was unavailable even though the Z was responding” becomes, simply, lossily, 404.
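
To make the "lossy compression" point concrete, here is a minimal Java sketch. Everything in it (the LossyCodes class, the Failure names, the 504) is invented for illustration. It only shows that mapping failures to codes is many-to-one, so the code cannot be turned back into what actually happened.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: several distinct failures collapse into one numeric code.
final class LossyCodes {

  // Invented failure descriptions, purely for illustration.
  enum Failure {
    X_MISSING_BECAUSE_Y_UNAVAILABLE(404),
    X_NEVER_EXISTED(404),
    X_DELETED_YESTERDAY(404),
    Z_TIMED_OUT(504);

    final int code;

    Failure(final int code) {
      this.code = code;
    }
  }

  // Going from a failure to its code is trivial...
  static int encode(final Failure failure) {
    return failure.code;
  }

  // ...but going back only yields every failure that shares the code.
  static List<Failure> decode(final int code) {
    final List<Failure> candidates = new ArrayList<>();
    for (final Failure failure : Failure.values()) {
      if (failure.code == code) {
        candidates.add(failure);
      }
    }
    return candidates;
  }

  public static void main(final String[] args) {
    // Three candidates come back, with no way to tell which one happened.
    System.out.println(decode(404));
  }
}
```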

This is the way that error handling has been done from the late 1960s till (through?) now. Wanna find out what really happened? Start digging, bearing in mind that a “not found”-type error is what you’re really looking for.

The second camp says rather officiously: well, clearly that’s not enough; we need to augment this poor lossy compression scheme with some other poor lossy compression schemes. That will solve all our problems. And so you get ANSI and X/Open dueling on exactly what a SQLState is. Anyway, this extra cryptic number, which never tells you what standards body coined it, conjoined with the equally inscrutable error code number, paired with the vendor’s own set of error codes, makes everything crystal clear. Right? No?

Look, we can all describe the process that we go through when we get a Java stack trace (I’m a Java guy; the rest of this is about Java error handling). It goes something like this:

  1. The developer gets a stack trace (because, er, someone else’s code died, not his).
  2. The developer says, “WTF?” and then begins combing through the piles of garbage, scanning from the top towards the bottom. The top is generally irrelevant; the closer he gets to the bottom, the more interested he becomes.
  3. At this point the details of the stack trace are entirely irrelevant. They may very well become relevant later, but for now the developer wants to know: was it a SQLException? Was it a JNDI lookup exception? Was it one of those wrapping the other kind? Are there other things that stand out as the elevator goes lower? Scroll, scroll, scroll.
  4. The developer stumbles across the problem, which may very well not be the root cause, but which I bet you is near the bottom. We’re getting closer to the actual carnage.
  5. The developer may notice some other accreted state hanging off the Throwable chain as he descends; often without realizing it, he files it away (“Blah blah blah…foreign key…oh, hmm; interesting….”).
  6. After this quick skim, which typically takes a second or two, the developer can construct a pseudo-literate, pseudo-native-language sentence about what went wrong. “Ahhhh, OK, the frobnicator blew up because it couldn’t find the data source during user login—huh, that’s funny; the user name has some Unicode characters in it—which happened with that crufty old SSO ‘solution’ we purchased from YoYoDyne a while back; I bet they might have something to say here; Joe told me he thought they didn’t use foreign keys…hmm.”

Let’s look at step six.

Step six has the message that anyone remotely technical is going to need to begin to figure out how to fix the problem.

You’ll note that the message in step six did not come from one of the Throwables in the stack trace directly, nor any of its embedded messages. It came from bits of information scattered throughout the stack.
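
As a rough sketch of what "bits of information scattered throughout the stack" means in code, the following walks a Throwable's cause chain with the standard getCause() method and gathers each message along the way. The ChainMessages class, its messages(...) method, and the sample exceptions are my own hypothetical names and data, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// A sketch only: class and method names here are hypothetical, not from any library.
final class ChainMessages {

  // Collects "Type: message" for every Throwable in the cause chain, outermost first.
  static List<String> messages(final Throwable top) {
    final List<String> messages = new ArrayList<>();
    Throwable current = top;
    // The depth cap guards against a pathological, cyclic cause chain.
    for (int depth = 0; current != null && depth < 100; depth++) {
      if (current.getMessage() != null) {
        messages.add(current.getClass().getSimpleName() + ": " + current.getMessage());
      }
      current = current.getCause();
    }
    return messages;
  }

  public static void main(final String[] args) {
    // An invented chain, loosely modeled on the login story in step six.
    final Throwable root = new IllegalArgumentException("foreign key violation");
    final Throwable middle = new IllegalStateException("data source lookup failed", root);
    final Throwable top = new RuntimeException("login failed", middle);
    messages(top).forEach(System.out::println);
  }
}
```

No single line of that output is the step-six sentence; the sentence only appears once you read the lines together.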

You’ll also note that step six does not feature an error code, though perhaps an error code of some vendor variety was involved (typically in step five). Nor does it feature a so-called disambiguating error code, or an error code that is intended to disambiguate the disambiguating error code. That’s in part because we all ignore error codes as a matter of course: they’re largely useless, being bad, lossy compression schemes that lose the very information you need. But I digress (sorry!).

You’ll notice, too, that step six makes reference to some state that probably sat higher up the Throwable chain (the fact that the user name had some Unicode characters in it).

So rather than read off any one message or code, the developer pattern matched: he matched against the Throwable chain and, once a pattern he didn’t even know he was looking for turned up, constructed the appropriate message for that chain (in his head).

So why don’t error messages work this way? Why do we make humans do this?

Because we’re lazy, that’s why, and also because at the point that we throw a Throwable we have no idea what’s going on in the system as a whole. It’s only when we catch a Throwable that we have the tools available to figure out what happened.

See, the thrower only knows that his little piece of the puzzle has failed. Usually this is because some inscrutable bit of machinery beneath him has failed. So he dutifully does the following:

try {
  frobnicator.frobnicate(); // illustrative call into the layer beneath
} catch (final FrobnicationException kaboom) {
  throw new BorkificationException("Encountered problem frobnicating", kaboom);
}

Ye gods. No wonder technologists accrue a reputation for being hopelessly out of touch. By the time this thing has passed up through other layers, we get some godawful stack trace like:

com.yourstartup.EnbratzificationException: Error in enbratzifying
    at com.yourstartup.Enbratzifier.enbratzify(
    [and so on]
Caused by: com.yourstartup.BorkificationException: Borkification died
    at com.yourstartup.Borkifier.borkify(
    [and so on]
Caused by: com.yourstartup.FrobnicationException: Encountered problem frobnicating
    at com.yourstartup.Frobnicator.frobnicate(
    [and so on]
Caused by: com.yourstartup.CaturgiationException: caturgiating didn't work
    at com.yourstartup.Caturgiator.caturgiate(
    [and so on]

Quick! What happened? Oh, no! The whole system…it’s f***ed! We’re all going to die!

But, see, you know. You honed this ability a long time ago.

Because if you’re experienced, it will take you less than a quarter of a second to realize that the Enbratzifier couldn’t borkify because the Frobnicator exploded while caturgiating. And that will probably start you immediately thinking about frobnication in the context of enbratzification, but only when the Borkifier is properly configured, and how come the…. Congratulations; you’ve just engaged in some sophisticated pattern matching, and you’ve come up with a useful error message.

Computers are really good at pattern matching. So how come we don’t use pattern matching for error handling?
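
For a taste of what that could look like, here is a minimal sketch that checks whether a given sequence of exception types appears, in order, somewhere along a cause chain (other types may sit in between). The exception classes are stand-ins for the ones in the trace above. ChainPattern, matches(...), and sampleChain() are hypothetical names of mine, and the in-order matching rule is just one simple choice among many.

```java
// A sketch only: ChainPattern and matches(...) are hypothetical names, not an existing API.
final class ChainPattern {

  // Stand-ins for the exception types in the stack trace above.
  static class CaturgiationException extends Exception {
    CaturgiationException(final String message, final Throwable cause) {
      super(message, cause);
    }
  }

  static class FrobnicationException extends Exception {
    FrobnicationException(final String message, final Throwable cause) {
      super(message, cause);
    }
  }

  static class BorkificationException extends Exception {
    BorkificationException(final String message, final Throwable cause) {
      super(message, cause);
    }
  }

  static class EnbratzificationException extends Exception {
    EnbratzificationException(final String message, final Throwable cause) {
      super(message, cause);
    }
  }

  // True if the given types occur in the cause chain in this order,
  // possibly with other Throwables in between.
  static boolean matches(final Throwable top, final Class<?>... pattern) {
    int next = 0; // index of the pattern element we are still looking for
    Throwable current = top;
    // The depth cap guards against a cyclic cause chain.
    for (int depth = 0; current != null && next < pattern.length && depth < 100; depth++) {
      if (pattern[next].isInstance(current)) {
        next++;
      }
      current = current.getCause();
    }
    return next == pattern.length;
  }

  // Rebuilds the chain from the trace above, root cause first.
  static Throwable sampleChain() {
    final Throwable c = new CaturgiationException("caturgiating didn't work", null);
    final Throwable f = new FrobnicationException("Encountered problem frobnicating", c);
    final Throwable b = new BorkificationException("Borkification died", f);
    return new EnbratzificationException("Error in enbratzifying", b);
  }

  public static void main(final String[] args) {
    final Throwable top = sampleChain();
    // Enbratzification, then Frobnication, then Caturgiation: present in that order.
    System.out.println(matches(top, EnbratzificationException.class,
        FrobnicationException.class, CaturgiationException.class));
    // Same types, wrong order: no match.
    System.out.println(matches(top, FrobnicationException.class,
        EnbratzificationException.class));
  }
}
```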

Part of the reason is architectural. We’ve all been trained that a layer is supposed to be ignorant of the layers above it, and should know only about the layer beneath it.

But when you surfed through the stack trace above, you didn’t pay any heed to these architectural principles—nor should you have. When errors are involved, architecture goes out the window. That’s because by the time the error has been encountered, the architecture has failed: something died, so you need to go pick up the bodies, and you’ll be damned if some ivory tower notion of isolation is going to stand in your way. Good for you.

So you busted through the layers like a hot knife through butter. You dug down the stack, glossed over and yet somehow retained bits of state (messages) involved in lower layers, and by mining this sludge you came to your conclusion.

You need an error handling library that does the same thing. If a big hairy Throwable chain matches a sophisticated pattern, then the library should let you construct a message based off the information encoded throughout the chain, and in its very structure.
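
To give a feel for the shape such a library might take, here is a deliberately tiny sketch. It is not the design this series arrives at. A rule pairs a pattern of exception types with a function that builds a message from the Throwables that matched, and the first rule that matches wins. Every name here (ChainRules, Rule, describe, LoginFailed, DataSourceMissing, the sample data) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// A sketch only: all names are hypothetical, and real matching would need to be far richer.
final class ChainRules {

  // Invented exception types; each carries one piece of state as its message.
  static class LoginFailed extends Exception {
    LoginFailed(final String userName, final Throwable cause) {
      super(userName, cause);
    }
  }

  static class DataSourceMissing extends Exception {
    DataSourceMissing(final String dataSourceName, final Throwable cause) {
      super(dataSourceName, cause);
    }
  }

  // A pattern of types plus a function that turns the matched Throwables into a message.
  static final class Rule {
    final Function<List<Throwable>, String> message;
    final List<Class<?>> pattern;

    Rule(final Function<List<Throwable>, String> message, final Class<?>... pattern) {
      this.message = message;
      this.pattern = Arrays.asList(pattern);
    }

    // Returns the Throwable matched by each pattern element, in order,
    // or empty if the chain does not contain the pattern.
    Optional<List<Throwable>> match(final Throwable top) {
      final List<Throwable> hits = new ArrayList<>();
      Throwable current = top;
      // The depth cap guards against a cyclic cause chain.
      for (int depth = 0; current != null && hits.size() < pattern.size() && depth < 100; depth++) {
        if (pattern.get(hits.size()).isInstance(current)) {
          hits.add(current);
        }
        current = current.getCause();
      }
      return hits.size() == pattern.size() ? Optional.of(hits) : Optional.empty();
    }
  }

  // Builds a message from the first matching rule; falls back to the top Throwable's toString().
  static String describe(final Throwable top, final List<Rule> rules) {
    for (final Rule rule : rules) {
      final Optional<List<Throwable>> hits = rule.match(top);
      if (hits.isPresent()) {
        return rule.message.apply(hits.get());
      }
    }
    return String.valueOf(top);
  }

  public static void main(final String[] args) {
    final Throwable top = new LoginFailed("j\u00f6rg", new DataSourceMissing("jdbc/users", null));
    final Rule rule = new Rule(
        hits -> "Login for user '" + hits.get(0).getMessage()
            + "' failed because data source '" + hits.get(1).getMessage()
            + "' could not be found.",
        LoginFailed.class, DataSourceMissing.class);
    System.out.println(describe(top, Arrays.asList(rule)));
  }
}
```

The point of the sketch is the division of labor: the throwers stay ignorant of one another, while the rule, which lives where the chain is caught, is free to reach through every layer.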

Stay tuned for part 2, in which our hero descends into non-deterministic finite automata theory to bring this to fruition. It’s also where our hero takes a lot of Advil.