Debugging an archaeological find

When confronted with a bug in a piece of software whose authors are lost in the mists of time, and whose internal workings are opaque and mysterious, debugging can be a challenge.

But that’s why we became engineers, we like challenges.

The first problem is to understand what the nature of the bug is. Typically you get some crash that has a signature that suggests that something went wrong in the machinery of the ancients.

Our first reaction, because we are human is to cry out:

That’s not true. That’s impossible!

How can the machinery of the ancients be broken? It never breaks!

The first challenge is to understand the nature of the breakage. Sitting in a sea of memory with no clue what is going on… you have to begin the process of sifting through the code and the memory to understand what exactly has happened. Not what the bug is, but what the sequence of events that occurred that produced a crash.

The goal is to create a hypothesis that explains how the crash occurred, not the why, but the how.

There is one strategy that is interlinked. The first part is to start reading code and analyzing core files, and looking to see if similar bugs got reported in the past and got swept under the rug. The second part is to desperately and frantically try and reproduce the bug.

Essentially what you are trying to do is gather experimental evidence to guide the analysis of the software and the ancient bug reports.

Once you have figured out how the bug occurred eg: memory got corrupted and that resulted in this set of instructions to execute, the next step is to begin the process of how.

This turns out to be both trickier and easier. Easier because you now know how the crash happens. Trickier because you know need to understand an increasingly larger scope of the system …

In the case of corruptions there are at least two possibilities: structural or wild.

A structural corruption is caused by the code that manipulates the data structure … This is easy because the problem is localised. Tricky because as an archaeoligst you may need to go several layers away from where the crash occurred requiring more analysis to follow the code all the way to the source of the flaw.

More inelegantly, there is a core data structure that is busted, there are other data structures that are related to that data structure, either as inputs or as outputs, being able to see how the dependent data structures look as compared to the corrupted one can guide your investigation. You are looking for the source of the corruption and seeing the input and output data structures can tell you where to start looking depending on their state. Sometimes there are no input and output data structures, just lots of dependent ones but the principal holds.

In this case you want the testing to get narrower and narrower to find the bug faster and in a more focused way. As you get a better understanding of the code and the processes and the internal data structures the testing goes from being – use a thermometer to use an MRI…

Wild corruptions are caused by two unrelated pieces of code causing each other harm. Some random piece of code is causing the ancients code to get fubar’ed. And if the code is vast and large, understanding where that can happen can be hard but not impossible. To attack that problem you a useful approach is to do a brute-force attack on the code to see what combination of features executing in parallel or in isolation can cause the bug. Your goal is to find the places that are running and see how you can find the one piece of code that is doing the wrong thing. This is why reproduction of the bug remains the single most important process in debugging archaeology.

The nice outcome of this process is that in the end the understanding of the mysterious ancient technology is revealed. And with that comes a moment of personal satisfaction. You are now one with the ancients.

And then a desire to rewrite their code overcomes you… and then you too become a new ancient one….

Share this:

Like this:

Leave a ReplyCancel reply