Work at a large company with a massive code base that has evolved over many years, and you eventually have to engage in Software Archaeology.
Software archaeology is the process of trying to understand software systems that are critical to your own system, yet poorly documented and poorly understood, or not understood at all.
Imagine you have some module that you are dependent on that has worked for years, and then a bug is uncovered.
You have to go and learn what exactly the module does. By the time you look at the code, the original authors and their children have all left the company, or even if they haven’t, it’s been years since they worked on the code… Sometimes it’s been years since they looked at code, period, now being managers or directors or vice-presidents…
At times you can feel like you’re one of those modern explorers violating some tribal rules by exploring in areas that are forbidden. And if you have to modify the code, you wonder if you are like Indiana Jones about to remove the gold idol…
The problem isn’t understanding the structure of the code; software is software. The problem is understanding the intent of the code: understanding where choices were made that were well reasoned, and where choices were made that were expedient. And the problem isn’t even the software that you have to inspect, but that it’s part of a broader sea of software that the ancients wrote that is equally opaque and mysterious.
And the real problem is that static analysis is fine, but what you really need is to understand how the running system behaves. How it uses memory, how it uses the CPU, how the data structures grow and shrink, what the heartbeat of the system is…
At SGI in the late ’90s I did some compiler research as a master’s student to try and address this specific problem. In particular, how do you infer things like locking hierarchies in a multi-million line code base when the original authors of the code have since left? Just reading the code is insufficient. Knowing that locks are taken is important, but you also need to understand things about contention, frequency of locks, interplay between systems that are so widely separated as to be mysterious…
When I tried to solve the problem, I looked at using compilers to go and insert code everywhere where something looked like a lock and then use the testing infrastructure to find the locks and their hierarchies…
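To make the idea concrete, here is a minimal sketch of recording lock order at run time. It is not the compiler-based instrumentation described above; it assumes you can wrap each lock by hand. The names `LockOrderRecorder` and `TracedLock`, and the lock names in the usage note, are hypothetical and invented for illustration.

```python
import threading
from collections import defaultdict


class LockOrderRecorder:
    """Records 'outer was held while inner was acquired' pairs (hypothetical helper)."""

    def __init__(self):
        self._held = threading.local()      # per-thread stack of held lock names
        self._edges = defaultdict(int)      # (outer, inner) -> times observed
        self._guard = threading.Lock()      # protects _edges across threads

    def _stack(self):
        if not hasattr(self._held, "stack"):
            self._held.stack = []
        return self._held.stack

    def acquired(self, name):
        stack = self._stack()
        with self._guard:
            for outer in stack:
                self._edges[(outer, name)] += 1
        stack.append(name)

    def released(self, name):
        stack = self._stack()
        # Drop the most recent acquisition of this lock from the thread's stack.
        for i in range(len(stack) - 1, -1, -1):
            if stack[i] == name:
                del stack[i]
                break

    def edges(self):
        with self._guard:
            return dict(self._edges)

    def inversions(self):
        """Pairs of locks that were observed nested in both orders."""
        with self._guard:
            pairs = set(self._edges)
        return sorted((a, b) for (a, b) in pairs if (b, a) in pairs and a < b)


class TracedLock:
    """A threading.Lock wrapper that reports to a recorder (hypothetical helper)."""

    def __init__(self, name, recorder):
        self.name = name
        self._lock = threading.Lock()
        self._recorder = recorder

    def __enter__(self):
        self._lock.acquire()
        self._recorder.acquired(self.name)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._recorder.released(self.name)
        self._lock.release()
        return False
```

Run your workload with locks such as `TracedLock("inode", rec)` and `TracedLock("buffer", rec)`, and `rec.edges()` gives you the observed nesting and how often each pair occurred, while `rec.inversions()` lists pairs taken in both orders, which are the candidates for a broken hierarchy. It only sees the code paths your workload actually exercises.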
And then I discovered that the code of the ancients, because it was always working, has no tests.
Doing your run-time analysis involves figuring out how to test things that always worked. And then you uncover not just one bug, but hundreds… Or are your tests broken because you don’t understand the code? And you wonder as you play with this mysterious gadget… Are you about to destroy the world? Or save it? How many of these bugs are real? And how many of them are real, but other pieces of software have worked around those bugs, making the whole system work?
And what will be the blow-back of fixing them…
And while you sit there with the svn commit about to change the structure of the code of the ancients, you wonder if your hubris is about to bite you in the ass… How could code that has worked for so long be broken?
And your management team looks at you like those tribal leaders who looked at the explorers, with suspicion and doubt and fear. And you can hear them telling the village youth to arm themselves and kill the interloper before he causes too much harm… Or are they like the hot woman or man begging the evil villain to not destroy the world, or asking the hero:
Are you sure this is going to work?
No fear, the ancients were humans, just like you, you tell them… And that bag of sand is about the same weight as the idol…
Run Indy! Run!