Image by NASA on The Commons is licensed under The Commons - cropped to fit ratio

37/52 - Bugs on an Epic scale

I was soldiering on through my backlog when I got to this 2014 article - yes, I did say backlog - called "The Typo that Destroyed a NASA Rocket" (from Priceonomics), about the Mariner I:

On July 22, 1962, at 9:20 PM, the Mariner I sat idly on its platform, ready to make history. After investing years of construction, calculation, and funding, NASA had high hopes that its rocket would successfully conduct a flyby survey of Venus, thus shifting the Space Race’s momentum back to the home front. In every way, it was poised to set a space travel precedent.

But when the rocket embarked, it was clear there’d be no cause for celebration: less than 5 minutes into flight, Mariner I exploded, setting back the U.S. government $80 million ($630 million in 2014 dollars). The root cause for this disaster? A lone omitted hyphen, somewhere deep in hand-transcribed mathematical code.

OK, so it wasn’t really a hyphen; it was something more tangible that actually explains things:

The error had occurred when a symbol was being transcribed by hand in the specification for the guidance program. The writer missed the superscript bar (…) by which was meant “the nth smoothed value of the time derivative of a radius R”. Since the smoothing function indicated by the bar was left out of the specification for the program, the implementation treated normal minor variations of velocity as if they were serious, causing spurious corrections that sent the rocket off course. It was then destroyed by the Range Safety Officer.

Ouch
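To get a feel for why a missing smoothing step matters, here is a minimal Python sketch. It is in no way the Mariner guidance code: the readings, the nominal rate, the tolerance and the moving average standing in for the smoothing function are all assumptions of mine, made up for illustration.

```python
def moving_average(values, window):
    """Trailing moving average; uses a shorter window for the first few values."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def count_corrections(rates, nominal, tolerance):
    """Count readings that stray far enough from nominal to trigger a correction."""
    return sum(1 for rate in rates if abs(rate - nominal) > tolerance)


# Hypothetical rate readings: a steady value of 100 with normal minor jitter.
raw = [100.0, 103.0, 97.0, 104.0, 96.0, 102.0, 98.0, 103.0, 97.0, 100.0]
nominal = 100.0
tolerance = 2.5

# Without smoothing, the jitter alone looks like a problem worth correcting.
raw_corrections = count_corrections(raw, nominal, tolerance)

# With smoothing (assumed here: a 4-sample moving average), the jitter averages out.
smoothed_corrections = count_corrections(moving_average(raw, 4), nominal, tolerance)

print(raw_corrections, smoothed_corrections)
```

With these made-up numbers the unsmoothed readings trigger six corrections and the smoothed ones trigger none, which is roughly the "spurious corrections" story from the quote, in miniature.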

People make mistakes, and extremely complex systems are prone to minuscule mistakes that sometimes propagate into full-blown system failure.

As a developer, I am 100% aware of how this can happen, and of how one single mistake - which can happen for any number of reasons - sometimes explodes into systems grinding to a halt, customer support teams dealing with an avalanche of complaints, money not flowing in and plain nastiness.

However, as a software developer working on a software tool, I have a luxury that these guys never had: the ability to run the system a million times before deploying, at absolutely negligible cost. I can spot an error, fix it and rerun the system in seconds; I can have a server pick up my changes and test them six ways from Sunday. Heck, if bad stuff goes live, 99 times out of 100 I’ll merely annoy someone, mostly myself for being an idiot.

You can’t really do that with rockets, now can you? :) And fuckups don’t just annoy people: they burn - literally - through millions of Moneys (pick a currency of your choice) and destroy years of work. That is unbelievable pressure, and it requires extraordinary procedures to eliminate as many issues as possible. If I’m ever in awe of what Humanity can do, it’s when I see projects of this complexity and magnitude being successfully pulled off.

In a world that increasingly looks to the destruction and dismantling of projects as a quick solution - pardon my closeness to the whole Brexit debacle - whenever you see something like Rosetta, a project that spans decades, being pulled off, you have to feel pride and awe.

Of course, even when things go wrong, really really wrong, you can still take some pleasure in how dumb some things are, like that time we sent a probe to Mars and one system was using the metric system while another was using the imperial system. 15 years later we can have a laugh, can’t we?

Twenty-four hours prior to orbital insertion, calculations placed the orbiter at an altitude of 110 kilometers; 80 kilometers is the minimum altitude that Mars Climate Orbiter was thought to be capable of surviving during this maneuver. Post-failure calculations showed that the spacecraft was on a trajectory that would have taken the orbiter within 57 kilometers of the surface, where the spacecraft likely disintegrated because of atmospheric stresses.

The primary cause of this discrepancy was that one piece of ground software supplied by Lockheed Martin produced results in a United States customary unit, contrary to its Software Interface Specification (SIS), while a second system, supplied by NASA, expected those results to be in SI units, in accordance with the SIS. Specifically, software that calculated the total impulse produced by thruster firings calculated results in pound-seconds. The trajectory calculation software then used these results – expected to be in newton-seconds – to update the predicted position of the spacecraft.

Ouch!
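For the developers in the room, the pound-seconds versus newton-seconds mix-up boils down to a bare number crossing a boundary without its unit. Here is a minimal Python sketch of that, plus one possible way to guard against it. The `Impulse` type and the sample value are hypothetical, mine alone, and have nothing to do with the actual flight software; the only real-world fact in there is the conversion factor (1 pound-force is 4.4482216152605 newtons).

```python
from dataclasses import dataclass

# 1 pound-force is defined as exactly 4.4482216152605 newtons.
NEWTON_SECONDS_PER_POUND_SECOND = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Total impulse, always stored in newton-seconds (SI). Hypothetical type."""
    newton_seconds: float

    @classmethod
    def from_pound_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary, so nothing downstream has to guess.
        return cls(value * NEWTON_SECONDS_PER_POUND_SECOND)

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)


# Made-up thruster firing, reported by the first system in pound-seconds.
reported = 10.0

# The bug: the bare number is read as if it were already in newton-seconds.
misread = Impulse.from_newton_seconds(reported)

# The guard: the caller has to state the unit, so the conversion actually happens.
correct = Impulse.from_pound_seconds(reported)

# How much impulse the misreading loses, as a factor.
ratio = correct.newton_seconds / misread.newton_seconds
print(ratio)
```

The misread value comes out about 4.45 times too small, and every trajectory update built on it inherits the error. Forcing callers through `from_pound_seconds` or `from_newton_seconds` is just one design choice; the point is that the unit travels with the number.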