I'd like to understand how such a pinnacle of human design and engineering came ...

bumby · on Dec 8, 2021

Software assurance within NASA is often a low priority, if not just an afterthought. Many project/program managers are from the hardware side (e.g., mechanical, electrical, or industrial engineers) and don't always give appropriate weight to software in terms of its ability to contribute to failures.

wolverine876 · on Dec 8, 2021

What is that based on? And has NASA had many software failures? Their missions seem incredibly reliable, especially considering how far beyond the bleeding edge they operate.

bumby · on Dec 9, 2021

>What is that based on?

Personal experience as a civil servant with NASA. Often, quality aspects take a back seat when schedule pressure is high. It’s the whole reason their current safety and mission assurance org became a separate entity after Columbia.

>And has NASA had many software failures?

There's been quite a few high profile ones like Mars Climate Orbiter and more recently with CST-100. In the case of the latter, there were clear process gaps, and the issues should have been caught if the software assurance procedures had actually been followed. Note that the CST-100 case was a non-crewed test of a vehicle meant to take people into orbit; presumably this is the highest threshold for quality procedures. There are many, many more lower-profile ones that don't get talked about, even within the agency, dating back to Gemini.

>especially considering how far beyond the bleeding edge they operate.

I know that's the public perception. Much of the missions are bleeding edge (because there's very little incentive for anyone except the government to do them), but you might be surprised about how they don't always use state-of-the-art tech. Now some of that is by prudent choice because you'll often prefer tried-and-true over bleeding-edge, but some of it is just because of complacency.

wolverine876 · on Dec 9, 2021

Thanks for sharing your perspective. I still don't grasp what you saw there:

> There's been quite a few high profile ones like Mars Climate Orbiter and more recently with CST-100.

Mars Climate Orbiter failed in the 1990s. Isn't CST-100 (Boeing Starliner) still in development? (I don't remember the latest.)

I don't doubt you have something in mind; I'd be interested in knowing what it is. Is it the idea of seeing sausage being made - it's messy and doesn't fit the public image? That I would completely expect. IMHO, that's true of every organization; failure is succumbing, success is delivering regardless.

> Much of the missions are bleeding edge ... but you might be surprised about how they don't always use state-of-the-art tech.

It's not about state-of-the-art tech, but addressing novel engineering problems far beyond where there is mature, developed knowledge. Helicopters on Mars, interstellar probes - it's incredible to me that these things reliably succeed. Will Europa Clipper not reach its destination? Is anyone even worried? They succeed, it seems to me, at a much higher rate than run-of-the-mill corporate software projects.

bumby · on Dec 9, 2021

A few things:

1) I don't think 'run-of-the-mill corporate software projects' makes a good comparison. For one, the NASA projects referenced don't come about very often so there's a relatively small sample size. Secondly, they have a completely different risk profile and naturally have different quality expectations. NASA does quite a lot of home-grown CRUD apps, but nobody really hears about them because they just aren't that interesting. A fair number of them are really, really bad. Like, no real configuration management or change control, no test plans or reports, no unit testing, changes made on the fly to production systems, extremely antiquated development platforms, etc. One reason is that software assurance resources are limited, so naturally NASA focuses them on the high-risk/high-profile projects, meaning it's easier for business software to fall through the cracks. Another reason is that NASA work is predominately contractor supported, meaning much of the work is done by the lowest bidder. It's much easier to be the lowest bidder if you keep your costs low; sometimes this results in lower quality developers. Why pay a high developer salary when I can grab someone who wrote VBA 25 years ago and I can just give them the title of 'Lead Developer'? When there's a lack of oversight and downward pressure on costs, this is more common than anyone would hope. My hunch is that if you did an apples-to-apples comparison of 'run-of-the-mill corporate software projects' with similar NASA business applications, you might be surprised at which is better.

2) I know the Mars Climate Orbiter is an old example but I referenced it because it's the kind of glitch that people immediately understand without any background knowledge. One group was writing software in metric engineering units and another group used Imperial engineering units, obviously causing a hand-off/interaction error.
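A minimal sketch of that kind of hand-off error, using hypothetical function names and made-up numbers (nothing here is taken from the actual flight or ground software): one side reports an impulse in pound-force seconds and the other side consumes the bare number as if it were newton-seconds.

```python
# Conversion factor: newton-seconds per pound-force second.
LBF_S_TO_N_S = 4.4482216152605

def ground_report_impulse_lbf_s():
    """Hypothetical ground tool: returns a thruster impulse in lbf*s."""
    return 100.0  # made-up value

def nav_consume_impulse_n_s(impulse_n_s):
    """Hypothetical navigation code: expects N*s and uses the value as given."""
    return impulse_n_s

raw = ground_report_impulse_lbf_s()

# Bug: the bare number crosses the interface with its units silently assumed.
wrong = nav_consume_impulse_n_s(raw)

# Fix: convert explicitly at the hand-off.
right = nav_consume_impulse_n_s(raw * LBF_S_TO_N_S)

# How badly the unconverted value understates the real impulse.
underestimate_factor = right / wrong
```

Neither function is wrong on its own, which is why this sort of bug tends to be found by interface-level checks rather than by unit tests on either side.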

So let's dive a bit more into CST-100 since that's a newer project. I'll try to be careful to only talk about stuff that's publicly available. Yes, CST-100 is still in development. But the demo flight which caused concerns about software quality was meant to be essentially an end-to-end check that the system was ready for use; it was meant to be one of the last checkboxes, meaning there is really no reason for glaring errors. In that demo, it couldn't reach the space station because it burned too much fuel. It burned too much fuel because it incorrectly sync'd its mission timer with the launch rocket and the spacecraft was confused about where it was in the mission timeline. Later when troubleshooting on the ground, they found additional software errors where propellant valves were incorrectly mapped within the software (meaning when they try to command thruster A, they inadvertently fire thruster B). This latter issue potentially could have been catastrophic by causing CST-100 to crash into the space station when docking [1]. To a certain extent, they were lucky the first software error prevented a docking scenario. Troubleshooting all of this is a big reason why the system is still "in development" despite the first demo mission being nearly two years ago. Pay attention to wording in these types of press releases; a lot of times you won't find the word error or failure. They'll instead put some PR spin on it and call it an 'anomaly' or 'unexpected test result' when in reality it's a red flag for lack of quality. If you hear those terms, there's a good chance there was some procedural check that should have been conducted but wasn't. In the example of that demo, ground simulations on a high-fidelity system could have caught the errors before the mission. There are requirements already on the books for this [2].
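As a rough illustration of the kind of check that catches a mis-mapped valve, here is a sketch that compares a software command table against an independent reference (say, one derived from wiring documentation). The thruster names, valve names, and both tables are invented for the example; this is not how the actual CST-100 software is structured.

```python
def find_mapping_errors(software_map, reference_map):
    """Return {thruster: (software_valve, reference_valve)} for every mismatch."""
    errors = {}
    for thruster, expected_valve in reference_map.items():
        actual_valve = software_map.get(thruster)  # None if the entry is missing
        if actual_valve != expected_valve:
            errors[thruster] = (actual_valve, expected_valve)
    return errors

# Hypothetical reference table, maintained independently of the software.
reference_map = {"thruster_A": "valve_1", "thruster_B": "valve_2"}

# Hypothetical software table with two entries swapped.
software_map = {"thruster_A": "valve_2", "thruster_B": "valve_1"}

errors = find_mapping_errors(software_map, reference_map)
```

A table comparison like this only helps if the reference really is independent; an end-to-end run on representative hardware is what confirms that commanding thruster A actually fires thruster A.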

It's not just about peeling back the curtain and seeing how the sausage is made, it's more about an organization having high-minded goals where they have requirements to a certain standard of work, but in practice they often turn a blind eye to those standards. It's akin to someone who worked for Google in the "don't be evil" days feeling like the company wasn't living up to that mantra.

3) A small nuance. Many of the robotic, non-human-rated missions that get in the news are Jet Propulsion Laboratory projects. JPL does fantastic work, but they are quasi-NASA and are actually managed by Caltech. As such, they follow different rules than NASA and there are actually only a handful of true civil servant NASA employees at JPL. NASA of course supports those missions, but they are a bit of a different animal. I believe Europa Clipper falls into that category.

[1] https://www.space.com/boeing-starliner-2nd-software-glitch-p...

[2] https://swehb.nasa.gov/display/SWEHBVC/SWE-073+-+Platform+or...

wolverine876 · on Dec 9, 2021

I have no doubt that NASA's 'business' software is like everyone else's. It would be a waste to invest in high assurance, high talent development for the HR and bug tracking systems.

> It's not just about peeling back the curtain and seeing how the sausage is made, it's more about an organization having high-minded goals where they have requirements to a certain standard of work, but in practice they often turn a blind eye to those standards.

My perspective: Few people live up to the high-minded goals; we're human. We achieve great things when, after experiencing humanity, we don't despair but keep our faith and enthusiasm for those goals. When the founders of the US wrote the Declaration of Independence, they were not naive - they had lots of experience of humanity (including their own), much worse than what we know. Yet they believed in something higher, beyond themselves. If they didn't, if NASA didn't, we'd still be living in an early modern monarchy and not flying to Jupiter.

Thanks for sharing yours! TIL a few things.

bumby · on Dec 9, 2021

Perhaps I’m jaded, but I think there’s a difference between aiming for a high goal but missing because we’re human vs. deliberately aiming short because it’s easier. I saw a lot of the latter, like refusing to learn how to tune PID parameters to manage system dynamics or saying they don’t want to run certain tests because they would be expected to fix any problems they uncover(!).

The lowering of standards is particularly troublesome when higher standards are contractually obligated. There’s a sad phrase I heard at high levels, the “NASA salute”, which is basically shrugging one’s shoulders to say “yeah, I know I’m supposed to do that, but I also know I won’t be held accountable if I don’t.”

beebmam · on Dec 8, 2021

The fact that a project as profoundly important as the James Webb telescope only has an $11 billion budget is staggering to me.

BatFastard · on Dec 8, 2021

It had a $1.5 billion budget, $9.5 billion in overruns.

nickff · on Dec 8, 2021

Well, it started off with a $0.5 billion budget, and was supposed to be launched about 14 years ago...