Charting the Unknown: An Observationalist’s Log

Problems Before the Real Problem: The First Lessons of Apollo 13
noreply@blogger.com (Robert) · 2026-04-13 · via Charting the Unknown: An Observationalist’s Log

Setting the stage

Between 1968 and 1972 NASA sent nine Apollo missions to the Moon. Reaching the Moon required a decade of work by 400,000 people, billions of dollars and countless moving parts. Of the nine missions to the Moon, eight were spectacular successes and one was a spectacular near-disaster.

Apollo 13 launched on April 11th, 1970 and was meant to be the first Apollo mission dedicated to exploring the Moon, after Apollo 11 had made the first landing and Apollo 12 had improved on it with a pinpoint landing.

On April 13th, a little over two days after launch, one of the two large oxygen tanks in the Service Module exploded, crippling the spacecraft on its way to the Moon. Over the next few days, NASA’s engineers worked feverishly with the astronauts to overcome seemingly insurmountable problems and bring them back to Earth, safe and sound.

   

What stands out about Apollo 13, its problems and their solutions, is not how much the astronauts, engineers and managers improvised to solve unexpected problems but rather the reverse: how readily their existing procedures could be adapted to the unexpected.

Aptly, Apollo 13’s command module was named Odyssey, meaning a long voyage usually marked by many changes of fortune.

Even before Odyssey began its odyssey, it had an unusual start when it became the first mission where the flight crew was disrupted just before launch.

Three's a team

Flying a spacecraft is a complicated activity; so many things happen simultaneously that there are more buttons to press and procedures to follow than a single person can handle at any one time. While Mercury, the first spacecraft, had been simple enough for one astronaut to handle, Apollo was a much larger and more complex beast.

The three components of the Apollo spacecraft: the Command Module (CM) carrying the astronauts, the Service Module (SM) with the supplies and main engine for the flight to the Moon, and the Lunar Module (LM) for the landing itself. During the launch, the LM was shielded within the Saturn V rocket. (NASA)
 

Instead of expecting a single astronaut to control the spacecraft from beginning to end, the work was divided among three astronauts: the Commander, the Command Module Pilot and the Lunar Module Pilot. Note that no astronaut is a mere co-pilot ;)

Each astronaut could now specialize in his part of the mission (while remaining competent in the others), and the astronauts could also support each other. After the lightning strike that crippled Apollo 12 at launch, the astronaut who flipped the “SCE to Aux” switch on the Command Module panel was Alan Bean: despite being the Lunar Module Pilot, he had the easiest access to the critical switch.

Taking the idea a step further, in addition to having the three astronauts support each other during the flight, NASA also designated a “backup astronaut” for each one. The backup astronaut underwent nearly the same amount of training as the astronaut designated to fly and was sent to represent him at planning meetings (always fun!). Like an understudy, the backup astronaut was available to replace the prime astronaut at a moment’s notice, but nothing short of a crippling injury would ever make an astronaut give up his flight. While a few astronauts had been forced to cancel their flights and allow their backups to fly, this had always been the result of serious medical conditions (Deke Slayton had heart arrhythmia and Michael Collins had spinal surgery).

 The crew of Apollo 13: Lovell, Swigert, Haise. (NASA)

In the case of the ill-fated Apollo 13, Command Module Pilot Ken Mattingly had been inadvertently exposed to rubella (German measles) just before the flight and was removed from the crew for medical reasons. Despite the usually mild effects of the disease, no doctor was going to take a chance on some exotic and unexpected side effect just as the astronauts were about to land on the Moon!

Just three days before the scheduled launch, flight commander James Lovell and Lunar Module Pilot Fred Haise set out on a last-minute training regimen with backup Command Module Pilot Jack Swigert. One of the few inaccuracies of the 1995 movie Apollo 13 was its suggestion that the backup astronaut was less capable than the prime astronauts. The purpose of the last-minute training was not to check whether Swigert knew how to fly the spacecraft (which he unarguably did) but to see how the entire crew functioned together as a unit.

The last-minute change to the Apollo 13 crew and the way the crew functioned together are examples of how the astronauts themselves were part of the resilience and reliability of the mission.

One must often trade cost for reliability. Training six astronauts instead of three takes more time, money and other resources, but having backups available means that you can recover when the unexpected occurs.

And now the machine

While the flesh-and-blood astronauts were the most critical component of the flight (the whole point of the flight was for a man, not a machine, to walk on the Moon), the entire 110-meter (363-foot) stack was built out of millions upon millions of highly reliable engines, pipes, connectors, switches, pumps, gauges, valves, computer chips and more.

The second stage of the Saturn V rocket. Note the five J-2 engines, which supply a total of 1,150,000 pounds of thrust. (NASA)

During the first few minutes of flight, a failure in one of the second-stage engines caused the rocket to gyrate wildly, and the wayward engine was shut down seconds before the flight would have been aborted.
As it happens, the engines had been designed with these types of failures in mind and could “pick up the slack”.

The remaining four healthy engines continued firing for longer than planned and made up for the defective engine.
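
As a rough back-of-the-envelope sketch of what “picking up the slack” means, assume every engine produces the same constant thrust and that the stage only has to deliver the same total impulse (thrust multiplied by time) as planned. Both are simplifications (a real rocket gets lighter as it burns fuel and is fighting gravity), and the function name and the numbers below are illustrative only, not Apollo 13 flight data:

```python
def extended_burn_seconds(total_engines, failed_engines, seconds_remaining):
    """Seconds the surviving engines must burn to replace the planned remaining burn.

    Simplifying assumptions: every engine has the same constant thrust, and
    the stage only needs to deliver the planned total impulse (thrust x time).
    """
    healthy = total_engines - failed_engines
    if healthy <= 0:
        raise ValueError("no healthy engines left to pick up the slack")
    # Planned impulse: total_engines * thrust * seconds_remaining.
    # The healthy engines deliver healthy * thrust per second.
    return seconds_remaining * total_engines / healthy


# Illustrative numbers only: five engines, one lost with
# 100 seconds of planned burn remaining.
print(extended_burn_seconds(5, 1, 100))  # 125.0
```

Under these assumptions, losing one engine out of five costs the remaining four a 25% longer burn, which is only possible if the stage carries enough margin to begin with.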

Not even ten minutes into its flight, Apollo 13 had validated the engineering practices of building reliable components by having backups for everything and anything - both man and machine.

In the modern development of reliable software services, we use many patterns and techniques to achieve the reliability we require. While there are many similarities between getting a man on the Moon and reaching your chosen website, software allows for abstractions that have no equivalent on a flight in space. For example, if there’s a temporary failure between your phone or laptop and the server you’re trying to reach, the local application can “invisibly” retry the request until it succeeds or until it decides that the failure is critical. With any luck, you won’t even notice the issue beyond a brief delay in bringing up the screen. And while your developers and engineers are (almost certainly) not flying in space, you still train backups for the on-call support rotation.

Of course, there are differences: Apollo had to survive with what it launched, while software systems can deploy fixes mid-flight.
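
The “invisible” retry described above might look something like the following minimal sketch. All of the names here (TransientError, call_with_retry, flaky_fetch) are hypothetical and chosen for illustration; the exponential backoff with jitter is one common way to space out retries, not the only one. Any error that is not marked as transient is treated as critical and passed straight to the caller.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: treat the failure as critical
            # Double the wait each time, plus jitter so that many
            # clients do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


# Usage: a fake request that fails twice and then succeeds.
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("connection reset")
    return "page loaded"


# sleep is stubbed out so the demo runs instantly.
print(call_with_retry(flaky_fetch, sleep=lambda seconds: None))  # page loaded
```

The caller sees only the final result; the two failed attempts show up as nothing more than a short delay.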

Now, having overcome Rubella before the flight even began and a failed engine during launch, the Apollo 13 astronauts and the NASA engineers in Houston could relax and enjoy a routine flight to the Moon, couldn’t they?

What else could go wrong?

Watch the movie or stay tuned for more articles to find out!