Tuesday, May 5, 2015

Integer Overflow Bug in Boeing 787

Edgar Alvarez (comments):

“A Model 787 airplane that has been powered continuously for 248 days can lose all alternating current electrical power due to the generator control units simultaneously going into failsafe mode,” the FAA said in a statement warning of the flaw. “We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane.” Boeing, for its part, is aware of the problem and has reset the power on 787 Dreamliners currently in service.

Matt McGuire:

It all has to do with Integer math overflow. It could potentially happen on any hardware/software platform. It’s usually a call to something like GetElaspedTime() that will return the amount of milliseconds since the device powered on. If it returns a 32 bit integer (most embedded processors) the maximum is 248 days and some change.

6 Comments

Nope. The math is wrong.

@Not Looks to me like McGuire should have said centiseconds instead of milliseconds.

It works out to 5 ms.

@Not Please be more specific.

Look to PaulCogito’s post for the closest right answer.

@Not I guess you mean that if it counts in 5 ms increments, an unsigned 32-bit counter would overflow in 248 days (and change). Others were thinking of a signed 10 ms counter. PaulCogito’s comment has some more general info:
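
The competing figures in this thread are easy to check. The sketch below is only arithmetic on hypothetical counter configurations (the function name and the list of candidates are illustrative assumptions, not confirmed details of the 787 firmware). It computes how many days a 32-bit tick counter lasts for a given tick period:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

def days_until_overflow(bits: int, tick_ms: int, signed: bool) -> float:
    """Days until a counter that increments once per tick runs out of range."""
    # A signed counter loses one bit to the sign.
    usable_bits = bits - 1 if signed else bits
    return (2 ** usable_bits) * tick_ms / MS_PER_DAY

# Hypothetical configurations discussed in the comments above.
candidates = [
    ("signed 32-bit, 1 ms tick", 32, 1, True),
    ("unsigned 32-bit, 1 ms tick", 32, 1, False),
    ("unsigned 32-bit, 5 ms tick", 32, 5, False),
    ("signed 32-bit, 10 ms tick", 32, 10, True),
]

for label, bits, tick_ms, signed in candidates:
    print(f"{label}: {days_until_overflow(bits, tick_ms, signed):.2f} days")
```

A millisecond tick gives roughly 24.9 days (signed) or 49.7 days (unsigned), which is why the milliseconds explanation does not fit. A signed counter of 10 ms ticks and an unsigned counter of 5 ms ticks both come out to about 248.55 days.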

Now this is known it has ceased to be a problem. Aircraft of any size have to have regular maintenance, and a large passenger jet will have a grey wall of binders holding these procedures. Making sure that the generator control system gets rebooted periodically will go alongside oil changes, instrument calibration, pressurisation checks and all the other things that have to be done to keep the aircraft in service. This is just one more item on the list.

A bigger question is: how do we stop this kind of bug in the future? In my professional life I’ve encountered a very similar bug where an OS counter was monitored by the application, but the application didn’t handle the 32-bit roll-over properly and crashed after a couple of hundred days. So this isn’t unique by any means.
It’s easy to say “stupid programmer forgot about the overflow case”. But it’s not that simple. Maybe the programmer didn’t think of the overflow, or maybe they did but got it subtly wrong. If you are going to rely on human beings to get it 100% right in a complex system you are going to be disappointed. That’s why we have rigorous testing on something like this. But a time-based overflow bug like this is actually quite hard to test; putting the thing on a rig for a few years to see if any overflow bugs happen is not feasible. You would have to have a specific test case that pokes 0xFFFFFF00 into the relevant counter to see what happens. Maybe we need to add those test cases to our lists.
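
A test along the lines suggested above might look like the following minimal sketch. Everything in it is hypothetical: `FakeTickCounter` stands in for whatever 32-bit counter the real system exposes, and `naive_elapsed` and `elapsed_ticks` are illustrative names for a broken and a roll-over-safe way to measure an interval. Python integers do not wrap on their own, so the 32-bit behaviour is simulated with an explicit mask.

```python
MASK32 = 0xFFFFFFFF

class FakeTickCounter:
    """Hypothetical stand-in for a free-running unsigned 32-bit tick counter."""

    def __init__(self, initial: int) -> None:
        self.value = initial & MASK32

    def advance(self, ticks: int) -> None:
        # Wrap around exactly as a 32-bit hardware register would.
        self.value = (self.value + ticks) & MASK32

    def read(self) -> int:
        return self.value

def naive_elapsed(now: int, start: int) -> int:
    """Plain subtraction: goes negative once the counter has rolled over."""
    return now - start

def elapsed_ticks(now: int, start: int) -> int:
    """Modulo-2**32 subtraction: correct across a single roll-over."""
    return (now - start) & MASK32

# Poke a value just below the roll-over point into the counter.
counter = FakeTickCounter(0xFFFFFF00)
start = counter.read()
counter.advance(0x200)  # crosses 0xFFFFFFFF and wraps to 0x00000100
now = counter.read()

print(naive_elapsed(now, start))  # negative: the kind of value that breaks timeouts
print(elapsed_ticks(now, start))  # 512, the true number of ticks that passed
```

The masked subtraction only works while the measured interval is shorter than one full counter period, which is an assumption the calling code has to guarantee.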

Merely using a 64 bit Int may not be a solution if you are interfacing with the operating system, and the OS designer used 32 bits.
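
One common workaround, when the OS only offers a 32-bit counter, is to widen it in the application by accumulating the differences between successive reads. The sketch below is an illustration under stated assumptions, not a description of any particular OS: `ExtendedCounter` and `read_raw` are hypothetical names, and the approach assumes the counter is polled at least once per 32-bit wrap period. In C or a similar language, the running total would be a 64-bit unsigned integer; Python integers are unbounded, so no explicit width is needed here.

```python
from typing import Callable

MASK32 = 0xFFFFFFFF

class ExtendedCounter:
    """Widens a wrapping 32-bit counter by summing deltas between polls."""

    def __init__(self, read_raw: Callable[[], int]) -> None:
        # read_raw is a hypothetical function returning the raw 32-bit OS counter.
        self._read_raw = read_raw
        self._last = read_raw() & MASK32
        self._total = 0

    def poll(self) -> int:
        """Return total ticks since construction; must be called at least once per wrap."""
        raw = self._read_raw() & MASK32
        # Modulo-2**32 difference stays correct across a single roll-over.
        self._total += (raw - self._last) & MASK32
        self._last = raw
        return self._total

# Simulated raw readings that cross the 32-bit roll-over point.
readings = iter([0xFFFFFF00, 0x00000100, 0x00000300])
extended = ExtendedCounter(lambda: next(readings))
print(extended.poll())  # 512
print(extended.poll())  # 1024
```

If polling ever stalls for longer than one wrap period, the missed roll-over goes undetected, so this moves the problem rather than removing it.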
