Russell Hoffman ("Host"), High Tech Today
Peter G. Neumann ("PN"), Author and host of the Internet Risks Forum
Host: ...And you're listening to High Tech Today with your host, Russell Hoffman. My guest today is Peter G. Neumann. Dr. Neumann is the author of "Computer Related Risks." He's the moderator of the Internet Risks Forum and works for SRI International in Menlo Park, California.
The book is published by Addison-Wesley, 1995, and in it he describes, oh, I would say hundreds of errors that have occurred--some causing major problems, others hardly any, but all worth reporting nonetheless. The book is about trying to figure out how to reduce the number of errors that computers cause. So, I'd like to welcome you to the show today, Dr. Neumann.
PN: I'm delighted to be here, thank you.
Host: And, let's see, where should I start with questions... A lot of the book is about minor errors that cause massive headaches and why don't we talk a little bit about what it is that's different about computers that allows them to have such a domino effect, that's not usually seen in life?
PN: Well, if you're building a bridge, for example, and one rivet falls out, it doesn't have much effect on the bridge. In a computer program if you change one bit, it can have cataclysmic effects in the sense that the program can do something totally different from what it's intended to do. So you're dealing with sort of the difference between the, oh, the conventional rubber band world, where you can keep stretching the rubber band a little bit more till it breaks, whereas if you take a computer program analogy, you bend it ever so slightly, or twist it, or just touch it, and the whole thing falls apart.
The problem with computer programming is that you have no real idea of whether the program you've written is going to do the right thing under all possible circumstances at all possible times, and if you look at the massive collection of stuff in the book you see that there are an awful lot of cases where things didn't work the way they were supposed to.
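Neumann's contrast between the rubber-band world and the brittleness of software can be made concrete with a small sketch (illustrative code, not from the book): flip a single bit of a stored floating-point value and the value shifts by roughly ten orders of magnitude.

```python
import struct

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted."""
    b = bytearray(data)
    b[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(b)

# IEEE-754 double-precision encoding of 1.0
packed = struct.pack(">d", 1.0)

# Flip one bit in the exponent field
corrupted = flip_bit(packed, 1)

original = struct.unpack(">d", packed)[0]    # 1.0
damaged = struct.unpack(">d", corrupted)[0]  # 2**-32, about 2.3e-10
```

A loose rivet is noise in a bridge; one flipped bit in a program or its data can change a quantity by ten orders of magnitude, or turn one machine instruction into a different one entirely.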
Host: A lot of times it's a combination of errors that causes the problem.
PN: Well, there's a long litany of different types of things, that you'll find in the book. For example human error is always prevalent. Environmental things are always interesting. There are a lot of cases where the environment causes difficulties. For example in medical health we have the pacemaker interference problem. There have been at least two deaths that I know of where people with pacemakers have died because some stray bit of electromagnetic interference interfered with their pacemaker--changed the setting of the rate of the heartbeat, and wound up killing them.
There were five Blackhawk helicopters that crashed, that were due to--that were apparently due to electromagnetic interference. They finally added some shielding to the helicopter circuitry and that stopped that particular problem.
While I'm on that particular thing, the Sheffield was the British vessel in the Falklands War, and they had to turn off their own radar because their own communications were jamming their radar, and that was what let the missile through to blow up the ship.
There are a lot of cases where the environment itself is the cause, but in most cases it's programming error, it's design error. It's having the wrong concept of the system in the first place. In some cases you never should have built the system. You should have just realized that it never would work, and saved your money.
Still there?
...[Connection went down so the station took a break]...
Host: And, welcome back. You're listening to High Tech Today with
your host, Russell Hoffman and my guest is Peter G. Neumann, and we
were both knocked off the air by a phone system flaw! So Peter, why
don't you tell us a little about some of the problems that phone
systems tend to have!
PN: That's very interesting! (Laughs) It reminds me of the time I gave a talk in Washington, and while I was driving back to the airport, the entire Washington airport was wiped out electrically, the power went down, and the airport was closed, and it was a real mess!
Every time I seem to talk about a particular topic something nasty seems to happen! Maybe I'm the old Joe Btfsplk character from Li'l Abner!
In general, what happened I guess was a lightning strike on one of the telephone switches, and we both dropped out. Telephone systems have been taking a terrible beating lately from cable cuts--which have knocked down, for example, all three metropolitan New York airports--and from things like the Chicago Hinsdale fire, which caused massive computer and communication outages as well.
Lots of cable cuts. There have been maybe 15 different cable cuts where somebody actually runs a backhoe through the communications cable and downs airport communications, and computer systems, and stock exchanges and everything else.
There are a lot of programming problems, too, in telephone switching. There's the 1990 outage where AT&T LongLines was down for eleven hours, basically, where no long distance calls could go through--the result of a fairly small programming error that kind of multiplied itself rather dramatically.
The telephone infrastructure, and in general the communications infrastructures that we depend on in this country, are very delicate, even though they're designed for something like four hours' outage in 40 years. Things have not been going that well lately, and no matter how carefully you design something, you discover that there are fault modes that you hadn't anticipated. So in general, the real problem is: How do you design things that will work no matter what happens? And when you try to do that, you discover that you've usually left something out.
Host: That it's actually impossible. Now, I'm a computer programmer; for the last twelve years I've been writing in Assembler language, which some feel proves that I'm crazy!--
PN: Not at all.
Host: --But at any rate, I have come to the conclusion that every line of code that I write has at least three errors: The first one is a typo, and that's pretty easy to fix.
PN: (Laughs)
Host: The second one is what I would call a specific logic error, where the command itself is not the right command.
PN: Yep.
Host: And the third one is, once I've got the right command, it doesn't work in coordination with the other commands around it and with other commands hundreds of thousands of lines away from it.
PN: Yeah.
Host: I guess, talk a little bit about what computer programmers should--sort of, what philosophy they should follow to write better programs.
PN: Well, there's this wonderful quote from Einstein which says that everything should be as simple as possible, but no simpler. And we're in this mentality, I think, where we try to oversimplify everything. The first thing you've got to do is deal with the complexity of the problem that you're trying to program around, and accept the fact that it's not altogether simple. Once you've done that you say "Well, what are all the conditions that I have to anticipate? What are the things that could go wrong?"
Too often, somebody writes a little program that works just fine in isolation, and you've sort of put your finger on the problem: when you put it together with something else, it doesn't quite do what it's supposed to do.
The real beauty of this whole thing is to try and develop some sort of structure to the program that you're developing so that it has some simplicity to it if you look at it from a fairly abstract point of view, and that the details are down in the individual little pieces rather than in the interconnections.
As soon as you build something where all of the complexity is in the interconnections, it fails. Human systems do the same thing. Where you have to depend on a lot of people to get together within a very narrow time frame to do something that will work perfectly--you lose. Because it doesn't work that way. People are not capable of performing perfectly on very tight time schedules.
Well, computers don't either. The first shuttle launch is a fine example, where they discovered that the backup computer was not synchronized properly with the four primary computers, and they had to delay the launch for two days till they could find out what the problem was. A very, very subtle error, and it was very hard to find, even after they began to suspect what it was.
So we're in a funny position here, where when you try to build very large systems, you run into enormous difficulties. And yet, the universities tend to teach you--the trade schools or whatever tend to teach that if you can write a little program you're fine--you know everything you need to know. Well, it's much more than that--it's how you put it all together.
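Neumann's advice--simplicity at the abstract level, with the details down in the individual pieces rather than in the interconnections--can be sketched as follows (illustrative code with invented stage names, not anything from the book):

```python
# Each stage is self-contained: it takes a plain value and returns one.
# All complexity lives inside the pieces, where it can be tested in
# isolation; the interconnection layer stays deliberately trivial.

def parse(raw: str) -> list:
    """All parsing detail is confined here."""
    return [int(field) for field in raw.split(",") if field.strip()]

def validate(values: list) -> list:
    """All range-checking detail is confined here."""
    return [v for v in values if 0 <= v <= 100]

def summarize(values: list) -> float:
    """All arithmetic detail is confined here."""
    return sum(values) / len(values) if values else 0.0

def pipeline(raw: str) -> float:
    # The glue is dumb on purpose: no shared state, no timing
    # dependencies, nothing hidden in the interconnections.
    return summarize(validate(parse(raw)))
```

When the glue is this simple, changing one stage cannot silently break another through a hidden interaction--the failure mode Neumann describes for systems whose complexity lives in the interconnections.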
Host: And how you interface it to life. You had mentioned pacemakers earlier; my wife wrote a program to merge two pacemaker manufacturers' databases of pacemaker users.
PN: Wow--Fascinating.
Host: And the idea is that every three months they have to be sent a letter telling them that their pacemaker battery might be running down. And she refused to code an automatic name deletion program.
PN: Right. That is very wise.
Host: It had to be looked at specifically by a human being to compare, to make sure that nobody was improperly dropped.
PN: I think that's wonderful!
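The human-in-the-loop policy described above can be sketched in a few lines (a hypothetical reconstruction with invented field names, not the actual program): the merge adds and combines records automatically, but it never deletes anyone on its own--deletion candidates go into a queue for a person to confirm.

```python
def merge_patient_lists(db_a: dict, db_b: dict):
    """Merge two patient databases keyed by patient ID.

    Records are combined automatically, but the program never drops
    anyone by itself. Any record a source marks "inactive" is queued
    for a human to review, so nobody is improperly removed from the
    battery-warning mailing list.
    """
    merged = {}
    review_queue = []
    for pid in db_a.keys() | db_b.keys():
        a = db_a.get(pid, {})
        b = db_b.get(pid, {})
        merged[pid] = {**a, **b}  # kept until a person says otherwise
        if a.get("status") == "inactive" or b.get("status") == "inactive":
            review_queue.append(pid)  # a human confirms any deletion
    return merged, review_queue
```

The design choice is the point: the cheap, automatable step (merging) is automated, while the irreversible step (dropping a pacemaker patient from the warning list) is reserved for human judgment.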
Host: Yeah, she's good... I noticed in your book that you have about ten pages of errors in nuclear power plants and nuclear facilities--
PN: Yeah--
Host: --And you also have about ten pages of errors in space exploration and space activities--
PN: Yeah--
Host: --And one of the plans that they've had for getting rid of nuclear waste is to rocket it to the sun. Now it seems to me that we're talking here about a multiplication of errors. One really has to wonder whether or not this is a reasonable solution to a deadly problem that we've got on our hands right now.
PN: Yeah. It's interesting. It's certainly not directly computer related, but that's never stopped me from commenting on anything before!
The key problem here I think, is we have to look at the problem, whatever it is, in a very global perspective. If you look at nuclear power in the small, you say "Gee, it's cheap, it's clean, it's efficient, and it will solve our energy problems for the next, uh, how many thousand years." If you look at it in the large, you come to the conclusion that there is no solution to the waste disposal problem, there are serious long term health effects. Certainly the Chernobyl experience is not very reassuring. They're now talking about what, thirty thousand people who've died at this point? The number keeps rising, whatever it is.
Host: Yeah.
PN: It's gotten enormous. And so the real question is: are you optimizing something from your own pocketbook, or from your own personal perspective, or are you optimizing from the point of view of the planet or the universe or whatever?
Please distribute this document IN ITS ENTIRETY (as a raw HTML file or printed document). Please link to it rather than placing it on another server. Thank you.
The Animated Software Company
http://www.animatedsoftware.com
rhoffman@animatedsoftware.com
Last modified March 27th, 1997.
Webwiz: Russell D. Hoffman
Copyright (c) Russell D. Hoffman