What’s Wrong With the Internet and How We Can Fix It: Interview With Internet Pioneer John Day

I appreciate very much the willingness of the editors at Ctrl-Z: New Media Philosophy to publish an updated, revised version of an interview I conducted with the computer scientist and Internet pioneer John Day via email. The published version is available at the link above and the original version is below.

The interview came about as a result of a chapter I’ve been working on for my “Other Networks” project, called “The Net Has Never Been Neutral.” In this piece, I try to expand the materialist bent of media archaeology, with its investment in hardware and software, to networks. Specifically, I’m working through the importance of understanding the technical specs of the Internet to figure out how we are unwittingly living out the legacy of the power/knowledge structures that produced TCP/IP. I also think through how the Internet could have been and may still be utterly different. In the course of researching that piece, I ran across fascinating work by Day in which he argues that “the Internet is an unfinished demo” and that we have become blind not only to its flaws but also to how and why it works the way it works. Below you’ll see Day expand specifically on five flaws of the TCP/IP model that are still entrenched in our contemporary Internet architecture and, even more fascinating, the ways in which a more sensible structure to handle network congestion (like the one proposed by the French CYCLADES group) would have made the issue of net neutrality beside the point. I hope you enjoy it, and many, many thanks to John for taking the time to correspond with me.

*

Emerson: You’ve written quite vigorously about the flaws of the TCP/IP model that go all the way back to the 1970s and about how our contemporary Internet is living out the legacy of those flaws. Particularly, you’ve pointed out repeatedly over the years how the problems with TCP were carried over not from the American ARPANET but from an attempt to create a transport protocol that was different from the one proposed by the French Cyclades group. First, could you explain to readers what Cyclades did that TCP should have done?

Day: There were several fundamental properties of networks the CYCLADES crew understood that the Internet group missed:

  • The Nature of Layers,
  • Why the Layers they had were there,
  • A complete naming and addressing model,
  • The fundamental conditions for synchronization,
  • That congestion could occur in networks, and
  • A raft of other missteps most of which follow from the previous 5, but some are unique.

First and probably foremost was the concept of layers. Computer scientists use “layers” to structure and organize complex pieces of software. Think of a layer as a black box that does something, but the internal mechanism is hidden from the user of the box. One example is a black box that calculates the 24-hour weather forecast. We put in a bunch of data about temperature, pressure and wind speed and out pops a 24-hour weather forecast. We don’t have to understand how the black box did it. We don’t have to interact with all the different steps it went through to do that. The black box hides the complexity so we can concentrate on other complicated problems for which the output of the black box is input. The operating system of your laptop is a black box. It does incredibly complex things but you don’t see what it is doing. Similarly, the layers of a network are organized that way. For the ARPANET group, BBN [Bolt, Beranek and Newman] built the network and everyone else was responsible for the hosts. To the people responsible for the hosts, the network of IMPs was a black box that delivered packets. Consequently, for the problems they needed to solve, their concept of layers focused on the black boxes in the hosts. So the Internet’s concept of layers was focused on the layers in the hosts, where their primary purpose was modularity. The layers in the ARPANET hosts were the Physical Layer, the wire; IMP-HOST Protocol; the NCP; and the applications, such as Telnet, and maybe FTP.[1] For the Internet, they were Ethernet, IP, TCP, and Telnet or HTTP, etc. as the application. It is important to remember that the ARPANET was built to be a production network to lower the cost of doing research on a variety of scientific and engineering problems.

The CYCLADES group, on the other hand, was building a network to do research on the nature of networks. They were looking at the whole system to understand how it was supposed to work. They saw that layers were more than just local modularity but a set of cooperating processes in different systems, and most importantly different layers had different scope, i.e. number of elements in them. This concept of the scope of a layer is the most important property of layers. The Internet never understood its importance.

The layers that the CYCLADES group came up with in 1972 were the following: 1) the Physical Layer – the wires that go between boxes. 2) The Data Link Layer – that operates over one physical medium and detects errors on the wire and in some cases keeps the sender from overrunning the receiver. But most physical media have limitations on how far they can be used. The further data is transmitted on them, the more likely errors become. So these links may be short. To go longer distances, a higher layer with greater scope exists over the Data Link Layer to relay the data. This is traditionally called the 3) Network Layer.

But of course, the transmission of data is not just done in straight lines, but as a network so that there are alternate paths. We can show from queuing theory that regardless of how lightly loaded a network is it can have congestion, where there are too many packets trying to get through the same router at the same time. If the congestion lasts too long, it will get worse and worse and eventually the network will collapse. It can be shown that no amount of memory in the router is enough, so when congestion happens packets must be discarded. To recover from this, we need a 4) Transport Layer protocol, mostly to recover packets lost due to congestion. The CYCLADES group realized this, which is why there is a Transport Layer in their model. They started doing research on congestion around 1972. By 1979, there had been enough research that a conference was held near Paris. DEC and others in the US were doing research on it too. Those working on the Internet didn’t understand that such a collapse from congestion could happen until 1986, when it happened to the Internet. So much for seeing problems before they occur.

Emerson: Before we go on, can you expand more on how and why the Internet collapsed in 1986?

Day: There are situations where too many packets arrive at a router and a queue forms, like everyone showing up at the cash register at the same time, even though the store isn’t crowded. The network (or store) isn’t really overloaded but it is experiencing congestion. However, in the Transport Layer of the network, the TCP sender is waiting to get an acknowledgement (known as an “ack”) from the destination that indicates the destination got the packet(s) it sent. If the sender does not get an ack in a certain amount of time, the sender assumes that packet and possibly others were lost or damaged and re-transmits everything it has sent since it sent the packet that timed out. If the reason the ack didn’t arrive is that it was delayed too long at an intervening router and the router has not been able to clear its queue of packets to forward before this happens, the retransmissions will just make the queue at that router even longer. Now remember, this isn’t the only TCP connection whose packets are going through this router. Many others are too. And as the day progresses, there is more and more load on the network with more connections doing the same thing. They are all seeing the same thing and contributing to the length of the queue. So while the router is sending packets as fast as it can, its queue is getting longer and longer. In fact, it can get so long and delay packets so much that the TCP sender’s timers will expire again and it will re-transmit again, making the problem even worse. Eventually, the throughput drops to a trickle.

As you can see, this is not a problem of not enough memory in the router; it is a problem of not being able to get through the queue. (Once there are more packets in the queue than the router can send before retransmissions are triggered, collapse is assured.)  Of course delays like that at one router will cause similar delays at other routers.  The only thing to do is discard packets.
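The feedback loop Day describes can be sketched as a toy simulation. This is only an illustration under stated assumptions, not a model of real TCP: the function name `goodput`, the parameters, the instantaneous acks, and the rule that a sender re-sends one timed-out packet (rather than everything sent since) are all simplifications introduced here. The queue is deliberately unbounded, so running out of router memory plays no part in what happens.

```python
from collections import deque


def goodput(offered_load, service_rate=10, timeout=5, ticks=300):
    """Unique packets delivered per tick through one FIFO router.

    Assumptions (illustrative only): the queue is unbounded, acks are
    instantaneous, and a sender re-sends a packet every `timeout` ticks
    until that packet has been delivered.
    """
    queue = deque()      # packet ids waiting at the router, duplicates included
    last_sent = {}       # undelivered packet id -> tick it was last (re)sent
    delivered = set()
    next_id = 0

    for now in range(ticks):
        # Senders offer new packets.
        for _ in range(offered_load):
            queue.append(next_id)
            last_sent[next_id] = now
            next_id += 1

        # Timers expire: re-send anything still unacknowledged.
        for pid, sent in list(last_sent.items()):
            if now - sent >= timeout:
                queue.append(pid)
                last_sent[pid] = now

        # The router forwards as fast as it can.
        for _ in range(min(service_rate, len(queue))):
            pid = queue.popleft()
            if pid not in delivered:      # a duplicate wastes its slot
                delivered.add(pid)
                del last_sent[pid]

    return len(delivered) / ticks


if __name__ == "__main__":
    for load in (5, 10, 12, 20):
        print(load, round(goodput(load), 2))
```

In this sketch, useful throughput tracks the offered load up to the router's service rate. Past that point the router is still forwarding at full speed, but a growing share of what it forwards is duplicates, so useful throughput falls below the service rate.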

What you see in terms of the throughput of the network vs load is that throughput will climb very nicely, then begin to flatten out as the capacity of the network is reached; then, as congestion takes hold and the queues get longer, throughput starts to go down until it is just a trickle. The network has collapsed. The Internet did not see this coming. Nagle warned them in 1984 but they ignored it. They were the Internet – what did someone from Ford Motor Company know? It was a bit like the Frank Zappa song, “It can’t happen here.” They will say (and have said) that because the ARPANET handled congestion control, they never noticed it could be a problem. As more and more IP routers were added to the Internet, the ARPANET became a smaller and smaller part of the Internet as a whole and it no longer had sufficient influence to hold the congestion problem at bay.

This is an amazing admission. They shouldn’t have needed to see it happen to know that it could. Everyone else knew about it and had for well over a decade. CYCLADES had been doing research on the problem since the early 1970s.  The Internet’s inability to see problems before they occur is not unusual.  So far we have been lucky and Moore’s Law has bailed us out each time.

Emerson: Thank you – please, continue on about what Cyclades did that TCP should have done.

Day: The other thing that CYCLADES noticed about layers in networks was that they weren’t just modules, and they realized this because they were looking at the whole network. They realized that layers in networks were more general because they used protocols to coordinate their actions in different computers. Layers were distributed shared state with different scopes. Scope? Think of it as building with bricks. At the bottom, we use short bricks to set a foundation, protocols that go a short distance. On top of that are longer bricks, and on top of that longer yet. So what we have is that the Physical and Data Link Layers have one scope; the Network and Transport Layers have a larger scope over multiple Data Link Layers. Quite soon, circa 1972, researchers started to think about networks of networks. The CYCLADES group realized that the Internet Transport Layer was a layer of greater scope still, one that operated over multiple networks. So by the mid-1970s, they were looking at a model that consisted of Physical and Data Link Layers of one small scope that are used to create networks with a Network Layer of greater scope, and an Internet Layer over multiple networks of greater scope yet. The Internet today has the model I described above for a network architecture of two scopes, not an internet of three scopes.

Why is this a problem? Because congestion control goes in that middle scope. Without that scope, the Internet group put congestion control in TCP, which is about the worst place to put it. Doing so thwarts any attempt to provide Quality of Service for voice and video, which must be done in the Network Layer, and it ultimately precipitated a completely unnecessary debate over net neutrality.

Emerson: Do you mean that a more sensible structure to handle network congestion would have made the issue of net neutrality beside the point? Can you say anything more about this? I’m assuming others besides you have pointed this out before?

Day: Yes, this is my point and I am not sure that anyone else has pointed it out, at least not clearly.  It is a little hard to see clearly when you’re “inside the Internet.”  There are several points of confusion in the net neutrality issue. One is that most non-technical people think that bandwidth is a measure of speed when it is more a measure of capacity.  Bits move at the speed of light (or close to it) and they don’t go any faster or slower. So bandwidth really isn’t a measure of speed. The only aspect of speed in bandwidth is how long it takes to move a fixed number of bits and whatever that is consumes capacity of a link. If a link has a capacity of 100Mb/sec and I send a movie at 50Mb/sec, I only have another 50Mb/sec I can use for other traffic. So to some extent, talk of a “fast lane” doesn’t make any sense. Again, bandwidth is a matter of capacity.

For example, you have probably heard the argument that Internet providers like Comcast and Verizon want “poor little” Netflix to pay for a higher speed, to pay for a faster lane. In fact, Comcast and Verizon are asking Netflix to pay for more capacity! Netflix uses the rhetoric of speed to wrap themselves in the flag of net neutrality for their own profit and to bank on the fact that most people don’t understand that bandwidth is capacity. Netflix is playing on people’s ignorance.

From the earliest days of the Net, providers have had an agreement that as long as the amount of traffic going between them is about the same in both directions they don’t charge each other. In a sense it would “all come out in the wash.” But if the traffic became lop-sided, if one was sending much more traffic into the other than the other was sending back, then they would charge each other. This is just fair. Suddenly, because movies consume a lot of capacity, Netflix is generating considerable load that wasn’t there before. This isn’t about blocking a single Verizon customer from getting his movie; this is about the thousands of Verizon customers all downloading movies at the same time, with all of that capacity being consumed at a point between Netflix’s network provider and Verizon. It is even likely they didn’t have lines with that much capacity, so new ones had to be installed. That is very expensive. Verizon wants to charge Netflix or Netflix’s provider because the traffic moving from them to Verizon is now lop-sided by a lot. This request is perfectly reasonable and it has nothing to do with the Internet being neutral. Here’s an analogy: imagine your neighbor suddenly installed an aluminum smelter in his home and was going to use 10,000 times more electricity than he used to. He then tells the electric company that they have to install much higher capacity power lines to his house and provide all of that electricity, and that his monthly electric bill should not go up. I doubt the electric company would be convinced.

Net neutrality basically confuses two things: traffic engineering and discriminating against certain sources of traffic. The confusion is created because of the flaws introduced fairly early and what those flaws then forced the makers of Internet equipment to do to work around them. Internet applications don’t tell the network what kind of service they need from the Net. So when customers demanded better quality for voice and video traffic, the providers had two basic choices: over-provision their networks to run at about 20% efficiency (you can imagine how well that went over) or push the manufacturers of routers to provide better traffic engineering. Because of the problems in the Internet, about the only option open to manufacturers was to look deeper into the packet than just making sure they routed the packet to its destination. However, looking deeper into a packet also means being able to tell who sent it. (If applications start encrypting everything, this will no longer work.) This of course not only makes it possible to know which traffic needs special handling, but makes it tempting to slow down a competitor’s traffic. Had the Net been properly structured to begin with (and in ways we knew about at the time), then these two things would be completely distinct: one would have been able to determine what kind of packet was being relayed without also learning who was sending it. Net neutrality would then only be about discriminating between different sources of data, and traffic engineering would not be part of the problem at all.

Of course, Comcast shouldn’t be allowed to slow down Skype traffic because it is in competition with Comcast’s phone service.  Or Netflix traffic that is in competition with its on-demand video service. But if Skype and Netflix are using more than ordinary amounts of capacity, then of course they should have to pay for it.

Emerson: That takes care of three of the five flaws in TCP. What about the next two?

Day: The next two are somewhat hard to explain to a lay audience but let me try. A Transport Protocol like TCP has two major functions: 1) make sure that all of the messages are received and put in order, and 2) don’t let the sender send so fast that the receiver has no place to put the data. Both of these require the sender and receiver to coordinate their behavior. This is often called feedback, where the receiver is feeding back information to the sender about what it should be doing. We could do this by having the sender send a message and the receiver send back a special message that indicates it was received (the “ack” we mentioned earlier) and to send another. However, this process is not very efficient. Instead, we like to have as many messages as possible ‘in flight’ between them, so they must be loosely synchronized. However, if an ack is lost, then the sender may conclude the messages were lost and re-transmit data unnecessarily. Or worse, the message telling the sender how much it can send might get lost. The sender is waiting to be told it can send more, while the receiver thinks it told the sender it could send more. This is called deadlock. In the early days of protocol development a lot of work was done to figure out what sequence of messages was necessary to achieve synchronization. Engineers working on TCP decided that a 3-way exchange of messages (3-way handshake) could be used at the beginning of a connection. This is what is currently taught in all of the textbooks. However, in 1978 Richard Watson made a startling discovery: the message exchange was not what achieved the synchronization. It was explicitly bounding three timers. The messages are basically irrelevant to the problem. I can’t tell you what an astounding result this is. It is an amazingly deep, fundamental result – Nobel Prize level! It not only yields a simpler protocol, but one that is more robust and more secure than TCP. 
Other protocols, notably the OSI Transport Protocol, incorporate Watson’s result but TCP only partially does, and not the parts that improve security. We have also found this implies the bounds of what is networking. If an exchange of messages requires the bounding of these timers to work correctly, it is networking or interprocess communication. If they aren’t bounded, then it is merely a remote file transfer. Needless to say, simplicity, working well under harsh conditions (or robustness), and security are all hard to get too much of.
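Day's remark that sending one message and waiting for its ack "is not very efficient" can be put in numbers with the standard sliding-window utilization estimate. The sketch below is illustrative only: the function name and the figures (1 ms to transmit a message, 99 ms round trip) are assumptions chosen for easy arithmetic, not values from the interview.

```python
def utilization(window, transmit_ms, round_trip_ms):
    """Fraction of time the sender's link is busy.

    `window` is how many messages may be in flight before the sender
    must stop and wait for an ack. One cycle lasts from the start of the
    first transmission until its ack returns.
    """
    cycle_ms = transmit_ms + round_trip_ms
    return min(1.0, window * transmit_ms / cycle_ms)


if __name__ == "__main__":
    # Hypothetical link: 1 ms per message, 99 ms round trip.
    for window in (1, 10, 50, 100):
        print(window, utilization(window, 1, 99))
```

With these assumed numbers, a window of one message (send, then wait for the ack) keeps the link busy 1% of the time, while a window of 100 keeps it fully busy. That gain is why the two ends have to stay loosely synchronized about how much is in flight.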

Addressing is even more subtle and its ramifications even greater. The simple view is that if we are to deliver a message in a network, we need to say where the message is going. It needs an address, just like when you mail a letter. While that is the basic problem to be solved, it gets a bit more complicated with computers. In the early days of telephones and even data communications, addressing was not a big deal. The telephones or terminals were merely assigned the names of the wire that connected them to the network. (This is sometimes referred to as “naming the interface.”) Until fairly recently, the last 4 digits of your phone number were the name of the wire between your phone and the telephone office (or exchange) where the wire came from. In data networks, this often was simply assigning numbers in the order the terminals were installed.

But addressing for a computer network is more like the problem in a computer operating system than in a telephone network. We first saw this difference in 1972. The ARPANET did addressing just like other early networks. IMP addresses were simply numbered in the order they were installed. A host address was an IMP port number, or the wire from the IMP to the host. (Had BBN given a lot of thought to addressing? Not really. After all this was an experimental network. The big question was, would it work at all!!?? Let alone could it do fancy things! Believe me, just getting a computer that had never been intended to talk to another computer to do that was a big job. Everyone knew that addressing issues were important and difficult to get right, so a little experience first would be good before we tackled them. Heck, the maximum number of hosts was only 64 in those days.)

In 1972, Tinker AFB joined the ‘Net and wanted two connections to the ARPANET for redundancy! My boss told me this one morning, and I first said, ‘Great! Good ide . . .’ I didn’t finish it and instead said, ‘O, cr*p! That won’t work!’ (It was a head-slap moment!) 😉 And a half second after that I said, ‘O, not a big deal, we are operating system guys, we have seen this before. We need to name the node.’

Why wouldn’t it work? If Tinker had two connections to the network, each one would have a different address because they connected to different IMPs. The host knows it can send on either interface, but the network doesn’t know it can deliver on either one. To the network, it looks like two different hosts. The network couldn’t know those two interfaces went to the same place. But as I said, the solution is simple: the address should name the node, not the interface.[2]

Just getting to the node is not enough. We need to get to an application on the node. So we need to name the applications we want to talk to as well. Moreover, we don’t want the name of the application to be tied to the computer it is on. We want to be able to move the application and still use the same name. In 1976, John Shoch put this into words as: application names indicate what you want to talk to; network addresses indicate where it is; and routes tell you how to get there.
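The difference between naming the interface and naming the node, together with Shoch's what/where/how distinction, can be sketched in a few lines. Everything named here is hypothetical and exists only for illustration: the host name `tinker`, the application name `weather-service`, the interface labels, and the table layout. None of it reflects how the ARPANET or any real router stored this information.

```python
def deliver_by_interface(dest_interface, link_up):
    """Interface addressing: the address *is* the wire, so there is no fallback."""
    return dest_interface if link_up[dest_interface] else None


def deliver_by_node(dest_node, node_interfaces, link_up):
    """Node addressing: any working interface of the node will do."""
    for interface in node_interfaces[dest_node]:
        if link_up[interface]:
            return interface
    return None


def reach_application(app_name, directory, node_interfaces, link_up):
    """What (application name) -> where (node address) -> how (a usable path)."""
    node = directory[app_name]
    return deliver_by_node(node, node_interfaces, link_up)


# Hypothetical dual-homed host with one failed connection.
NODE_INTERFACES = {"tinker": ["imp1/port0", "imp2/port2"]}
LINK_UP = {"imp1/port0": False, "imp2/port2": True}
DIRECTORY = {"weather-service": "tinker"}

if __name__ == "__main__":
    print(deliver_by_interface("imp1/port0", LINK_UP))   # None: looks unreachable
    print(deliver_by_node("tinker", NODE_INTERFACES, LINK_UP))
    print(reach_application("weather-service", DIRECTORY, NODE_INTERFACES, LINK_UP))
```

With interface addresses, the sender that happened to pick the failed connection sees an unreachable host, because nothing tells the network that the second interface leads to the same place. With a node address the network can choose the surviving path, and the application name stays valid even if the application later moves to another node.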

The Internet still only has interface addresses. They have tried various work-arounds to solve not having two-thirds of what is necessary. But like many kludges, they only kind of work, as long as there aren’t too many hosts that need it. They don’t really scale. But worse, none of them achieve the huge simplification that naming the node does. These problems are as big a threat to the future of the Internet as the congestion control and security problems. And before you ask, no, IPv6 that you have heard so much about does nothing to solve them. Actually from our work, the problem IPv6 solves is a non-problem, if you have a well-formed architecture to begin with.

The biggest problem is router table size. Each router has to know where next to send a packet. For that it uses the address. However, for years the Internet continued to assign addresses in order. So unlike a letter, where your local post office can look at the State or Country and know which direction to send it, Internet addresses didn’t have that property. Hence, routers in the core of the ‘Net needed to know where every address went. As the Internet boom took off, that table was growing exponentially and was exceeding 100K routes. (This table has to be searched on every packet.) Finally in the early 90s, they took steps to make IP addresses more like postal addresses. However, since they were interface addresses, they were structured to reflect what provider’s network they were associated with, i.e. the ISP becomes the State part of the address. If one has two interfaces on different providers, the problem above is not fixed. Actually, such a site needs a provider-independent address, which also has to be in the router table. Since even modest-sized businesses want multiple connections to the ‘Net, there are a lot of places with this problem and router table size keeps getting bigger and bigger, now around 500K. 512K is an upper bound that we can go beyond, but it impairs adoption of IPv6 to do so. In the early 90s, there was a proposal[3] to name the node rather than the interface. But the IETF threw a temper tantrum and refused to consider breaking with tradition. Had they done that, it would have reduced router table size by a factor of between 3 and 4, so router table size would be closer to 150K. In addition, naming only the interface makes doing mobility a complex mess.
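The effect of provider-based ("postal") addressing on table size, and the way a multihomed site escapes it, can be illustrated with Python's standard `ipaddress` module. The prefixes below are reserved documentation ranges standing in for hypothetical customers, and the helper `table_size` is introduced here purely for illustration.

```python
import ipaddress


def table_size(prefixes, aggregate):
    """Number of entries a core router would need for these prefixes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    if aggregate:
        # Merge adjacent prefixes that share a common, shorter prefix.
        networks = list(ipaddress.collapse_addresses(networks))
    return len(networks)


# Four hypothetical customers numbered out of one provider's block ...
PROVIDER_CUSTOMERS = [
    "192.0.2.0/26",
    "192.0.2.64/26",
    "192.0.2.128/26",
    "192.0.2.192/26",
]
# ... and one multihomed site with a provider-independent prefix.
MULTIHOMED_SITE = ["198.51.100.0/24"]

if __name__ == "__main__":
    routes = PROVIDER_CUSTOMERS + MULTIHOMED_SITE
    print(table_size(routes, aggregate=False))  # every prefix listed separately
    print(table_size(routes, aggregate=True))   # provider block collapses to one
```

The four provider-assigned prefixes collapse into a single entry, but the provider-independent prefix cannot be folded into anyone's block and keeps its own entry. Every additional multihomed site therefore adds at least one entry to the core tables.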

Emerson: I see – so, every new “fix” to make the Internet work more quickly and efficiently is only masking the fundamental underlying problems with the architecture itself. What is the last flaw in TCP you’d like to touch on before we wrap up?

Day: Well, I wouldn’t say ‘more quickly and efficiently.’ We have been throwing Moore’s Law at these problems: processors and memories have been getting faster and cheaper faster than the Internet problems have been growing, but that solution is becoming less effective. Actually, the Internet is becoming more complex and inefficient.

But as to your last question, another flaw with TCP is that it has a single message type rather than separating control and data. This not only leads to a more complex protocol but also to greater overhead. They will argue that being able to send acknowledgements with the data in return messages saved a lot of bandwidth. And they are right. It saved about 35% of bandwidth when using the most prevalent machine on the ’Net in the 1970s, but that behavior hasn’t been prevalent for 25 years. Today the savings are minuscule. Splitting IP from TCP required putting packet fragmentation in IP, which doesn’t work. But if they had merely separated control and data it would still work. TCP delivers an undifferentiated stream of bytes, which means that applications have to figure out what is meaningful, rather than delivering to a destination the same amount the sender asked TCP to send. The latter turns out to be what most applications want. Also, TCP sequence numbers (to put the packets in order) are in units of bytes, not messages. This means they “roll over” quickly, either putting an upper bound on TCP speed or forcing the use of an extended sequence number option, which is more overhead. It also greatly complicates reassembling messages, since there is no requirement to re-transmit lost packets starting with the same sequence number.
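How quickly byte-based sequence numbers "roll over" is simple arithmetic. TCP's sequence number field is 32 bits wide and counts bytes, so the whole space is consumed after 2^32 bytes. The sketch below compares that with a hypothetical protocol that numbers whole messages; the 1500-byte message size and the function name are assumptions made for illustration.

```python
SEQUENCE_SPACE = 2 ** 32  # distinct values in a 32-bit sequence number


def wrap_seconds(bits_per_second, count_bytes=True, message_bytes=1500):
    """Seconds of continuous sending before sequence numbers repeat.

    count_bytes=True numbers every byte (as TCP does); count_bytes=False
    numbers whole messages of `message_bytes` bytes (an assumed size).
    """
    bytes_per_second = bits_per_second / 8
    if count_bytes:
        units_per_second = bytes_per_second
    else:
        units_per_second = bytes_per_second / message_bytes
    return SEQUENCE_SPACE / units_per_second


if __name__ == "__main__":
    for rate in (10e6, 1e9, 10e9):
        print(f"{rate:.0e} bit/s: "
              f"bytes wrap in {wrap_seconds(rate):.1f} s, "
              f"messages wrap in {wrap_seconds(rate, count_bytes=False):.0f} s")
```

At 1 Gbit/s the byte-numbered space wraps in roughly 34 seconds, whereas numbering 1500-byte messages would stretch that by a factor of 1500.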

Of the 4 protocols we could have chosen in the late 70s, TCP was (and remains) the worst choice, but they were spending many times more money than everyone else combined. As you know, he with the most money to spend wins. And the best part was that it wasn’t even their money.

Emerson: Finally, I wondered if you could briefly talk about RINA and how it could or should fix some of the flaws of TCP you discuss above? Pragmatically speaking, is it fairly unlikely that we’ll adopt RINA, even though it’s a more elegant and more efficient protocol than TCP/IP?

Day: Basically RINA picks up where we left off in the mid-70s and extends what we were seeing then but hadn’t quite recognized. What RINA has found is that all layers have the same functions; they are just focused on different ranges of the problem space. So in our model there is one layer that repeats over different scopes. This by itself solves many of the existing problems of the current Internet, including those described here. In addition, it is more secure, and multihoming and mobility fall out for free. It solves the router table problem because the repeating structure allows the architecture to scale, etc.

I wish I had a dollar for every time someone has said (in effect), “gosh, you can’t replace the whole Internet.” There must be something in the water these days. They told us that we would never replace the phone company, but it didn’t stop us and we did.

I was at a high-powered meeting a few weeks ago in London that was concerned about the future direction of architecture. The IETF [Internet Engineering Task Force] representative was not optimistic. He said that within 5-10 years, the number of Internet devices in the London area would exceed the number of devices on the ‘Net today, and they had no idea how to do the routing so the routing tables would converge fast enough.

My message was somewhat more positive. I said, I have good news and bad news. The bad news is: the Internet has been fundamentally flawed from the start. The flaws are deep enough that either they can’t be fixed or the socio-political will is not there to fix them. (They are still convinced that not naming the node when they had the chance was the right decision.) The good news is: we know the answer and how to build it, and these routing problems are easily solved.

[1] An IMP was an ARPANET switch or, in today’s terms, a router. (It stood for Interface Message Processor, but it is one of those acronyms where the definition is more important than what it stood for.) NCP was the Network Control Program, which managed the flows between applications such as Telnet, a terminal device driver protocol, and FTP, a File Transfer Protocol.

[2] It would be tempting to say “host” here rather than “node,” but one might have more than one node on a host. This is especially true today with Virtual Machines so popular, each one is a node. Actually, by the early 80s we had realized that naming the host was irrelevant to the problem.

[3] Actually, it wasn’t a proposal, it was already deployed in the routers and being widely used.