Distributed Computer Systems: Backwards towards the future

David J. Farber, Professor of Computer and Information Science, University
of Pennsylvania, 200 South 33rd Street, Philadelphia PA 19104-6389.
EMAIL: Farber@cis.upenn.edu. Invited paper presented at the First
International Symposium on Autonomous Decentralized Systems (ISADS),
Kawasaki, Japan, April 1993.



The advent of gigabit network technology has inspired a rethinking of, among
other things, the structure of network computers, operating systems, and
protocol structures. This rethinking has, in the author's opinion, led to the
conclusion that a globally distributed computer system represents one of the
best applications of this new technology. This paper will examine some of the
considerations that led to this conclusion, as well as explore the nature of 
such a Global Computer.


Initially, we will examine the way in which various components of a computer
system are affected by their use in networks that offer host-to-host
potential throughput of more than one gigabit per second. A short summary of
this examination follows:


We will begin by examining the bottom of the computer architecture, that is,
the processor-to-memory bus system. This system, which in many modern
computers also carries the majority of the traffic that comes into memory via
the I/O system, provides, in its current form, a bandwidth in the range of one
gigabit per second. Thus, data coming in from the outside at one gigabit per
second will substantially conflict with the needs of the processor-memory
pipeline. A number of solutions to this problem have been suggested. The first
and most common solution is to define another bus to serve as the memory-
processor communication bus and to use the main computer bus only for access
to a dual-ported or multi-ported memory for the I/O system. This creates a
number of difficult architectural problems and can notably increase the cost
of the computer system. Another approach, and the one that the author suggests,
is to treat the network as part of the path to a global multi-processor
environment. In this view, the processor-memory path is intrinsically the same
as the processor-network path, and the problem then reduces to an issue of how
to avoid multi-memory bank conflicts on the bus. However, this is an open and
confused area. In the short term, it is probably the limiting obstruction to
the widespread use of workstations on high-speed networks.
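As a rough, back-of-the-envelope illustration of this contention (a sketch in
Python; the bandwidth figures and copy counts are illustrative assumptions,
not measurements), consider how much of a processor-memory bus is consumed by
gigabit network input:

```python
# Illustrative arithmetic: what fraction of a processor-memory bus is
# consumed by incoming network traffic? All figures are assumptions.

def bus_fraction_consumed(bus_gbps, net_gbps, copies=1):
    """Fraction of bus bandwidth used by incoming network traffic.

    copies: each store of arriving data crosses the bus once; extra
    memory-to-memory copies inside the OS multiply that cost.
    """
    return min(1.0, (net_gbps * copies) / bus_gbps)

# A 1 Gb/s bus with 1 Gb/s arriving and a single copy is fully saturated,
# leaving nothing for the processor-memory pipeline:
print(bus_fraction_consumed(1.0, 1.0))            # 1.0
# Even a hypothetical 4 Gb/s bus loses half its capacity if the operating
# system copies each arriving buffer twice:
print(bus_fraction_consumed(4.0, 1.0, copies=2))  # 0.5
```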

Moving up the hierarchy to the computer-network interface, one encounters
more problems. In order to get information into the computer from an external
device, it is necessary to define some form of interface card which maps from
the internal structure of the computer system to the format of the network.
Currently, cards are being developed which translate buffers of data into ATM
format and then interconnect through SONET chips to the gigabit network.
Experience at the University of Pennsylvania has shown that, in many
workstations, the I/O controllers are too slow to service the bandwidth
requirements of such interface cards. This compounds the bus congestion
problem and again creates a major bottleneck. In the future, controller chips
will help alleviate this problem. For example, the University of Pennsylvania
ATM interface board is capable of operating at 650 megabits per second, but is
limited in the case of the RS-6000 to 135 megabits per second. Future versions
of the RS-6000 I/O controller will allow speeds up to 600 megabits per
second, but still not one gigabit.

Even if information could be brought into a computer's physical memory fast
enough, our problems have just begun. Modern operating systems, such as
UNIX, have grown in size and complexity to the point that they create a major
bottleneck for the utilization of high speed networks. Even without considering
protocol processing, the manipulation performed by the operating system in
moving data (as it comes in from an external network until it can be delivered
to a requesting processor) involves a substantial number of memory-to-memory
moves, each consuming considerable computer time. This problem does not have
an obvious solution. Rather, it suggests an approach that could be described as
a "kinder, gentler" operating system, which goes back to the original
simplicity of the early UNIX systems as created at Bell Laboratories.
More will be said about this later in the paper.

Moving up one more level in the pipeline of bottlenecks, we run into the
protocol system. Modern protocol systems such as TCP/IP have evolved in
response to slow-speed communication systems operating in the low one-
megabit range. These systems have been characterized by high error rates
and modest bandwidth-latency metrics. In modern networks, one may have
hundreds of thousands of bits flowing across the United States in
response to a request for information transfer. Protocol systems employ
windowing mechanisms to deal with the management of such latency. In the
case of gigabit networks, there will be millions of bits flowing in the pipe
if we take a similar approach to protocol design. This amount of data
would put an inordinate strain on any rational windowing system and, more
seriously, make intelligent anticipation strategies extremely difficult.
That is, it would be difficult for the system to know what to ask for so
that the right information gets to the requesting host at the right
time. Even if we could solve, in a simple way, the above problems, the
complexity of the resulting protocol systems would, when operating at the
speeds anticipated in the future, require most, if not all, of the computing
cycles available in modern workstations. In this case, only the fastest
supercomputers would have the processing power to service such protocol
systems. If one presumes that the purpose of computing is to compute and not
to run protocols, then something is fundamentally wrong.

Moving up one final step, we turn to the switching systems that will, in the
future, support very high speed networking. In spite of everything that has
been said about the convergence of communications and computing, we still
build networks, be they local or national, in a way that isolates them from the
computers that are their customers. In a sense, the computers are like
telephone subscribers dialing connections through the network, with the
network refusing, and unable, to help in any of the transactions other than
merely establishing connections. This isolation creates additional burdens on
the computer systems. A richer interaction between computer and network--
indeed, a blurring of the distinction between computer and network--is needed.

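The windowing argument can be made concrete with a bandwidth-delay
calculation; the sketch below assumes an illustrative 60 millisecond
coast-to-coast round-trip time:

```python
# Bandwidth-delay product: the bits "in the pipe" that a windowing
# protocol must manage. The 60 ms round-trip time is an assumed
# coast-to-coast figure, used only for illustration.

def bits_in_flight(bandwidth_bps, rtt_ms):
    """Bits outstanding on a link: bandwidth times round-trip time."""
    return bandwidth_bps * rtt_ms // 1000

# A 1.5 Mb/s link: ~90,000 bits outstanding -- the "hundreds of
# thousands of bits" regime that existing protocols were built for.
print(bits_in_flight(1_500_000, 60))      # 90000
# The same round trip at 1 Gb/s: 60 million bits in the pipe.
print(bits_in_flight(1_000_000_000, 60))  # 60000000
```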
In 1970, an experiment called the DCS project attempted to create, by
interconnecting a number of minicomputers, a system that would offer
seamless user access to the computer system resources, fault tolerance,
and incremental expansion, and, most importantly, would look like a single
computer to the using software. The resulting system, which went operational
in 1973 utilizing a PDP-11-like system, contained a number of innovative
ideas in its structure. These included one of the first micro-kernels, a fully
distributed fault-tolerant file system, an inter-process communication
structure based on messaging, and a process-migration structure supported, as
was the inter-process communication structure, by the communication
subsystem. The communication subsystem was the first operational token ring,
which, in almost all its details, preceded and formed the basis for the 802.5
token ring. It included a mechanism within the communication system that knew
about process names, allowing messages to processes to be sent around the
ring. In addition, the system could, by examining interfaces external to the
computing system, determine if the processes addressed by these messages
were resident in the attached processor. If a process was resident, a copy of
the message was made and an appropriate indicator was put on the message,
which continued around the ring. Resource management, process migration, and
process binding all took advantage of the name tables that were in the ring.
In an appendix to this paper, we have placed a short description of the DCS
system. It is put there not to boast or claim priority, but to indicate that
many of the ideas we will discuss later in this paper were around in the
early 70's and are still reasonable ideas, as they have been for the past
twenty years.
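The process-name addressing just described can be sketched in modern terms;
the structures below are hypothetical simplifications of the DCS ring
interface, not its actual design:

```python
# Sketch of DCS-style process-name addressing: a message circulates the
# ring carrying a destination process name; each ring interface hosting
# that process copies the message for local delivery and marks it as
# matched before passing it onward. Names and structures are illustrative.

class RingInterface:
    def __init__(self, host, resident):
        self.host = host
        self.resident = set(resident)   # process names on this host
        self.delivered = []             # messages copied for local delivery

    def pass_message(self, msg):
        if msg["to"] in self.resident:
            self.delivered.append(dict(msg))  # copy off the ring
            msg["matched"] = True             # indicator travels onward
        return msg

msg = {"to": "fileserver", "body": "open /etc/motd", "matched": False}
ring = [RingInterface("h1", ["editor"]),
        RingInterface("h2", ["fileserver"]),
        RingInterface("h3", [])]
for ri in ring:                      # one full trip around the ring
    msg = ri.pass_message(msg)
print(msg["matched"])                # True -- sender sees delivery occurred
print(ring[1].delivered[0]["body"])  # open /etc/motd
```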

The Global Computer

One of the hopes for the DCS was that by adding mechanisms to the
communication system, enough assistance would be provided to the computer
software system to allow it to operate in an efficient fashion. Thus, we
expected that by putting multiple processors on the ring, performance that
approached that of the sum of the processors would be achieved. This turned
out not to be quite true for a number of reasons, the most important of which
was that the processors were burdened by the need to repeatedly copy
information coming in from the ring interfaces, through their I/O system, into
the memory system. Thus, DCS suffered from essentially the same problem we
see in gigabit networks.

Five years ago, when reflecting on the experience gained from DCS and from a
companion system developed in the early 80's at the University of Delaware
called SODS (Series One Distributed System), it was realized that if one
continued to view local networks, such as the ring, as communication systems
that were serviced through the I/O mechanisms of computers, it would be
difficult, if not impossible, to achieve greatly improved performance. In
response, a project called MEMNET, which is described in a dissertation by
Gary Delp, was started. This project took the point of view that the memory
system is a more natural vehicle for communication between components of a
distributed system than is a message-based system. In the MEMNET system,
we used special memory cards, one per processor. These cards were coupled
together through a 230 megabit per second insertion ring. When a processor
made a request to memory, the MEMNET card behaved like any other memory
card in the processor. The MEMNET card contained a large memory cache, which
it examined when it saw an address on the bus to see whether the item was in
the cache. If it was not, the processor was held and a request was made around
the ring to all MEMNET cards to see where the requested memory object was.
This object would be brought, through the ring, into the cache of the
requesting MEMNET card, and then the processor request would be honored. Since
the ring was a broadcast ring, it essentially operated as a snooping cache,
preserving ownership and providing proper data consistency. Experience with
MEMNET indicated that this approach is valid and that its performance is
scalable up to a significantly larger number of processors.
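The MEMNET behavior described above can be sketched as follows; this is a
highly simplified software model (the ownership and consistency machinery is
omitted), not the actual hardware:

```python
# Minimal sketch of the MEMNET idea: each processor's memory card holds a
# cache; on a miss, the request circulates the broadcast ring and the card
# holding the object supplies it, after which it is cached locally.

class MemnetCard:
    def __init__(self, name):
        self.name = name
        self.cache = {}          # address -> value

    def lookup(self, address):
        return self.cache.get(address)

class Ring:
    def __init__(self, cards):
        self.cards = cards

    def read(self, requester, address):
        # Fast path: the local card satisfies the request like ordinary
        # memory.
        local = requester.lookup(address)
        if local is not None:
            return local
        # Miss: the processor is "held" while the request travels the
        # ring; the first card holding the object answers, and the object
        # is brought into the requester's cache.
        for card in self.cards:
            if card is not requester:
                value = card.lookup(address)
                if value is not None:
                    requester.cache[address] = value
                    return value
        raise KeyError(address)

a, b = MemnetCard("A"), MemnetCard("B")
ring = Ring([a, b])
b.cache[0x100] = 42
print(ring.read(a, 0x100))   # 42, fetched over the ring
print(a.lookup(0x100))       # 42, now resident in the local cache
```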

The Next Step

When one examines the gigabit world, one can make a number of interesting
observations. A gigabit network has bandwidth equivalent to a local processor
bus. Thus, one is tempted to talk about a national MEMNET. An immediate
reaction to this is, "Yes, but how can we have a processor waiting for a
memory request that must travel across the United States to be satisfied?"
This situation is equivalent to saying that certain requests to memory might
have long latencies before they are satisfied. This sounds familiar. It is the
same behavior that a modern virtual memory paged computing system sees in
normal operation. When a processor makes a request to memory for something
from a page in such a system, often it gets it immediately, but occasionally it
gets a page fault and the process is held waiting while the software system
finds the page, brings it into memory, and then restarts the requested memory
access. The page retrieval time is remarkably close to the time required for a
transcontinental request for a missing object. Our modern computer systems
have been fine-tuned to operate efficiently in such a paged environment, and,
indeed, there are a number of multi-processor virtual memory computer
systems on the market.
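A small calculation supports the analogy; the figures below (disk-fault
service time, coast-to-coast distance, propagation speed in fiber) are
illustrative assumptions:

```python
# Rough comparison: the delay of a disk page fault versus a
# transcontinental memory-object fetch at the speed of light in fiber.
# All figures are assumed, order-of-magnitude values.

def fiber_rtt_ms(distance_km, propagation_km_per_ms=200.0):
    """Round-trip time over fiber; light covers roughly 200 km/ms in glass."""
    return 2 * distance_km / propagation_km_per_ms

disk_fault_ms = 25.0                   # assumed seek + rotation + transfer
coast_to_coast = fiber_rtt_ms(4000)    # ~4000 km across the United States
print(coast_to_coast)                  # 40.0 -- same order as the disk fault
print(coast_to_coast / disk_fault_ms)  # 1.6
```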

Another observation one might make is that it is easy to find a missing page in
a local environment and considerably more difficult to find that page (or 
object) when it is scattered among the geographically dispersed 
components of the Global Computer. Indeed, this is a valid observation. 
The response to this is to call for the communication system to help us. 
The communication system must take part in both finding the location of 
requested objects and, what is more important, in keeping track of the 
migration of these objects as they move around the Global Computer. These 
issues are dealt with in the CAPNET proposal of Ivan Tam, in which a set 
of tables was to be added to each of the switches in a modern gigabit 
network. These tables are similar, but not identical, to the page tables 
in our modern computers. When a request for an item comes into the
switch, a look-up is done in the page table to determine how to route the
request based on whether the switch has seen the requested page go through it.
This is similar in principle to the process-tables of the DCS system. Somewhat
remarkably, it is also similar to the mechanisms that must be put into
switching fabrics to support the future mobile personal communications
systems. A person using a personal communicator would be free to migrate
anyplace within the United States and could be called by an originating person
who knows only his "name." The network would search to see where that
communicator was last seen and would leave information within the switching
fabrics so that a global search would not have to be done each time an attempt
was made to connect to that person. As the person travels, the tables are kept
up to date to minimize wasted searching. If one substitutes the word memory-
object for personal communications system, one has essentially the same
structure, except for some "minor" details. This observation makes it feasible
to discuss the actual deployment of hardware at the switching nodes that will
support a Global Computer as well as personal communicators.
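The table mechanism can be sketched as follows; the data structures are
hypothetical illustrations of the idea, not the CAPNET design itself:

```python
# Sketch of switch-resident location tables: each switch records, per
# object name, the direction in which the object was last seen, so a
# later request can follow the trail instead of triggering a global
# search. The same structure serves a migrating personal communicator.

class Switch:
    def __init__(self, name):
        self.name = name
        self.last_seen = {}    # object name -> next hop (switch or host)

    def note_object(self, obj, toward):
        """Record that `obj` passed through here, heading toward `toward`."""
        self.last_seen[obj] = toward

    def route(self, obj):
        """Next hop for a request, or None if this switch never saw it."""
        return self.last_seen.get(obj)

# An object migrates from an eastern host through s1 and s2 westward;
# each switch on the path records where it went.
s1, s2 = Switch("s1"), Switch("s2")
s1.note_object("obj-7", s2)
s2.note_object("obj-7", "west-host")

# A later request entering at s1 follows the table entries to the object.
hop = s1.route("obj-7")
print(hop.name)            # s2
print(hop.route("obj-7"))  # west-host
```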

Considerations of protection and data consistency suggest that, in practice, 
the items that are communicated within the Global Computer are not pages, but
rather are objects that can provide ownership and access-protection
mechanisms. This evolution of CAPNET is described in a paper by H. Kim called
"GOBNET, A Gigabit Object-oriented Network."

Given the structure outlined above, we now have all the mechanisms in place to
build the Global Computer. When one views this system from a software
perspective, an interesting thing takes place. The Global Computer viewed from
almost every level of software is nothing more than a multi-processor, virtual
memory (possibly object-oriented) computer system. Protocols are no longer
an issue, except for the very simplest and those of the presentation layer.
There are other issues that must be dealt with as well. For example, we must
define appropriate memory cards that will, among other things, cache items
that are moving around the national bus and make requests through the network
for needed objects. Also, if, in fact, we are to build a distributed machine of
heterogeneous elements, then we must pay attention to the issues of data
mapping, etc. This, however, is an issue of heterogeneous systems, not of a
particular distribution design.

Have we gotten away free? We have made it sound all too simple. Essentially,
we have argued that ideas developed in the past twenty years for managing
multi-processor machines and very large data bases are completely mappable
onto the Global Computer. If one were a harsh critic of these ideas, one may
ask, "Yes, but have you solved the latency problem? That is, every time you
make a request outside your local environment it will take a long time for that
object to get to your machine." The answer is, of course, not simple. On one
hand, the problems are no different than those one encounters in the design of
a paged environment. Unless data-structures are properly structured, a paged
environment can collapse into a mass of long delays. The normal solution to 
this problem is to utilize an anticipation mechanism which will pre-page 
objects which must be used often and/or fix them in real local memory. 
Some of the twenty years of research has been towards gaining a deeper 
understanding of what these anticipation strategies should be within a 
local environment. A thesis just completed by Joseph Touch at the 
University of Pennsylvania addressed these issues as they extend to the 
Global Computer. He suggests in the thesis that the limiting performance 
property of a gigabit network will be determined by its ability to 
anticipate what the needs are for data and to move the proper data 
elements in advance of requests. Exploration of how or if the switching 
system can help in this anticipation mechanism is under study at this time.
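A minimal sequential-anticipation policy, one of the simplest possible
examples of such a strategy (and not the policy studied in the thesis), can
be sketched as:

```python
# Sketch of an anticipation (pre-fetch) strategy for the Global Computer:
# when two consecutive object requests are adjacent, the next object is
# speculatively fetched so that it is already local before it is asked
# for. All names and the numbering scheme are illustrative.

class Prefetcher:
    def __init__(self, fetch):
        self.fetch = fetch    # function that pulls an object over the net
        self.local = {}       # objects already resident
        self.last = None      # previously requested object id

    def get(self, oid):
        if oid not in self.local:
            self.local[oid] = self.fetch(oid)   # demand fetch (slow path)
        value = self.local[oid]
        nxt = oid + 1
        # Anticipate: a sequential pattern predicts the next object.
        if self.last is not None and oid == self.last + 1 \
                and nxt not in self.local:
            self.local[nxt] = self.fetch(nxt)
        self.last = oid
        return value

fetched = []
def remote_fetch(oid):
    fetched.append(oid)
    return f"object-{oid}"

p = Prefetcher(remote_fetch)
p.get(1)
p.get(2)             # sequential access: object 3 is prefetched here
print(3 in p.local)  # True -- get(3) will not wait on the network
print(fetched)       # [1, 2, 3]
```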

A final comment

While it may appear to the reader that we have left out much prior work in
shared memory systems, let us assure you that we have factored into our
efforts the work of those who have preceded us. To quote Richard Hamming
and Galois: "We have stood on the shoulders of those who preceded us." The
paper "Memory as a Network Abstraction" cited in the references covers this
prior work in detail.


Some History

In this appendix, we will briefly give some technical details of the DCS effort
that was undertaken at the University of California, Irvine (UCI) in the 
1970 to 1976 era.

First the DCS System. This work was sponsored by the National Science
Foundation starting in 1970. The following material is taken from "Experience
with the Distributed Computer System (DCS)" by David J. Farber and Paul V.
Mockapetris, which was issued in 1976 as UCI Technical Report 116.

The Distributed Computer System (DCS) was a distributed timesharing system
developed at the University of California, Irvine. The system consisted of
multiple minicomputers coupled by a ring communication system. The system
was used as a vehicle for research and development of distributed computer
systems, to support a class in systems design, and as a production system in a
modest, but growing, sense.

The design objectives of the system were coherence, high availability through
fail-soft behavior, and low cost. Enhanced system availability was achieved
through the distribution of hardware, software, and control. The distribution
discipline of DCS also made changes in system configuration natural. The
failure of a redundant hardware or software network component was isolated;
the remaining components continued to function. Mechanisms for the detection
and the restart of failed components allowed the construction of fault-
tolerant services. The paper described the message functions and techniques
that were used in the DCS and pointed out some possible evolutionary trends
in message-based distributed systems. The DCS evolved out of a desire to
explore construction of a coherent distributed environment.

The original DCS design [David J. Farber, "A Distributed Computer System"
UCI TR 4, Sept. 1970] aimed at producing a general purpose timesharing
environment. The system that evolved has the following main features:

1. The system has 3 identical 16-bit minicomputers as its main processing
elements. Other machines of different types are essential for full operation of
the system, but are integrated into the system in a restricted manner. The DCS
system does not completely address the problems of a heterogeneous processor
environment.

2. The hosts in the DCS system are connected by a unidirectional data ring
running at 2.2 megabits per second. The ring interface (RI) supports high-level
protocols such as addressing by process name and implicit acknowledgments at
the hardware level.

3. The system has a simple process structure that provides a coherent
environment. The system interface used by processes has been kept simple. A
user application consists of one or more user processes which communicate
between themselves and access system resources via messages. Such an
amalgamation is called a process net.

4. Interprocess communication is allowed only via the message system. Hence,
the physical location of processes is never a constraint.

5. The operating system kernel in each machine is small and simple. All
machines on the system run identical kernels.

6. Most conventional operating system functions have been moved into
autonomous server processes. These servers provide I/O, etc., via message
protocols to the rest of the system.

The DCS system was fully operational in the mid 1970's, featured, in addition
to the above, process migration, and motivated additional research and
exploration in the US and Japan. In addition, the ring system formed the
technical basis for the Proteon and IBM Token Rings and later the IEEE
standard.

References for DCS
David J. Farber, "A Distributed Computer System", UCI TR 4, Sept. 1970

David J. Farber and John Pickens, "The Overseer", Proceedings of the
International Computer Communications Conference 197

Paul V. Mockapetris, Michael R. Lyle, and David J. Farber, "On the 
Design of Local Network Interfaces", Information Processing 77, 
Proceedings of the IFIPS 77 Congress

References for the Global Computer

David J. Farber, "A Tale of Two Major Networking Problems - one
Organizational and one Technical", invited article for The Harvard 
Information Quarterly - Fall 1989.

Gary Delp, David Farber, Ronald Minnich, Jonathan M. Smith, and Ming-Chit
Tam, "Memory as a Network Abstraction", IEEE Network, Vol. 5(4), pp. 34-41,
(also appears in an IEEE CS Press book on Distributed Computing Systems)
(July, 1991).

David D. Clark, Bruce S. Davie, David J. Farber, Inder S. Gopal, Bharath K.
Kadaba, W. David Sincoskie, Jonathan M. Smith, and David L. Tennenhouse, "The
AURORA Gigabit Testbed," Computer Networks and ISDN Systems, (1991).

Jonathan M. Smith and David J. Farber, "Traffic Characteristics of a 
Distributed Memory System", Computer Networks and ISDN Systems, Vol. 
22(2), pp 143-154 (September 1991).

D. D. Clark, B. S. Davie, D. J. Farber, I. S. Gopal, B. K. Kadaba, W. D. 
Sincoskie, J. M. Smith, and D. L. Tennenhouse, "An Overview of the AURORA 
Gigabit Testbed", in Proceedings, INFOCOM 1992, Florence, ITALY (1992).

C. Brendan S. Traw and Jonathan M. Smith, "Implementation and Performance of
an ATM Host Interface for Workstations" in Proceedings, IEEE Workshop on the
Architecture and Implementation of High-Performance Communications
Subsystems (HPCS '92), Tucson, AZ (February 17-19, 1992).

Jonathan M. Smith, "Protection in Distributed Shared Memories", in
Proceedings, 4th International Workshop on Distributed Environments and
Networks, Tokyo, JAPAN (October 28-31 1991). Invited Paper

Ming-Chit Tam, David J. Farber "CapNet - An Alternate Approach To Ultra-high
Speed Networks", International Communications Conference, April 90, Atlanta

Ming-Chit Tam, Jonathan Smith, David J. Farber. "A Taxonomy Comparison of
Several Distributed Shared Memory Systems", ACM Operating Systems Review,
June 1990

Joseph D. Touch and David J. Farber, "Mirage: A Model for Ultra High-Speed
Protocol Analysis and Design", Proceedings of the IFIPS WG 6.1/WG 6.
Workshop on Protocols for High-Speed Networks, Zurich, Switzerland, 9-11
May 1989

Ronald G. Minnich and David J. Farber, "The Mether System: A Distributed
Shared Memory for SunOS 4.0", Usenix Summer 89

Ronald G. Minnich and David J. Farber. "Reducing Host Load, Network Load, and
Latency in a Distributed Shared Memory", Proceedings of the Tenth {IEEE}
Distributed Computing Systems Conference 1990

