This paper appears as Chapter 1 in the book Distributed Systems,
edited by Sape Mullender and published by Addison-Wesley and ACM Press
in 1993, 1994, and 1995.

Copyright 1993 by the ACM Press.


 

A State-of-the-Art Distributed System: Computing with BOB


Michael D. Schroeder
DEC Systems Research Center
130 Lytton Ave
Palo Alto, CA 94301

mds@pa.dec.com



Distributed systems are a popular and powerful computing paradigm. 
Yet existing examples have serious short-comings as a base for general-purpose 
computing. This chapter explores the characteristics, strengths, 
and weaknesses of distributed systems. It describes a model for a 
state-of-the-art distributed system that can do a better job of supporting 
general-purpose computing than existing systems. This model system 
combines the best features of centralized and networked systems, 
with improved availability and security. The chapter concludes by 
outlining the technical approaches that appear most promising for 
structuring such a system.

This chapter does not specifically discuss application-specific distributed 
systems, such as automated banking systems, control systems for roaming 
and tracking cellular telephones, and retail point-of-sales systems, 
although there are many economically important examples. The issues 
raised in the chapter do apply to such systems and the model system 
described here should be a suitable base for many of them. Some systems 
that are designed to support a narrow set of uses, however, may need 
to be structured in application-specific ways that can be simpler 
and more efficient for those uses. The extent to which special-purpose 
systems should be built on a general-purpose distributed base needs 
to be investigated further. This chapter also does not address real-time 
distributed systems, such as control systems for factories, aircraft, 
or automobiles, which face distinctive scheduling and resource utilization 
requirements. 


1. Characteristics of Distributed Systems

A distributed system is several computers doing something together. 
Thus, a distributed system has three primary characteristics. 
 
o  Multiple Computers -- A distributed system contains more than 
   one physical computer, each consisting of cpu's, some local memory, 
   possibly some stable storage like disks, and I/O paths to connect 
   it with the environment. 

o  Interconnections -- Some of the I/O paths will interconnect the 
   computers. If they cannot talk to each other, then it is not going 
   to be a very interesting distributed system. 
   
o  Shared State -- The computers cooperate to maintain some shared 
   state. Put another way, if the correct operation of the system 
   is described in terms of some global invariants, then maintaining 
   those invariants requires the correct and coordinated operation 
   of multiple computers. 

Building a system out of interconnected computers requires that three 
major issues be addressed. 

o  Independent Failure -- Because there are several distinct computers 
   involved, when one breaks others may keep going. Often we want 
   the "system" to keep working after one or more have failed. 
   
o  Unreliable Communication -- Because in most cases the interconnections 
   among computers are not confined to a carefully controlled environment, 
   they will not work correctly all the time. Connections may be 
   unavailable. Messages may be lost or garbled. One computer cannot 
   count on being able to communicate clearly with another, even 
   if both are working. 
  
o  Insecure Communication -- The interconnections among the computers 
   may be exposed to unauthorized eavesdropping and message modification. 

A centralized system that supports multiple processes and provides 
some form of interprocess communication, e.g. a Unix timesharing 
system, can exhibit in virtual form the three primary characteristics 
of a distributed system. A centralized system may even manifest independent 
failure, e.g. the demon process for mail transport may crash without 
stopping interactive user processes on the same system. Thus, design 
and programming techniques associated with communicating sequential 
processes in centralized systems form part of the basic techniques 
in the distributed systems arena. However, centralized systems are 
usually successful without dealing with independent failure and never 
need to confront unreliable and insecure communications. In a distributed 
system all three issues must be addressed. 

The canonical example of a general-purpose distributed system today 
is a networked system -- a set of workstations/PCs and servers interconnected 
with a network. 


2. Networked vs Centralized Systems

It is easy to understand why networked systems are popular. Such 
systems allow the sharing of information and resources over a wide 
geographic and organizational spread. They allow the use of small, 
cost-effective computers and get the computing cycles close to the 
data. They can grow in small increments over a large range of sizes. 
They allow a great deal of autonomy through separate component purchasing 
decisions, selection of multiple vendors, use of multiple software 
versions, and adoption of multiple management policies. Finally, 
they do not necessarily all crash at once. Thus, in the areas of 
sharing, cost, growth, and autonomy, networked systems are better 
than traditional centralized systems as exemplified, say, by timesharing. 

On the other hand, centralized systems do some things better than 
today's networked systems. All information and resources in a centralized 
system are equally accessible. Functions work the same way and objects 
are named the same way everywhere in a centralized system. And a 
centralized system is easier to manage. So despite the advantages 
of networked systems, centralized systems are often easier to use 
because they are more accessible, coherent, and manageable. 

In the areas of security and availability, the comparison between 
networked systems and centralized systems produces no clear-cut advantage 
for either. 

2.1 Security

In usual practice to date, neither centralized nor networked systems 
offer real security, but for different reasons. 

A centralized system has a single security domain under the control 
of a single authority. The trusted computing base is contained in 
a single operating system. That operating system executes on a single 
computer that can have a physically secure environment. With all 
the security eggs in one basket, so to speak, users understand the 
level of trust to assign to the system and who to speak to when problems 
arise. On the other hand, it is notoriously difficult to eliminate 
all the security flaws from an operating system or from the operating 
environment, and with a single security domain one such flaw can 
be exploited to break the security of the entire system. 

Networked systems have multiple security domains and thus exhibit 
the inverse set of security properties. The trusted computing base 
is scattered among many components that operate in environments with 
varying degrees of physical security, differing security policies, 
and possibly under different authorities. The interconnections among 
ther computers are physically insecure. It is hard to know what is 
being trusted and what can be trusted. But, because the system contains 
many computers, exploiting a security flaw in the software or environment 
of one computer does not automatically compromise the entire system. 

2.2 Availability

A similar two-sided analysis applies to availability. A centralized 
system can have a controlled physical and operational environment. 
Since a high proportion of system failures are the result of operational 
and environmental factors, careful management of this single environment 
can produce good availability. But when something does go wrong 
the whole system goes down at once, stopping all users from getting 
work done. 

In a networked system the various computers fail independently. However, 
it is often the case that several computers must be in operation 
simultaneously before a user can get work done, so the probability 
of the system failing is greater than the probability of one component 
failing. This increased probability of not working, compared to a 
centralized system, is the result of ignoring independent failure. 
The consequence is Leslie Lamport's definition of a distributed system: 
"You know you have one when the crash of a computer you've never 
heard of stops you from getting any work done." 

On the other hand, independent failure in a distributed system can 
be exploited to increase availability and reliability. When independent 
failure is properly harnessed by replicating functions on independent 
components, multiple component failures are required before system 
availability and reliability suffer. The probability of the system 
failing thus can be less than the probability in a centralized system. 
Dealing with independent failure to avoid making availability worse, 
or even to make it better, is a major task for designers of distributed 
systems. 

A distributed system also must cope with communication failures. 
Unreliable communication not only contributes to unavailability, 
it can lead to incorrect functioning. A computer cannot reliably 
distinguish a down neighbor from a disconnected neighbor and therefore 
can never be sure an unresponsive neighbor has actually stopped. 
Maintaining global state invariants in such circumstances is tricky. 
Careful design is required to actually achieve correct operation 
and high availability using replicated components. 

2.3 A State-of-the-Art Distributed System

It seems feasible to develop a distributed system that combines the 
accessibility, coherence, and manageability advantages of centralized 
systems with the sharing, growth, cost, and autonomy advantages of 
networked systems. If real security and high availability were added 
to the mix, then we would have a state-of-the-art computing base 
for many purposes. Achieving this combination of features in a single 
system is the central challenge of supporting general-purpose computing 
well with a distributed system. No existing system fulfills this 
ideal. 


3. The Properties and Services Model

We can describe this best-of-both-worlds (BOB) distributed computing 
base in terms of a model of its properties and services. This more 
technical description of the goals provides a structure that will 
help us to understand the mechanisms needed to achieve them. 
 
The properties and services model defines BOB as: 

o  a heterogeneous set of hardware, software, and data components; 

o  whose size and geographic extent can vary over a large range; 

o  connected by a network; 

o  providing a uniform set of services (naming, remote invocation, 
   user registration, time, files, etc); 
     
o  with certain global properties (names, access, security, management 
   and availability). 

Because we are talking about a base for general-purpose computing, 
the model is defined in terms most appropriate to understanding by 
the programmers who develop components that are part of the base 
and who develop the many different applications that are to be built 
on top of it. But the fundamental coherence provided by the model 
will show through such components and applications (when they are 
correctly implemented) to provide a coherent system as viewed by 
its human users too. 

The coherence that makes BOB a system rather than a collection of 
computers is a result of its uniform services and global properties. 
The services are available in the same way to every part of the system 
and the properties allow every part of the system to be viewed in 
the same way, regardless of system size. Designers are well aware 
of the care that must be taken to produce implementations that can 
support growth to very large sizes (tens or hundreds of thousands 
of nodes). A similar challenge exists in making such expandable systems 
light-weight and simple enough to be suitable for very small configurations 
too. 

BOB's coherence does not mean that all the components of the system 
must be the same. The model applies to a heterogeneous collection 
of computers running operating systems such as Unix, VMS, MS-DOS, 
Windows NT, and others. In short, all platforms can operate in this 
framework, even computers and systems from multiple vendors. The 
underlying network can be a collection of local area network segments, 
bridges, routers, gateways, and various types of long distance services 
with connectivity provided by various transport protocols. 

3.1 Properties

What do BOB's pervasive properties mean in more detail? 

o  Global names -- the same names work everywhere. Machines, users, 
   files, distribution lists, access control groups, and services 
   have full names that mean the same thing regardless of where in 
   the system the names are used. For instance, Butler Lampson's 
   user name might be something like /com/dec/src/bwl throughout 
   the system. He will operate under that name when using any computer. 
   Global naming underlies the ability to share things. 

o  Global access -- the same functions are usable everywhere with 
   reasonable performance. If Butler sits down at a machine when 
   visiting in California, he can do everything there that he can 
   do when in his usual office in Massachusetts, with perhaps some 
   performance degradations. For instance, from Palo Alto Butler 
   could command the local printing facilities to print a file stored 
   on his computer in Cambridge. Global access also includes the 
   idea of data coherence. Suppose Butler is in Cambridge on the 
   phone to Mike Schroeder in Palo Alto and Butler makes a change 
   to a file and writes it. Mike should be able to read the new version 
   as soon as Butler thinks he has saved it. Neither Mike nor Butler 
   should have to take any special action to make this possible. 

o  Global security -- the same user authentication and access control 
   work everywhere. For instance, Butler can authenticate himself 
   to any computer in the system; he can arrange for data transfer 
   secure from eavesdropping and modification between any two computers; 
   and assuming that the access control policy permits it, Butler 
   can use exactly the same mechanism to let the person next door 
   and someone from another site read his files. All the facilities 
   that require controlled access (logins, files, printers, management 
   functions, etc.) use the same machinery to provide access control. 

o  Global management -- the same person can manage components anywhere. 
   Obviously one person will not manage all of a large system. But 
   the system should not impose a priori constraints on which set 
   of components a single person can manage. All of the components 
   of the system provide a common interface to management tools. 
   The tools allow a manager to perform the same action on large 
   numbers of components at once. For instance, a single system manager 
   can configure all the workstations in an organization without 
   leaving his office. 

o  Global availability -- the same services work even after some 
   failures. System managers get to decide (and pay for) the level 
   of replication for each service. As long as the failures do not 
   exceed the redundancy provided, each service will go on working. 
   For instance, a group might decide to duplicate its file servers 
   but get by with one printer per floor. System-wide policy might 
   dictate a higher level of replication for the underlying communication 
   network. Mail does not have to fail between Palo Alto and Cambridge 
   just because some machine goes down in Lafayette, Indiana. 

3.2 Services

The standard services defined by BOB include the following fundamental 
facilities: 

o  Names -- provides access to a replicated, distributed database 
   of global names and associated values for machines, users, files, 
   distribution lists, access control groups, and services. A name 
   service is the key BOB component for providing global names, although 
   most of the work involved in implementing global names is making 
   all the other components of the distributed system, e.g. existing 
   operating systems, use the name service in a consistent way. 

o  Remote procedure call -- provides a standard way to define and 
   securely invoke service interfaces. This allows service instances 
   to be local or remote. The RPC mechanism can be organized to operate 
   by dynamically choosing one of a variety of transport protocols. 
   Choosing RPC for the standard service invocation mechanism does 
   not force blocking call semantics on all programs. RPC defines 
   a way to match response messages with request messages. It does 
   not require that the calling program block to await a response. 
   Methods for dealing with asynchrony inside a single program are 
   a local option. Blocking on RPC calls is a good choice when the 
   local environment provides multiple threads per address space. 

o  User registrations -- allows users to be registered and authenticated 
   and issues certificates permitting access to system resources 
   and information. 

o  Time -- distributes consistent and accurate time globally. 

o  Files -- provides access to a replicated, distributed, global 
   file system. Each component machine of BOB can make available 
   the files it stores locally through this standard interface. The 
   multiple file name spaces are connected by the name service. The 
   file service specification should include standard presentations 
   for the different VMS, Unix, etc. file types. For example, all 
   implementations should support a standard view of any file as 
   an array of bytes. 

o  Management -- provides access to the management data and operations 
   of each component. 

In addition to these base level facilities, BOB can provide other 
services appropriate to the intended applications, such as:

o  Records -- provides access to records, either sequentially or 
   via indexes, with record locking to allow concurrent reading and 
   writing, and journaling to preserve integrity after a failure. 

o  Printers -- allows printing throughout the network of documents 
   in standard formats such as Postscript and ANSI, including job 
   control and scheduling. 

o  Execution -- allows programs to be run on any machine (or set 
   of machines) in the network, subject to access and resource controls, 
   and efficiently schedules both interactive and batch jobs on available 
   machines, taking account of priorities, quotas, deadlines, and 
   failures. The exact configuration and utilization of cycle servers 
   (as well as idle workstations that can be used for computing) 
   fluctuates constantly, so users and applications need automatic 
   help in picking the machines on which to run. 

o  Mailboxes -- provides a transport service for electronic mail. 

o  Terminals -- provides access to a windowing graphics terminal 
   from a computation anywhere in the network. 

o  Accounting -- provides access to a system-wide collection of data 
   on resource usage which can be used for billing and monitoring. 

In many cases adequate, widely accepted standards already exist for 
the definition of the base and additional services. Each service 
must be defined and implemented to provide the five pervasive properties. 

3.3 Interfaces
  
Interfaces are the key to making BOB be a coherent and open system. 
Each of the services is defined by an interface specification that 
serves as a contract between the service and its clients. The interface 
defines the operations to be provided, the parameters of each, and 
the detailed semantics of each relative to a model of the state maintained 
by the service. The specification is normally represented as an RPC 
interface definition. Some characterizations of the performance of 
the operations must be provided (although it is not well understood 
how to provide precise performance characterizations for operations 
whose performance varies widely depending on the parameters and/or 
use history). A precisely defined interface enables interworking across 
models, versions, and vendors. Several implementations of each interface 
can exist and this variety allows the system to be heterogeneous 
in its components. In its interfaces, however, the system is homogeneous. 

It is this homogeneity that makes it a system with predictable behavior 
rather than a collection of components that can communicate. If more 
than one interface exists for the same function, it is unavoidable 
that the function will work differently through the different interfaces. 
The system will consequently be more complicated and less reliable. 
Perhaps some components will not be able to use others at all because 
they have no interface in common. Certainly customers and salesmen 
will find it much more difficult to configure workable collections 
of components, and programmers will not know what services they can 
depend upon being present. 


4.0 Achieving the Global Properties

Experience and research have suggested a set of models for achieving 
global naming, access, security, management, and availability. For 
each of these pervasive properties, we will consider the general 
approach that seems most promising. 

4.1 Naming Model

Every user and every client program sees the entire system as the 
same tree of named objects. A global name is interpreted by following 
the named branches in this tree starting from the global root. Every 
node has a way to find a copy of the root of the global name tree. 

A hierarchic name space is used because it is the only naming structure 
we know of that scales well, allows autonomy in the selection of 
names, and is sufficiently malleable to allow for a long lifetime. 
A global root for the name space is required to provide each object 
with a single name that will work from everywhere. Non-hierarchic 
links can be added where a richer naming structure is required. 

For each object type there is some service, whose interface is defined 
by BOB, that provides operations to create and delete objects of 
that type and to read and change their state. 

The top part of the naming tree is provided by the BOB name service 
and the objects near the root of the tree are implemented by the 
BOB name service. A node in the naming tree, however, can be a junction 
between the name service and some other service, e.g. a file service. 
A junction object contains: 

o  a set of servers for the named object
o  rules for choosing a server name
o  the service interface ID, e.g. "BOB File Service 2.3"
o  an object parameter, e.g. a volume identifier

To look up a name through a junction, choose a server and call the 
service interface there with the name and the object parameter. The 
server looks up the rest of the name.

The servers listed in a junction object are designated by global 
names. To call a service at a server the client must convert the 
server name to something more useful, like the network address of 
the server machine and information on which protocols to use in making 
the call. Looking up a server name in the global name tree produces 
a server object that contains: 

o  a machine name
o  a set of communication protocol names

A final name lookup maps the (global) machine name into the network 
address that will be the destination for the actual RPC to the service.

The junction machinery can be used at several levels, as appropriate. 
The junction is a powerful technique for unifying multiple implementations 
of naming mechanisms within the same hierarchic name space.

The following figure gives an example of what the top parts of the 
global name space might look like, based on the naming scheme of 
the Internet Domain Name Service. An X.500 service could also provide 
this part of the name space (or both the Internet DNS and X.500 could 
be accommodated). An important part of implementing BOB is defining 
the actual name hierarchy to be used in top levels of the global 
name space. 


   /com/dec/src/domain ... data for SRC management domain
   |   |   |   /adb : Principal=(Password=..., Mailbox=..., etc)
   |   |   |   /bwl : Principal=(Password=..., Mailbox=..., etc)
   |   |   |   /mds : Principal=(Password=..., Mailbox=..., etc)
   |   |   |   / ... other principals registered at SRC
   |   |   |    
   |   |   |   /staff : Group=(/com/dec/src/adb, /com/dec/src/bwl, etc)
   |   |   |   / ... other groups at SRC
   |   |   |    
   |   |   |   /computers/C1 : Computer=(Address=16.1.0.0, HWInfo=...,
   |   |   |                            Bootfile=..., Bootdata=...,
   |   |   |                            etc)
   |   |   |             /C2 : Computer=(Address=16.1.0.1, etc)
   |   |   |             / ... other computers
   |   |   |    
   |   |   |   /backmap/16.1.0.0:Path=/com/dec/src/computers/C1
   |   |   |           /16.1.0.1:Path=/com/dec/src/computers/C2
   |   |   |           / ... IP addresses of other computers
   |   |   |    
   |   |   |   /bin : Volume=(junction to file service)
   |   |   |   /udir : Volume=(junction to file service)
   |   |   |   / ... other volumes
   |   |   |   / ... other SRC objects
   |   |   |
   |   |   / ... other parts of DEC
   |   |
   |   / ... other commercial organizations
   |
   / ... other sectors and countries


Consider some of the objects named here. /com, /com/dec, and /com/dec/src 
are directories implemented by the BOB name service. /com/dec/src/adb 
is a registered user, also an object implemented by the name service. 
The object /com/dec/src/adb contains a suitably encrypted password, 
a set of mailbox sites, and other information that is associated 
with this system user. 

/com/dec/src/staff is a group of global names. Group objects are 
provided by BOB's name service to implement things like distribution 
lists, access control lists, and sets of servers. /com/dec/src/bin 
is a file system volume. Note that this object is a junction to the 
BOB file service. The figure does not show the content of this junction 
object, but it contains a group naming the set of servers implementing 
this file volume and rules for choosing which one to use, e.g., the 
first that responds. To look up the name /com/dec/src/bin/ls, for 
example, the operating system on a client machine traverses the path 
/com/dec/src/bin using the name service. The result at that point 
is the content of the junction object, which then allows the client 
to contact a suitable file server to complete the lookup. 

The local management autonomy provided by hierarchic names allows 
system implementors and administrators to build and use their systems 
without waiting for a planet-wide agreement to be reached about the 
structure of the first few levels of the global hierarchy. A local 
hierarchic name space can be constructed that is sufficient to the 
local need, treating the local root as a global root. Later, this 
local name space can be incorporated as a subtree in a larger name 
space. Using a variable to provide the name of the operational root 
(set to NIL at first) will ease the transition to the larger name 
space (and shorten the names that people actually use). Another technique 
is to initially name the local root with an identifier that is unlikely 
to appear in any future global root; then symbolic links in the global 
root, or special case code in the local name resolution machinery, 
can ease the transition. 

4.2 Access model

Global access means that a program can run anywhere in BOB (on a 
computer and operating system compatible with the program binary) 
and get the same result, although the performance may vary depending 
on the machine chosen. Thus, a program can be executed either on 
the user's workstation or on a fast cycle server in the machine room, 
while still accessing the same user files through the same names. 

Achieving global access requires allowing all elements of the computing 
environment of a program to be remote from the computer where the 
program is executing. All services and objects required for a program 
to run need to be available to a program executing anywhere in the 
distributed system. For a particular user, "anywhere" includes at 
least: 

o  on the user's own workstation;
o  on public workstations or compute servers in the user's domain;
o  on public workstations in another domain on the user's LAN;
o  on public workstations across a low-bandwidth WAN.

Performance in the first three cases should be similar. Performance 
in the fourth case is fundamentally limited by the WAN characteristics, 
although use of caching can make the difference small in many cases. 

In BOB, global naming and standard services exported via a uniform 
RPC mechanism provide the keys to achieving global access. All BOB 
services accept global names for the objects on which they operate. 
All BOB services are available to remote clients. Thus, any object 
whose global name is known can be accessed remotely. 

In addition, programs must access their environments only by using 
the global names of objects. This last step will require a thorough 
examination of the computing environment provided by each existing 
operating system to identify all the ways in which programs can access 
their environment. For each way identified a mechanism must be designed 
to provide the global name of the desired object. For example, in 
Unix systems that operate as part of BOB, the identities of the file 
system root directory, working directory, and the /tmp directory 
of a process must be specified by global names. Altering VMS, Unix, 
and other operating systems to accept global names everywhere will 
be a major undertaking. 

Another aspect of global access is making sure that BOB services 
have operation semantics that are location transparent. Without location 
transparency in the services it uses, a program will not get the 
same result when it runs on different computers. A service that allows 
read/write sharing of object state must provide a data coherence 
model. The model allows client programs to maintain correct object 
state and to behave in a manner that does not surprise users, no 
matter where clients and servers execute. Depending on the nature 
of the service, it is possible to trade off performance, availability, 
scale, and coherence. 

In the name service, for example, it is appropriate to increase performance, 
availability, and scalability at the expense of coherence. A client 
update to the name service database can be made by contacting any 
server. After the client operation has completed, the server propagates 
the update to object replicas at other servers. Until propagation 
completes, different clients can read different values for the object. 
This lack of coherence produces several advantages: it increases 
performance by limiting the client's wait for update completion; 
it increases availability by allowing a client to perform an update 
when just one server is accessible; and it increases scale by making 
propagation latency separate from the visible latency of the client 
update operation. For the objects that the name service will store, 
this lack of coherence is deemed acceptable given the benefits it 
produces. The data coherence model for the name service defines the 
loose coherence invariants that programmers can depend upon, thereby 
meeting the requirement of a coherence model that is insensitive 
to client and server location. 

On the other hand, the BOB file service needs to provide consistent 
write sharing, even at some cost in performance, scale, and availability. 
Many programs and users are accustomed to using the file system as 
a communication channel between programs. For example, a programmer 
may save a source file for a module from an editor and then trigger 
a recompilation and relinking on a remote cycle server. He will be 
annoyed if the program is rebuilt with an old version of the module 
because the cycle server retrieved an old, cached version of the 
file. File read/write coherence is also important among elements 
of distributed computations running, say, on multiple computers on 
the same LAN. The file system coherence model must cover file names 
and attributes as well as file data. 

There is diversity of opinion among researchers about the best consistency 
model for a general-purpose distributed file system. Some feel that 
an open/close consistency model provides the best tradeoff. With 
this model changes made by one client are not propagated until that 
client closes the file and others (re)open it. Others feel that byte-level 
write sharing is feasible and desirable. With this model clients 
share the file as though all were accessing it in the same local 
memory. Successful systems have been built using variants of both 
models. BOB can be successful based on either model. 

4.3 Security model

Security is based on three notions:

o  Authentication -- for every request to do an operation, the name 
   of the user or computer system making the request is known reliably. 
   The source of a request is called a "principal". 

o  Access control -- for every resource (computer, printer, file, 
   database, etc.) and every operation on that resource (read, write, 
   delete, etc.), it is possible to specify the names of the principals 
   allowed to do that operation on that resource. Every request for 
   an operation is checked to ensure that its principal is allowed 
   to do that operation. 

o  Auditing -- every access to a resource can be logged if desired, 
   as can the evidence used to authenticate every request. If trouble 
   comes up, there is a record of exactly what happened. 

To authenticate a request as coming from a particular principal, 
the system must determine that the principal originated the request, 
and that it was not modified on the way to its destination. We do 
the latter by establishing a "secure channel" between the system 
that originates the request and the one that carries it out. Practical 
security in a distributed system requires encryption to secure the 
communication channels. The encryption must not slow down communication, 
since in general it is too hard to be sure that a particular message 
does not need to be encrypted. So the security architecture should 
include methods of doing encryption and decryption on the fly, as 
data flows from computers into the network and back. 

To determine who originated a request, it is necessary to know who 
is on the other end of the secure channel. Usually this is done by 
having the principal at the other end demonstrate that it knows some 
secret (such as a password), and then finding out in a reliable way 
the name of the principal that knows that secret. BOB's security 
architecture needs to specify how to do both these things. It is 
best if a principal can show that it knows the secret without giving 
it away, since otherwise the system can later impersonate the principal. 
Password-based schemes reveal the secret, but schemes based on encryption 
do not. 

It is desirable to authenticate a user by his possession of a device 
which knows his secret and can demonstrate this by encryption. Such 
a device is called a "smart card". An inferior alternative is for 
the user to type his password to a trusted agent. To authenticate 
a computer system, we need to be sure that it has been properly loaded 
with a good operating system image; BOB must specify methods to ensure 
this. 

Security depends on naming, since access control identifies the principals 
that are allowed access by name. Practical security also depends 
on being able to have groups of principals e.g., the Executive Committee, 
or the system administrators for the cluster named "Star". Both these 
facilities must be provided by the name service. To ensure that the 
names and groups are defined reliably, digital signatures are used 
to certify information in the name service; the signatures are generated 
by a special "certification authority" which is engineered for high 
reliability and kept off-line, perhaps in a safe, when its services 
are not needed. Authentication depends only on the smallest sub-tree 
of the full naming tree that includes both the requesting principal 
and the resource; certification authorities that are more remote 
are assumed to be less trusted. 

Security also depends on time. Authentication, access control, and 
secure channels require correct timestamps and clocks. The time service 
must distribute time securely. 

4.4 Management model

System management is the adjustment of system state by a human manager. 
Management is needed when satisfactory algorithmic adjustments cannot 
be provided -- when human judgement is required. The problem in a 
large-scale distributed system is to provide each system manager 
with the means to monitor and adjust a fairly large collection of 
different types of geographically distributed components. Of the 
five persuasive properties of BOB, global management is the one we 
understand least well how to achieve. The facilities, however, are 
likely to be structured along the following lines. 

The BOB management model is based on the concept of domains. Every 
component in a distributed system is assigned to a domain. (A component 
is a piece of equipment or a piece of management-relevant object 
state.) Each domain has a responsible system manager. In the simple 
version of the model (that is probably adequate for most, perhaps 
all, systems) domains are disjoint and managers are disjoint, although 
more complex arrangements are possible, e.g. overlapping domains, 
a hierarchy of managers. Ideally a domain would not depend on any 
other domains for its correct operation. 

There needs to be quite a bit of flexibility available for defining 
domains, as different arrangements will be effective in different 
installations. Example domains include: 

o  components used by a group of people with common goals;
o  components that a group of users expects to find working;
o  largest pile of components under one system manager;
o  arbitrary pile of components that is not too big.

As a practical matter, customers will require guidelines for defining 
effective management domains. 

BOB requires that each component define and export a management interface, 
using RPC if possible. Each component is managed via calls to this 
interface from interactive tools run by human managers. Some requirements 
for the management interface of a component are: 

o  Remote access -- The management interface provides remote access 
   to all management functions. Local error logs are maintained that 
   can be read from the management interface. A secure channel is 
   provided from management tools to the interface and the operations 
   are subject to authentication and access control. No running around 
   by the manager is required. 

o  Program interface -- Each component's management interface is 
   designed to be driven by a program, not a person. Actual invocation 
   of management functions is by RPC calls from management tools. 
   This allows a manager to do a lot with a little typing. A good 
   management interface provides end-to-end checks to verify successful 
   completion of a series of complex actions and provides operations 
   that are independent of initial component state to make it easier 
   to achieve the desired final state. 

o  Relevance -- The management interface operates only on management- 
   relevant state. In places where the flexibility is useful rather 
   than just confusing, the management interface permits decentralized 
   management by individual users. 

o  Uniformity -- Different kinds of components should strive for 
   uniformity in their management interfaces. This allows a single 
   manager to retain intellectual control of a larger number of kinds 
   of components. 

The management interfaces and tools make it practical for one person 
to manage large domains. An interactive management tool can invoke 
the management interfaces of all components in a domain. It provides 
suitable ways to display and correlate the data, and to change the 
management-relevant state of components. Management tools are capable 
of making the same state change in a large set of similar components 
in a domain via iterative calls. To provide the flexibility to invent 
new management operations, some management tools support the construction 
of programs that call the management interfaces of domain components. 

4.5 Availability Model

To achieve high availability of a service there must be multiple 
servers for that service. If these servers are structured to fail 
independently, then any desired degree of availability can be achieved 
by adjusting the degree of replication. 

The most practical scheme for replication of the services in BOB 
is primary/backup, in which a client uses one server at a time and 
servers are arranged (as far as possible) to fail in a benign way, 
say by stopping. The alternative method, called active replication, 
has the client perform each operation at several servers. Active 
replication uses more resources than primary/backup but has no failover 
delays and can tolerate arbitrary failure behavior by servers. 

To see how primary/backup works, recall from the naming model discussion 
that the object that represents a service includes a set of servers 
and some rules for choosing one. If the chosen server (the primary) 
fails, then the client can failover to another server (the backup) 
and repeat the operation. The client assumes failure if a response 
to an operation does not occur within a timeout period. The timeout 
should be as short as possible so that the latency of an operation 
that fails  over is comparable to the usual latency. In practice, 
setting good timeouts is hard and fixed timeouts may not be adequate. 
If clients can do operations that change long-term state then the 
primary server must keep the backup servers up-to-date.

To achieve transparent failover from the point of view of client 
programs, knowledge of the multiple servers should be encapsulated 
in an agent on the client computer. In this chapter we refer to the 
agent as a clerk. The clerk can export a logically centralized, logically 
local service to the client program, even when the underlying service 
implementation is distributed, replicated, and remote. The clerk 
software can have many different structural relationships to its 
client. In simple cases it can be runtime libraries loaded into the 
client address space and be invoked with local procedure calls. Or 
it may operate in a separate address space on the same machine as 
the client and be invoked by same-machine IPC, same-machine RPC, 
or callbacks from the operating system. Or it can be in the operating 
system itself. 

The clerk interface need not be the same as the server interface. 
Indeed, the server interface usually will be significantly more complex. 
In addition to implementing server selection and failover, a clerk 
may provide caching and write behind to improve performance, and 
can implement aggregate operations that read-modify-write server 
state. As simple examples of caching, a name service clerk might 
remember the results of recently looked up names and maintain open 
connections to frequently used name servers, or a file service clerk 
might cache the results of recent file directory and file data reads. 
Write-behind allows a clerk to batch several updates as a single 
server operation which can be done asynchronously, thus reducing 
the latency of operations at the clerk interface. Implementing a 
client's read-modify-write operation might require the clerk to use 
complex retry strategies involving several server operations when 
a failover occurs at some intermediate step. 

As an example of how a clerk masks the existence of multiple servers, 
consider the actions involved in listing the contents of 
/com/dec/src/udir/bwl/Mail/inbox, a BOB file system directory. The 
client program presents the entire path name to the file service 
clerk. The clerk locates a name server that stores the root directory 
and presents the complete name. That server may store the directories 
down to, say, /com/dec. The directory entry for /com/dec/src will 
indicate another set of servers. So the first lookup operation will 
return the new server set and the remaining unresolved path name. 
The clerk will then contact a server in the new set and present the 
unresolved path src/udir/bwl/Mail/inbox. This server discovers that 
src/udir is a junction to a file system volume, so returns the junction 
information and the unresolved path name udir/bwl/Mail/inbox. Finally, 
the clerk uses the junction information to contact a file server, 
which in this example actually stores the target directory and responds 
with the directory contents. What looks like a single operation to 
the client program actually involves RPCs to three different servers 
by the clerk. 

This example of a name resolution would be completed with fewer or 
no operations at remote servers if the clerk has a cache that already 
contains the necessary information. In practice, caches are part 
of most clerk implementations and most operations are substantially 
speeded by the presence of cached data. 

Other issues arise when implementing a high-availability service 
with long term state. The servers must cooperate to maintain consistent 
state among themselves, so a backup can take over reasonably quickly. 
Problems to be solved include arranging that no writes get lost during 
failover from one server to another and that a server that has been 
down can recover the current state when it is restarted. Combining 
these requirements with caching and write-behind to obtain good performance, 
without sacrificing consistent sharing, can make implementing a highly 
available service quite challenging. 


5. Conclusion

This chapter covers the inherent strengths of centralized and networked 
computing systems. It outlines the structure and properties of BOB, 
a state-of-the-art distributed computing base for supporting general-purpose 
computing. This system combines the best features of centralized 
and networked systems with recent advances in security and availability 
to produce a powerful, cost-effective, easy-to-use computing environment. 

Getting systems like BOB into widespread use will be hard. Given 
the state-of-the-art in distributed systems technology, building 
a prototype for proof of concept is certainly feasible. But the only 
practical method for getting widespread use of systems with these 
properties is to figure out ways to approach the goal by making incremental 
improvements to existing networked systems. This requires producing 
a sequence of useful, palatable changes that lead all the way to 
the goal. 


6. Acknowledgements

The material in this chapter was jointly developed by Michael Schroeder, 
Butler Lampson, and Andrew Birrell. The ideas explored here aggregate 
many years of experience of the designers, builders, and users of 
distributed systems. Colleagues at SRC and fellow authors of this 
book have provided useful suggestions on the presentation.