Communication support in operating systems for distributed transactions

This paper describes the communication functions required for distributed transaction processing. The paper begins with a discussion of models that illustrate how a communication subsystem fits into a proposed system architecture. Then, it describes the system and user activities that depend on the communication subsystem. Finally, it uses these activities to motivate the facilities that should be provided by a communication subsystem that supports transaction processing.


Three Models
There is substantial agreement on the underlying system model for distributed processing. The model has processing nodes and communication networks, as illustrated in Figure 2-1. [...] In Level 5, applications use the distributed transaction facility (DTF) to begin, commit, and abort transactions and to execute operations on objects. Example applications include a banking terminal system and an interactive interface to a database manager.
This architecture provides two benefits over traditional architectures that blur the distinction between Levels 3, 4, and 5. First, because many of the components that support transactions are standardized and moved lower into the system hierarchy, there is the potential to implement them more efficiently. Second, the architecture provides a common notion of transaction and data object for all objects and applications in the system, and permits more uniform access to data. This permits an application, for example, to update transactionally a relational database containing indexing information, a file containing image data, and a hierarchical database containing performance records. All the system components also use standardized facilities for performing remote accesses, for transaction commitment, etc.
Having characterized the computational activities required for distributed transaction processing, we can now examine the activities that use the communication subsystem. We can then turn our attention to the requirements of the communication subsystem and how to meet them.

3. Activities Requiring Communication Support
Distributed transaction facilities require communication for data objects and applications in Levels 4 and 5 and system activities in Levels 1 through 3. This section lists those activities and then describes a set of communication subsystem functions that will support them.

Communication-related Activities of Data Objects and Applications
Before data objects and applications can begin to access other data objects, they must first establish a communication path to them. A name service locates servers that encapsulate objects and returns lower-level names that can be used to establish communication. To support distributed replication algorithms, a single object name may be associated with several copies of an object, each stored on a different node. Typically, replication techniques specify the number of copies of an object to which they require access.
The name service must manage the name space to prevent unintended name duplication and to ensure appropriate authorization for name insertion and deletion operations. This is clearly a distributed data management problem, which is best solved by a collection of trusted Level 4 objects that can use the DTF. Thus, while the name service is logically related to communication services, it need not be implemented within the communication subsystem. However, the communication subsystem must provide the name service with well-known connections through which name servers can communicate with each other.
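As a rough illustration of the lookup behavior described above, the following Python sketch maps one object name to several copies and lets a caller ask for the number of copies its replication technique needs. The class names, field names, and the in-memory table are assumptions made for this example; the sketch is a single-process toy that omits authorization, distribution across name servers, and stable storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """Where one copy of an object lives (all fields are illustrative)."""
    node: str            # node that stores this copy
    low_level_name: str  # name handed to the communication subsystem


class NameService:
    """Toy in-memory name service: not distributed, no authorization."""

    def __init__(self):
        self._table = {}  # object name -> list of Binding, one per copy

    def register(self, name, binding):
        copies = self._table.setdefault(name, [])
        # Reject an unintended duplicate: one copy per node per name.
        if any(b.node == binding.node for b in copies):
            raise ValueError(f"{name!r} already has a copy on {binding.node}")
        copies.append(binding)

    def delete(self, name, binding):
        self._table[name].remove(binding)
        if not self._table[name]:
            del self._table[name]

    def lookup(self, name, copies_needed=1):
        # The caller states how many copies it requires access to.
        copies = self._table.get(name, [])
        if len(copies) < copies_needed:
            raise LookupError(f"only {len(copies)} copies of {name!r} known")
        return copies[:copies_needed]


ns = NameService()
ns.register("accounts", Binding("node-a", "port-17"))
ns.register("accounts", Binding("node-b", "port-4"))
print([b.node for b in ns.lookup("accounts", copies_needed=2)])
```

The returned bindings carry only the lower-level names; establishing the communication path itself is left to the communication subsystem.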
Questions arise concerning the permanence of name mappings, the granularity of objects that are named, and the management of the name space. There are many feasible answers to these questions, but here are some reasonable ones:
• Name mappings are relatively useless for objects that are inaccessible; when an object is unusable, it usually does not help to know its location. Hence, the motivation for replicating name mappings on another node is to reduce communication, not to provide availability.
• Objects registered in the system-wide name service should be coarsely grained; e.g., a database name rather than the names of all its relational tables. More detailed name resolution can be performed in an object-specific fashion. This decision is almost a necessity, both to reduce the number of communication paths to a server and to obviate the need for a uniform name space for each data object in an entire distributed system.
• Flow-control and pipelining. Large amounts of data may be passed to and from objects and require flow-control. For example, a request to a remote object could result in a response containing megabytes of data. Pipelining may be useful on networks having long latencies. Even on local area networks, the increasing use of networks interconnected by bridges or gateways tends to increase delays and the consequent need for pipelining.
• Crash detection. There must be a mechanism for determining if a server has crashed after its first use and prior to commit. Timing out while awaiting a response from an object is one crash detection technique, but a session failure provides more uniform and timely information for most errors. For example, sessions can detect most crashes even when a client is not calling its server.
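The crash detection point can be sketched as follows, assuming the session layer exchanges periodic probe messages and treats a peer as crashed once several consecutive probes go unanswered. The class name, the probe interval, and the miss threshold are assumptions for illustration; time is passed in explicitly so that the example stays deterministic.

```python
class SessionMonitor:
    """Tracks when each session peer was last heard from (illustrative only)."""

    def __init__(self, probe_interval, max_missed):
        self.probe_interval = probe_interval  # seconds between probes
        self.max_missed = max_missed          # unanswered probes tolerated
        self.last_heard = {}                  # peer -> time of last message

    def heard_from(self, peer, now):
        # Any traffic on the session counts, not only probe replies.
        self.last_heard[peer] = now

    def crashed_peers(self, now):
        # A peer silent for longer than max_missed probe intervals is
        # presumed crashed, even if no client call is outstanding.
        limit = self.probe_interval * self.max_missed
        return sorted(p for p, t in self.last_heard.items() if now - t > limit)


monitor = SessionMonitor(probe_interval=1.0, max_missed=3)
monitor.heard_from("server-1", now=0.0)
monitor.heard_from("server-2", now=0.0)
monitor.heard_from("server-1", now=3.5)  # server-1 answers a later probe
print(monitor.crashed_peers(now=4.0))
```

Because the check does not depend on an outstanding request, it reports a silent server between calls, which a per-call timeout cannot do.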

Communication-related Activities of System Levels
The data objects and applications of Levels 4 and 5 require communication primitives that provide very general functions: support for arbitrarily long messages, authentication, and the like. In Levels 1, 2, and 3, communication is more constrained and there is more a priori knowledge of message contents. For example, knowledge that a message usually fits within a network packet permits a simpler transmission protocol to be used. Similarly, some data can be piggy-backed on messages sent by Levels 4 and 5.
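One kind of system-level data that can ride on existing messages is a logical time value. The paper's discussion of logical time describes a counter on each node and a field in each inter-node message that may update the counter. A minimal Python sketch of that idea follows; the class and method names are assumptions for illustration, and the max-plus-one update rule is the conventional logical clock rule rather than something the text spells out.

```python
class LogicalClock:
    """Per-node counter whose value is piggy-backed on outgoing messages."""

    def __init__(self):
        self.counter = 0

    def read(self):
        # Every observation advances the clock, so readings never repeat.
        self.counter += 1
        return self.counter

    def stamp_outgoing(self):
        # Value to place in the time field of an inter-node message.
        return self.read()

    def observe_incoming(self, stamp):
        # Move past the sender's time so the receive is ordered after the send.
        self.counter = max(self.counter, stamp)
        return self.read()


server1, server2 = LogicalClock(), LogicalClock()
for _ in range(5):
    server1.read()               # local activity on Server 1
a = server1.stamp_outgoing()     # Server 1 observes time A and sends
b = server2.observe_incoming(a)  # Server 2 receives and observes time B
print(a, b, b > a)
```

The only cost to the carrying message is one integer field, which is why this service suits piggy-backing.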

Communication services required by
Figure 2-1: Hardware Model

Name mappings should survive node crashes to reduce the amount of work required to restart a node. Once the name service has located an object, the lower-level object name can be passed to the communication subsystem and a session created. Clients and servers require sessions between them for many reasons:
• Authentication and protection. If clients are to be certain they are accessing a particular server and servers are to check the access rights of a caller, then the communication session must be authenticated in some way, possibly using encryption techniques [Needham and Schroeder 78]. Prevention of active and passive attacks on the communication channel is also a desirable service for many applications [Sansom et al. 86, Birrell and Nelson 84].
Communication on a session usually takes the form of a (synchronous) remote procedure call having at-most-once semantics. While there is room for diversity in the definition of these semantics, all definitions guarantee that an operation on a server will be performed at most one time, despite network failures and retransmissions. Providing higher service levels in the communication subsystem (atomicity, or exactly-once semantics) is unnecessary because the DTF can completely abort arbitrary units of work, which may then be retried. As mentioned above, requests and responses may have unlimited lengths, so intra-message flow control may be needed.
Sometimes, more general forms of remote procedure call may be useful [Spector 82]. Asynchronous RPCs permit a client to continue processing and to receive a signal when a response is returned. Multicast RPCs issue a request to multiple servers. A multicast RPC primitive may await all responses before returning, or it may signal the client as each response arrives. The latter organization is useful when a client invokes an operation on multiple servers but does not need all responses before continuing work. Multicast RPCs may be implemented on multiple sessions, or may use a single session having multiple destinations. The latter is required if low-level network multicast primitives are to be used.
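The at-most-once guarantee for synchronous RPC can be illustrated with duplicate suppression on the server side of a session. This is one common technique, not necessarily the one used in the systems the paper draws on. The sketch assumes each client numbers its requests, that the reply table lives only as long as the session, and that a server crash ends the session; the class and parameter names are hypothetical.

```python
class AtMostOnceServer:
    """Executes each (client, sequence number) request at most one time."""

    def __init__(self, handler):
        self._handler = handler
        self._replies = {}  # (client_id, seq) -> reply already computed

    def handle(self, client_id, seq, request):
        key = (client_id, seq)
        if key not in self._replies:
            # First arrival: perform the operation and remember the reply.
            self._replies[key] = self._handler(request)
        # A retransmission gets the saved reply; the operation is not redone.
        return self._replies[key]


balance = {"total": 0}


def deposit(amount):
    balance["total"] += amount
    return balance["total"]


server = AtMostOnceServer(deposit)
first = server.handle("client-1", 1, 50)
retry = server.handle("client-1", 1, 50)  # retransmitted after a lost reply
print(first, retry, balance["total"])
```

A real session would also discard old entries, for example once the client acknowledges a reply; the unbounded table here keeps the sketch short.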
[...] observable dependencies between events are reflected in time values provided by the clock; that is, if Server 1 observes the time as A and then sends a message to Server 2, then Server 2, on receiving the message, will observe the time as B, with B > A. Such a mechanism is useful for various types of synchronization, for example, for supporting hybrid atomicity [Herlihy 85]. The underlying algorithm makes use of a counter on each node and a field included in each inter-node message that may update the counter.
A distributed real-time service that is synchronized across nodes supports synchronization algorithms and performance measurement techniques. Many implementations of such mechanisms require periodic exchange of time information, which is done by appending information to existing message traffic and sending short messages during idle periods.
To perform atomic commit processing, the DTF must send control messages such as Prepare-to-Commit, Prepare-Ack, Commit, and Commit-Ack [Lindsay et al. 79]. Some of these messages are typically sent to one or more of the nodes involved in a transaction. Regardless of protocol, the communication subsystem should maintain appropriate information on the nodes involved in the transaction, and control messages should be sent with low overhead. Usually, they can be sent as network datagrams because messages are short and reliable transmission is not needed; the transaction manager must deal with node crashes anyway. Even though data encryption and authentication may be needed for Level 4 and 5 communication, control messages are difficult to forge and they contain so little data that is valuable to outsiders that there may be no reason to encrypt them. However, certain commit protocols can benefit from transmission to a multicast address that is incrementally developed as the transaction executes.
A DTF that supports nested transactions also requires a lock-resolution protocol in addition to the commit protocols. This protocol is invoked to determine if a nested transaction can inherit a lock from a relative in the tree of transactions. Depending upon the frequency of lock inheritance, this protocol may be invoked often and require high performance.
Distributed deadlock detection algorithms typically require piecing together enough of the distributed "wait-for" graph to break cycles. This requires the periodic transmission of information or its piggybacking on other messages.
The communication facilities themselves require communication in addition to the usual demands for session establishment and the transmission and acknowledgment of user-supplied data. For example, control messages are sent by authentication servers as part of session creation. "Are you there" messages may be periodically sent on sessions to rapidly detect server or node crashes.
Finally, to aid in reliability testing, the communication subsystem should enable users to test the system under conditions of communication failures: lost, duplicate, and corrupted
packets; partitions; and delays. Being able to simulate these conditions is an important feature. Also, facilities for monitoring the performance of the communication subsystem are useful. Methodical, empirical testing is needed to develop robust systems.

4. Communication Subsystem Functions and Implementation
This section lists a plausible set of functions that a communication subsystem should provide, given the requirements described in the previous section. It also describes the broad outlines of an implementation strategy for them. These ideas are loosely based on our design of TABS and Camelot, with additions from other systems where needed.

4.1. Name Service
The name service should provide primitives to associate a name with one or more servers that implement the named object. It may also associate a lower-level name used by a server to distinguish between the multiple objects it implements. The name service also provides primitives to look up and delete names. The lookup primitive should permit the caller to specify how many servers should be returned and to set a timeout interval after which control will be returned. While the name service does not need to be part of the communication subsystem, it is closely related and worthwhile to include in this section.
One implementation strategy is to have multiple name servers on the network that communicate with each other. Because of the desirability of storing name bindings permanently (so as to not have to register objects after a crash), the name service should be implemented as transactional (Level 4) servers, which can utilize stable storage. The DTF's services also simplify the consistency management of the name space. For example, new names can be added within a transaction. Locally storing recently used name bindings (hints) reduces the amount of inter-node communication, provided that applications are willing to detect and handle potentially out-of [...] the communication between system-level entities. At minimum, sessions should
support an efficient implementation of RPC with "are you there" messages to detect crashes. However, a session's required functions and implementation (including the amount of state that must be maintained) vary with the requirements of the DTF and the structure of the underlying system. For example, sessions supporting multicast RPC should have multiple recipients to take advantage of low-level multicast facilities. Differing needs for asynchronous RPCs, protection, authentication, conversion of heterogeneous data, and arbitrary internetworking also influence the functions and implementation of sessions.
There are at least two facilities for supporting a DTF that a communication subsystem can perform. First, it can record the participants in a transaction by watching the messages and the transaction identifiers contained in them. This information is needed at commit time. Additionally, the communication subsystem can incrementally distribute a network multicast address to all the sites within a transaction so that network multicast can be used during the two-phase commit protocol. This multicast address can be related to the global transaction identifier and be piggy-backed on request messages. Cheriton describes a design for this in the V System [...]
[...] of datagram-based communication is to reduce transmission latency and CPU overhead. In order to keep datagram-based communication sufficiently lightweight, it is inevitably restricted in function: limited datagram sizes, lack of protection or authentication, etc. New functions that slow datagram transmission should be avoided. Certainly, datagram communication should support unreliable point-to-point transmissions; it should also support multicast, because many networks provide the necessary hardware support. Both of these services require little protocol layering. Possibly, there should be some datagram support that is tailored to operation on a single local area network, recognizing that there are services that would not be used over a
long-haul network. For example, a stable storage server (log) would almost certainly be on the same local area network as its client nodes [Daniels et al. ...]
[...] of miscellaneous features that a communication subsystem should support: a distributed (logical and/or real) time service, the parameterized insertion of errors or creation of network partitions, and a communication performance monitor. Other features may be needed for real-time applications or some high-availability architectures.
[...] is not surprising to find that sessions and datagrams are the two most important facilities. However, in this environment where there is closely-coupled distributed processing, atomic commitment, replication, and a strong emphasis on reliable, highly available operation, there are some additional features that a communication subsystem should support. These include multicast, logical time, real time, performance evaluation, and fault insertion services. Higher level protocols not part of the communication subsystem but closely related to it are needed for commitment, nested transaction lock resolution, deadlock detection, and name resolution. All these additional facilities necessarily require standardized interfaces and protocols to support open systems. In some instances, these facilities are being considered by standardization