Andrey Ermolinskiy, Daekyeong Moon, Byung-Gon Chun, Scott Shenker
University of California at Berkeley, Intel Research Berkeley, ICSI
A SAN architecture is a particularly attractive choice for parallel clustered applications that demand high-speed concurrent access to a scalable storage backend. Such applications commonly rely on a clustered middleware service to provide a higher-level storage abstraction such as a filesystem (GFS [35], OCFS [8], PanFS [10], GPFS [37]) or a relational database (Oracle RAC [9]) on top of raw disk blocks.
One of the primary design challenges for clustered SAN applications and middleware is ensuring safe and efficient coordination of access to application state and metadata that resides on shared storage. The traditional approach to concurrency control in shared-disk clusters involves the use of a synchronization module called a distributed lock manager (DLM). Typically, DLM services aim to provide the guarantee of strict mutual exclusion, ensuring that no two processes in the system can simultaneously hold conflicting locks. In abstract terms, providing such guarantees requires enforcing a globally-consistent view of lock acquisition state, and one could argue that a traditional DLM design views such consistency as an end in itself rather than a means to achieving application-level correctness.
In this paper, we take a close look at the semantics of SAN lock services and ask whether the assurances of full mutual exclusion and strongly-consistent locking are, in fact, a prerequisite for correct application behavior. Our main finding is that the standard semantics of mutual exclusion provided by a DLM are neither strictly necessary nor sufficient to guarantee safe coordination in the presence of node failures and asynchrony. In particular, processing and queuing delays in SAN switches and host bus adapters (HBAs) expose applications to out-of-order delivery of I/O requests from presumed-faulty processes, which, in certain scenarios, can result in catastrophic violations of safety and permanent data loss.
We propose and evaluate a new technique for disk access coordination in SAN environments. Our approach augments target storage devices with a tiny application-independent functional component, called a guard, and a small amount of state, which enable them to reject inconsistent I/O requests and provide a property called session isolation.
These extensions enable a novel optimistic approach to concurrency control in SANs and can also make existing lock-based protocols safe in the face of arbitrarily delayed message delivery, drifting clocks, process crash failures, and network partitions. The session isolation property in turn provides a foundational primitive for implementing more complex and useful coordination semantics, such as serializable transactions, and we demonstrate one such protocol.
We then describe the implementation of Minuet, a software library that provides a novel synchronization primitive for SAN applications based on the protocols we present. Minuet assumes the presence of a guard at the target storage devices and provides applications with locking and distributed transaction facilities, while guaranteeing liveness and data safety in the face of arbitrary asynchrony, node failures, and network partitions. Our evaluation shows that applications built atop Minuet compare favorably to those that rely on a conventional strongly-consistent DLM, offering comparable or better performance and improved availability.
Unlike existing services for fault-tolerant distributed coordination such as Chubby [20] and ZooKeeper [15], Minuet requires its lock managers to maintain only loosely-consistent replicas of locking state and thus permits applications to make progress with less than a majority of lock manager replicas. To demonstrate the practical feasibility of our approach, we implemented two sample applications - a distributed chunkmap and a B+ tree - on top of Minuet and evaluated them in a clustered environment supported by an iSCSI-based SAN.
The benefits of optimistic concurrency control and the associated tradeoffs have been extensively explored in the database literature and are well understood. In particular, techniques such as callback locking, optimistic 2-phase locking, and adaptive callback locking [18, 21, 24, 42] have been proposed to enable safe coordination and efficient caching in client-server databases. It is important to note, however, that these approaches are not directly applicable to SANs because they assume the existence of a central lock server, typically co-located with the data block storage server. This assumption does not hold in a SAN environment, where the storage "servers" are application-agnostic disk arrays that possess no knowledge of locking state or node liveness status. Hence, a conservative DLM service that enforces strict mutual exclusion has traditionally been viewed as the only practical method of coordinating concurrent access to shared state for SAN applications.
Our main insight is that a single nearly trivial extension to the internal logic of a SAN storage device suffices to address the data safety problems associated with traditional DLMs and enables a very different approach to protocol layering for storage access coordination. Crucially, we achieve this without introducing application-level logic into storage devices and without forfeiting the generality and simplicity of the traditional block-level interface to SAN-attached devices.
The technical feasibility of device-based synchronization and its practical advantages have been demonstrated by several earlier proposals [12, 29]. Our study builds on this earlier work, but while prior efforts have primarily focused on moving the functionality of a traditional cluster lock manager into the storage device, Minuet aims to provide a more general and useful synchronization primitive that supports a wider range of concurrency control mechanisms. In addition to supporting traditional conservative locking, our approach enables an optimistic method of concurrency control that can improve performance for certain application workloads. Further, Minuet allows existing locking protocols to remain safe in the presence of arbitrarily-delayed message delivery, node failures, and network partitions.
The rest of this paper is organized as follows. In Section 2, we provide the relevant background on SANs and some representative examples of data safety problems. In Section 3, we present our main contribution - the design of Minuet, a novel safe and highly available synchronization mechanism for SAN applications. Section 4 describes our prototype implementation and two sample parallel applications. We evaluate our system in Section 5 and discuss practical aspects of our approach in Section 6. Finally, we discuss related work in Section 7 and conclude in Section 8.
SANs aim to provide fully decentralized access to shared application state on disk: in principle, any SAN-attached client node can access any piece of data without routing its requests through a dedicated server. While all requests on a particular piece of data are still centrally serialized in this model, the crucial distinction from the traditional server-attached storage paradigm is that the point of serialization is a hardware disk controller, which exposes an application-independent I/O interface on raw disk blocks and is oblivious to application semantics and data layout considerations.
Broadly, the SAN paradigm is advantageous from the standpoint of availability because it offers better redundancy and decouples node failures from loss of persistent state. Incoming application requests can be routed to any available node in the cluster and, in the event of a node failure, subsequent requests can be redirected to another processor with minimal interruption of service.
One of the primary design challenges for clustered SAN applications and middleware is ensuring safe and efficient coordination of access to shared state on disk, and commonly, a software service called a distributed lock manager (DLM) is employed to provide such coordination. A typical lock service such as OpenDLM [7] operates on shared resources (abstract application-level entities that require access coordination) and attempts to provide the guarantee of mutual exclusion: no two processes may simultaneously hold conflicting locks on the same resource.
Scenario 1: Consider a data structure D spanning 10 blocks on a shared disk and two clients, C1 and C2, that are accessing the data structure concurrently. C1 is updating a range of blocks of D under the protection of an exclusive lock, while C2 wants to read D in its entirety (i.e., blocks 1-10) and is waiting for a shared lock. Suppose C1 crashes after sending its write request to the disk but before hearing the response. The lock manager correctly detects the failure, reclaims the exclusive lock, and grants it to C2 in shared mode. Next, C2 proceeds to reading D and, assuming that a single disk request can carry up to 5 blocks of data, issues two requests: R(1-5) and R(6-10). Suppose C1's delayed write request, whose block range straddles the two halves of D, reaches the disk after R(1-5) but before R(6-10), in which case only the latter would reflect the effects of C1's update. Hence, although individual I/O requests are processed by the disk as atomic units, their inconsistent interleaving would cause C2 to observe and act upon a partial update from C1, which can be viewed as a violation of data safety.
As an alternative to heartbeat failure detection, a lease-based mechanism [26] can be used to coordinate clients' accesses in the above example, but precisely the same problematic scenario would arise when clocks are not synchronized. When C1 crashes and its lease expires, the lease manager could grant the lease to C2 prior to the arrival of C1's last write request at the storage target. Since the target does not coordinate with the lease manager, it fails to establish the fact that an incoming request from C1 is inconsistent with the current lease ownership state.
Scenario 2: Clustered applications and middleware services commonly need to enforce transactional semantics on updates to application state and metadata. In a shared-disk clustered environment, distributed transactions have traditionally been supported by two-phase locking in conjunction with a distributed write-ahead logging (WAL) protocol. In the abstract, the system maintains a snapshot of application state along with a set of per-client logs (also on shared disks) that record Redo and/or Undo information for every transaction along with its commit status. During failure recovery, the system must examine the suspected client's log and restore consistency by rolling back all uncommitted updates and replaying all updates associated with committed transactions that may not have been flushed to the snapshot prior to the failure. An essential underlying assumption here is that once log recovery is initiated, no additional requests from the suspected process will reach the snapshot. A violation of this assumption could result in the corruption of logs and application data.
Ensuring data safety in a shared-disk environment has traditionally required a set of partial synchrony assumptions to allow reliable heartbeat-driven failure detection and/or leases. For example, lease-based mechanisms typically expect bounded clock drift rates and message delivery delays to ensure the absence of in-flight I/O requests upon lease termination. However, these assumptions are probabilistic at best, and since application data integrity is predicated on their validity, failure timeouts must be tuned to very conservative values to account for worst-case delays in switch queues and client-side buffering. Such (necessarily) pessimistic timeouts may have a profoundly negative impact on failure recovery times - one of the common criticisms of SAN-oriented applications [16].
Another serious limitation of today's SAN applications concerns liveness. The DLM (or lease manager) represents an additional point of failure, and while various fault tolerance techniques can be applied to improve its availability, the very nature of the semantics enforced by the DLM places a fundamental constraint on the overall system availability. For instance, multiple lock manager replicas can be deployed in a cluster, but mutual exclusion can be guaranteed only if clients' requests are presented to them in a consistent order, which necessitates consensus mechanisms such as Paxos [31]. Alternatively, a single lock manager instance can be elected dynamically [27] from a group of candidates, in which case ensuring mutual exclusion necessitates global agreement on the lock manager's identity. In both cases, reaching agreement fundamentally requires access to an active primary component - typically a majority of nodes. As a result, a large-scale node failure or a network partition that renders the primary component unavailable or unreachable may bring about a system-wide outage and complete loss of service.
To summarize, today's SAN applications and middleware face significant limitations along the dimensions of safety and liveness. At present, several hardware-assisted techniques, such as out-of-band power management (STOMITH) [3], SAN fabric fencing [1], and SCSI-3 PR [11], can be employed to mitigate some of these issues. These mechanisms help reduce the likelihood of data corruption under common failure scenarios, but they do not provide the desired assurances of safety and liveness in the general case and, as we argue, do not address the root cause. We observe that the underlying problem is a capability mismatch between "intelligent" application processes, which possess full knowledge of the application's data structures, their disk layout, and consistency semantics, and relatively "dumb" storage devices. The safety and liveness problems illustrated above can be attributed to a disk controller's inability to identify and react appropriately to application-level events such as lock release, failure suspicion, and failure recovery actions.
Rather than restricting access to critical code sections, our approach views the access coordination problem in terms of I/O request ordering guarantees that the storage system must provide to application processes. We refer to this alternate notion of correctness as session isolation.
We define this correctness property in formal terms below and then present a protocol that achieves session isolation with the help of guard logic. Finally, we demonstrate how distributed multi-resource transactions can be supported using session isolation as a building block.
We define Sessions(c, R, t) to be the set of all sessions to resource R from client c active at time t, which is determined solely by the sequence of c's prior upgrade and downgrade requests to the lock service. Sessions(c, R, t) may contain a shared session to R, an exclusive session, both, or neither.
We say that a shared session conflicts with every exclusive session to the same resource, and an exclusive session conflicts with every other session to R.
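This condition can be stated formally as follows; the notation is illustrative (obs(O) denotes the totally-ordered sequence of I/O requests accepted by the owner O of R):

```latex
% Session isolation: the owner of R observes session prefixes serially.
% Let obs(O) be the sequence of I/O requests accepted by the owner O of
% resource R, and let s and s' be two mutually conflicting sessions to R.
\forall\, r_1, r_2 \in s,\ \forall\, r' \in s' :\quad
  \neg \left( r_1 <_{obs(O)} r' <_{obs(O)} r_2 \right)
```

That is, no request from one session may be observed between two requests of a conflicting session.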
Informally, the above condition requires the owner of R to observe the prefixes of all sessions to R in strictly serial order, ensuring that no two requests in a client's session are interleaved by a conflicting request from another client. To illustrate this definition, consider a pair of concurrent request sequences from clients C1 and C2 shown in Figure 1. In this scenario, two orderings of request observations by the owner of shared resource R would satisfy session isolation: O1, in which the owner observes all of C1's requests before any of C2's, and O2, in which the owner observes only a prefix of C1's session before observing C2's requests.
Note that session isolation is more permissive than strict mutual exclusion; in particular, it permits execution histories in which two clients simultaneously hold conflicting locks on the same shared resource. At the same time, one could argue that these semantics meaningfully capture the essence of shared-disk locking, by which we mean that the request ordering guarantees provided by session isolation are precisely those that application developers have come to expect from a traditional DLM. To see this, observe that in the previous example, a conventional lock service offering full mutual exclusion would cause the owner to observe O1 by granting clients' requests in the order C1, then C2. Likewise, O2 corresponds to a possible failure scenario in which C1 crashes after acquiring its locks, causing the DLM to reclaim them and grant ownership to C2.
A session annotation for a disk command operating on R has two components: a session verifier and a session update, denoted by verifySID and updateSID, respectively. For commands that belong to an existing session, the verifier enables the target to confirm session validity prior to accepting the command, while the update component is used by the initiator to signal the start of a new session.
For each shared resource R, its owner device maintains a local session identifier (denoted R.ownerSID) on persistent storage. Upon receipt of an I/O command from an initiator, the owner invokes the guard, which evaluates the command's session annotation against R.ownerSID and determines whether session isolation would be preserved by accepting the command. Functionally, the guard operation is a form of compare-and-set, and we describe this operation in detail in Section 3.3.
If an incoming I/O request fails verification, the target drops the request from its input queue and notifies the initiator via a special status code EBADSESSION. From an application developer's point of view, session rejection appears as a failed disk request along with an exception notification from the lock service indicating that a lock on the respective resource is no longer valid.
The guard situated at the target devices addresses the safety problems due to delayed messages and inconsistent failure observations that plague asynchronous distributed environments, and enforcing safety at the target device permits us to simplify the core functionality of the DLM module. In our scheme, the primary purpose of the lock service is to ensure an efficient assignment of session identifiers to clients, one that minimizes the frequency of command rejection for a given application workload.
Decoupling correctness from performance in this manner enables substantial flexibility in the choice of mechanism used to control the assignment of session identifiers. At one extreme is a purely optimistic technique, whereby every client selects its SIDs via an independent local decision, without attempting to coordinate with the rest of the cluster; this may be an entirely reasonable strategy for applications and workloads characterized by a consistently low rate of data contention. A traditional DLM service that serializes all session requests at a central lock server can be viewed as the design point at the other extreme. Minuet aims to position itself in the continuum between these extremes and allow application developers to trade off lock service availability, synchronization overhead, and I/O performance.
Client-side state: For each shared resource R, a client c maintains a pair of session identifiers for its shared and exclusive sessions to R, denoted by c.sharedSID[R] and c.exclSID[R], respectively. Additionally, c.curSType[R] identifies the current session type, one of {None, Shared, Excl}, and holds the client's session continuation type. The latter value is used by the target device to verify (prior to executing a request from c) that its existing session has not been broken by a conflicting request from another client. Finally, every client maintains an estimate of the largest shared and exclusive timestamp values previously assigned to a session identifier for any client, which we denote by c.maxTS[R].s and c.maxTS[R].x. Initially, c.curSType[R] = None, c.sharedSID[R] = c.exclSID[R] = NIL, and c.maxTS[R].s = c.maxTS[R].x = 0. The steps and states of the basic locking and storage access protocols are illustrated in Figure 2.
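For concreteness, the per-resource client state can be rendered as the following C sketch; the field names mirror the notation above and are illustrative rather than taken from the Minuet sources:

```c
#include <stdint.h>

/* A session identifier: a logical timestamp made globally unique by
 * pairing it with the proposing client's ID (illustrative encoding). */
typedef struct {
    uint64_t ts;        /* proposed logical timestamp              */
    uint32_t clnt_id;   /* tie-breaker that makes the pair unique  */
} sid_t;

typedef enum { STYPE_NONE, STYPE_SHARED, STYPE_EXCL } stype_t;

/* Per-resource state kept by a client c for resource R. */
typedef struct {
    sid_t    shared_sid;   /* c.sharedSID[R]; zeroed value = NIL      */
    sid_t    excl_sid;     /* c.exclSID[R];   zeroed value = NIL      */
    stype_t  cur_stype;    /* c.curSType[R]: None, Shared, or Excl    */
    uint64_t max_ts_s;     /* c.maxTS[R].s: largest shared ts seen    */
    uint64_t max_ts_x;     /* c.maxTS[R].x: largest exclusive ts seen */
} client_res_state_t;

/* Initial state: no session, zeroed timestamp estimates. */
static const client_res_state_t INITIAL_STATE = { 0 };
```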
Acquiring locks: To acquire or upgrade a lock on resource R, a client c proposes a unique session timestamp pair ⟨T.s, T.x⟩ to the lock manager. To acquire a Shared lock on R, c sets T.x = c.maxTS[R].x and sets T.s to some unique timestamp greater than c.maxTS[R].s. The client then sends an UpgradeLock request to the lock manager, specifying the desired mode (Shared) and the proposed timestamp pair. The lock manager accepts and enqueues this request if no request with a larger timestamp value has been accepted. Otherwise, the manager denies the request and responds with UpgradeDenied, which includes the largest timestamp values observed by the manager. In the latter case, the client updates its local estimates c.maxTS[R] and submits a new proposal. After accepting and enqueuing c's request, the lock manager eventually grants it and responds with UpgradeGranted. The client then sets c.curSType[R] = Shared and initializes the shared session identifier: c.sharedSID[R] = ⟨T.s, T.x⟩.

To upgrade a lock from Shared to Excl, the client sends UpgradeLock to the lock manager after setting T.s to the timestamp of its current shared session and T.x to some unique timestamp greater than c.maxTS[R].x. Upon receiving UpgradeGranted from the lock manager, the client sets c.curSType[R] = Excl and c.exclSID[R] = ⟨T.s, T.x⟩. Upgrading from None to Excl is functionally equivalent to acquiring a Shared lock and then upgrading to Excl, but as an optimization, these operations can be combined into a single request to the lock manager.
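The resulting propose-retry loop is sketched below, reusing the types from the previous sketch; send_upgrade is a hypothetical stand-in for the client-to-manager RPC, not the actual Minuet interface:

```c
typedef struct { uint64_t ts_s, ts_x; } ts_pair_t;
typedef enum { UPGRADE_GRANTED, UPGRADE_DENIED } upgrade_status_t;

/* Hypothetical RPC: sends UpgradeLock(R, mode, proposal) and blocks for
 * the reply; on denial, *observed holds the manager's largest accepted
 * timestamps. */
extern upgrade_status_t send_upgrade(uint32_t res_id, stype_t mode,
                                     ts_pair_t proposal, ts_pair_t *observed);

void acquire_shared(uint32_t res_id, client_res_state_t *st, uint32_t clnt_id)
{
    for (;;) {
        /* Propose: keep the exclusive estimate, pick a fresh shared ts. */
        ts_pair_t prop = { .ts_s = st->max_ts_s + 1, .ts_x = st->max_ts_x };
        ts_pair_t observed;

        if (send_upgrade(res_id, STYPE_SHARED, prop, &observed)
                == UPGRADE_GRANTED) {
            st->cur_stype  = STYPE_SHARED;
            st->shared_sid = (sid_t){ .ts = prop.ts_s, .clnt_id = clnt_id };
            return;
        }
        /* Denied: adopt the manager's larger timestamps and re-propose. */
        if (observed.ts_s > st->max_ts_s) st->max_ts_s = observed.ts_s;
        if (observed.ts_x > st->max_ts_x) st->max_ts_x = observed.ts_x;
    }
}
```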
Accessing shared storage: After establishing a session to R by acquiring a corresponding lock, client c can proceed to issuing disk requests that operate on the content of R. Each outgoing request is augmented with a session annotation that enables the target device to verify proper ordering of requests and enforce session isolation. The annotation carries a tuple of the form ⟨verifySID, updateSID⟩ and is initialized as shown in Figure 3.
Upon receipt of a disk request from a client, the owner device invokes the guard logic, which evaluates the session annotation as specified in Figure 3. In the event of rejection, the owner immediately discards the command and sends an EBADSESSION response to the client, together with a response annotation carrying R.ownerSID. Otherwise, the owner executes (or enqueues) the command and updates its local session identifier as shown in the figure.
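The accept/reject decision can be summarized in the following sketch, which reuses the earlier types; it simplifies Figure 3 rather than transcribing it, treating the owner's session identifier as a pair of shared and exclusive timestamps:

```c
/* Owner-side guard (simplified). A command is accepted only if its
 * verifier shows that no conflicting session has been established at
 * the owner since the client's own session began. */
typedef struct {
    ts_pair_t verify;   /* verifySID: client's view of the owner state */
    ts_pair_t update;   /* updateSID: timestamps this command claims   */
    stype_t   stype;    /* session type under which the command runs   */
} annotation_t;

int guard_accept(ts_pair_t *owner_sid, const annotation_t *a)
{
    if (a->stype == STYPE_SHARED) {
        /* A shared session conflicts only with exclusive sessions. */
        if (a->verify.ts_x < owner_sid->ts_x)
            return 0;                            /* reject: EBADSESSION  */
        if (a->update.ts_s > owner_sid->ts_s)
            owner_sid->ts_s = a->update.ts_s;    /* record newest shared */
    } else {
        /* An exclusive session conflicts with every other session. */
        if (a->verify.ts_x < owner_sid->ts_x ||
            a->verify.ts_s < owner_sid->ts_s)
            return 0;                            /* reject: EBADSESSION  */
        owner_sid->ts_s = a->update.ts_s;
        owner_sid->ts_x = a->update.ts_x;
    }
    return 1;                                    /* accept and execute   */
}
```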
Upon receipt of an EBADSESSION status code, the initiator examines the response annotation and notifies the application process that its lock and session on R are no longer valid. A response annotation carrying a newer shared timestamp indicates interruption of an exclusive session, in which case the client downgrades its lock to Shared, sets c.curSType[R] = Shared, and sets c.exclSID[R] = NIL. A lock is further downgraded to None if the annotation carries a newer exclusive timestamp (since in this case, a conflicting exclusive-session request has been accepted). In this situation, the client sets c.curSType[R] = None and c.sharedSID[R] = NIL. In both cases, the maximum timestamp estimates c.maxTS[R].s and c.maxTS[R].x are updated to reflect the most recent timestamps observed by the owner.
Upon receiving a success status code, the client examines the response annotation and updates the shared session identifier to reflect the most recent value observed by the owner. (This step is necessary to ensure that a shared session remains valid after a Shared-to-Excl upgrade or a downgrade to Shared.)
Downgrading locks: To downgrade an existing lock from Excl to Shared, the client sends a DowngradeLock request to the lock manager and resets the exclusive-session state: c.curSType[R] = Shared, c.exclSID[R] = NIL. Similarly, to downgrade from Shared to None, the client notifies the lock manager and sets c.curSType[R] = None, c.sharedSID[R] = NIL. Upon receipt of a DowngradeLock request, the manager updates the ownership state for R and, if possible, grants the lock to the next waiter in the queue.
Correctness: The locking protocol and the guard described above guarantee session isolation, and a formal correctness argument can be found in [22]. Informally, consider two clients C1 and C2 that compete for shared and exclusive access to R, respectively, and suppose that a shared-session request from C1 gets accepted with ⟨T.s1, T.x1⟩ in its annotation. Observe that due to the global uniqueness of session proposals, the owner of R would subsequently accept an exclusive-session request from C2 with verifier ⟨T.s2, T.x2⟩ only if T.x2 is strictly greater than T.x1. In this case, subsequent shared-session requests from C1 would fail verification, causing C1 to observe EBADSESSION and downgrade its lock. Thus, session isolation would be preserved in this example via a forced termination of C1's session. A similar argument demonstrates that no two exclusive-session commands can be interleaved by a conflicting command from another client.
(1) Avoid introducing assumptions of synchrony required by conventional transaction schemes for SAN environments. We rely on the guard at target devices to provide session isolation and protect the state on disk from the effects of arbitrarily-delayed I/O commands operating on the application data and the log.
(2) Eliminate reliance on strongly-consistent locking. Rather than requiring clients to coordinate concurrent activity via a strongly-consistent DLM, the guard at storage devices enables a limited form of isolation and permits us to relax the degree of consistency required from the lock service. Prior to committing a transaction, a client process in Minuet issues an extra disk request, which verifies the validity of all locks acquired at the start of the transaction. This mechanism allows us to identify and resolve cases of conflicting access due to inconsistent locking state at commit time and can be viewed as a variant of optimistic concurrency control - a well-known technique from the DBMS literature [30].
(3) Avoid enforcing a globally-consistent view of process liveness. Rather than relying on a group membership service to detect client failures and initiate log recovery proactively in response to perceived failures, our design explores a lazy approach to transaction recovery that postpones the recovery action until the affected data is accessed. This enables Minuet to operate without global agreement on group membership.
To support transactions, we extend the basic session isolation machinery described in Section 3.3 with an additional piece of state called a commit session identifier (CSID), a tuple of the form ⟨clntID, xactID⟩. We extend the format of a session annotation to include two commit session identifiers, denoted verifyCSID and updateCSID, and both are set to NIL unless specified otherwise. For each shared resource R, the owner device maintains a local commit session identifier (R.ownerCSID) alongside R.ownerSID. Conceptually, the value of R.ownerCSID at a particular point in time identifies the most recent transaction that may have updated R and committed without flushing its changes to the disk image of R. If R.ownerCSID ≠ NIL, the current state of R on disk may be missing updates from a committed transaction and thus cannot be assumed valid. In this case, R.ownerCSID.clntID identifies the client process responsible for the latest transaction on R, and it is used to locate the corresponding log for recovery purposes.
Upon receiving a disk request, the guard examines the annotation and rejects the request if the session verifier fails the checks described in Section 3.3 or if verifyCSID ≠ R.ownerCSID. A request is accepted only if its session verifier and its commit-session verifier both pass verification, and upon completing the request, the owner device updates its local commit session identifier by setting R.ownerCSID = updateCSID. If verification fails, the owner responds with EBADSESSION and attaches the tuple ⟨R.ownerSID, R.ownerCSID⟩ in a response annotation.
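The commit-session check composes with the session check; the sketch below extends the earlier guard_accept, using an illustrative encoding in which a zeroed csid_t stands for NIL:

```c
/* Commit session identifier ⟨clntID, xactID⟩; zeroed value = NIL. */
typedef struct { uint32_t clnt_id; uint64_t xact_id; } csid_t;

typedef struct {
    annotation_t base;      /* session annotation from Section 3.3   */
    csid_t verify_csid;     /* expected R.ownerCSID                  */
    csid_t update_csid;     /* new R.ownerCSID if the command passes */
} xact_annotation_t;

static int csid_eq(csid_t a, csid_t b)
{
    return a.clnt_id == b.clnt_id && a.xact_id == b.xact_id;
}

/* Accept only if both verifiers match the owner's state; on success,
 * adopt update_csid (e.g., NIL once committed updates are flushed). */
int guard_accept_xact(ts_pair_t *owner_sid, csid_t *owner_csid,
                      const xact_annotation_t *a)
{
    if (!csid_eq(a->verify_csid, *owner_csid))
        return 0;               /* reject: disk image may miss a commit */
    if (!guard_accept(owner_sid, &a->base))
        return 0;               /* session verification failed          */
    *owner_csid = a->update_csid;
    return 1;
}
```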
In Minuet, transactions proceed in five stages: Begin, Read, Update, Prepare, and Commit; we illustrate them using high-level pseudocode in [22]. During one-time client initialization, Minuet's transaction service locks the local client's log in Excl mode. To begin a new transaction, the client selects a new transaction identifier (xactID) via a monotonic local counter and appends a BeginXact record to its log. Next, in the Read phase of a transaction, the application process acquires a lock on every resource in the transaction's read set and reads the corresponding data from shared disks into local memory buffers. In the Update phase that follows, the client acquires locks on the elements of the write set, applies the desired set of updates locally, and communicates a description of the updates to Minuet's transaction service, which appends the corresponding set of records to the log. Each such record describes an atomic mutation on some resource in the write set and essentially stores the parameters of a single disk command.
The Prepare phase serves a dual purpose: to verify the validity of the client's sessions (and hence, the accuracy of cached data) and to lock the elements of the write set in preparation for committing. For each resource in the read and write sets, the client sends a special Prepare request to its owner. Minuet implements these requests as zero-length Writes, whose sole purpose is to transport an annotation and invoke the guard. Prepare requests for elements of the write set carry updateCSID = ⟨clntID, xactID⟩ in their annotations, where clntID is the client's identifier. If all requests return success, the transaction enters the final Commit phase, in which a CommitXact record is force-appended to client c's log.
The protocol outlined above ensures transaction isolation, identifying cases of conflicting access during the Prepare phase. Recall, however, that under the session isolation semantics, any I/O command, including operations on the log, may fail with EBADSESSION due to conflicting access from other clients. This gives rise to several exception cases at various stages of transaction execution. For example, a client may receive an error while forcing CommitXact to disk due to a loss of session to the log. This can happen only if another process has initiated log recovery, and hence the active transaction must be aborted. Other failure cases and the corresponding recovery logic are described in the report [22].
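Putting the stages together, a client-side skeleton might look as follows; the minuet_* and log_* calls are hypothetical stand-ins for the library interface, and csid_t is reused from the previous sketch:

```c
extern void minuet_lock(uint32_t res, stype_t mode);
extern int  minuet_read(uint32_t res, void *buf);
extern void log_append(const char *rec_type, uint64_t xact_id);
extern int  minuet_prepare(uint32_t res, csid_t update_csid); /* 0-length write */
extern int  log_force_commit(uint64_t xact_id);

static const csid_t NIL_CSID = { 0, 0 };
static uint64_t next_xact_id;                /* monotonic local counter */

int run_transaction(const uint32_t *read_set,  int n_r,
                    const uint32_t *write_set, int n_w,
                    uint32_t clnt_id, void *bufs)
{
    uint64_t xid = ++next_xact_id;                        /* Begin   */
    log_append("BeginXact", xid);

    for (int i = 0; i < n_r; i++) {                       /* Read    */
        minuet_lock(read_set[i], STYPE_SHARED);
        if (!minuet_read(read_set[i], bufs)) return -1;
    }
    for (int i = 0; i < n_w; i++) {                       /* Update  */
        minuet_lock(write_set[i], STYPE_EXCL);
        log_append("Update", xid);   /* one record per local mutation */
    }
    csid_t mine = { .clnt_id = clnt_id, .xact_id = xid };
    for (int i = 0; i < n_r; i++)                         /* Prepare */
        if (!minuet_prepare(read_set[i], NIL_CSID)) return -1;
    for (int i = 0; i < n_w; i++)
        if (!minuet_prepare(write_set[i], mine)) return -1;

    return log_force_commit(xid) ? 0 : -1;                /* Commit  */
}
```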
Syncing updates to disk: After committing a transaction, a client c flushes its locally-buffered updates to R simply by issuing the corresponding sequence of Write commands to R's owner device. Each such command specifies updateCSID = ⟨c, xactID⟩ in its annotation, where xactID denotes c's most recent committed transaction that modified R. After flushing all committed updates, c issues an additional zero-length Write request, which specifies verifyCSID = ⟨c, xactID⟩ and updateCSID = NIL in the annotation. This request causes the device to reset R.ownerCSID to NIL, effectively marking the disk image of R as "clean". Lastly, c appends to its log a record noting that the updates to R from transaction xactID have been flushed.
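In sketch form, the flush sequence reduces to a pair of annotated writes followed by a log record; minuet_write_annotated is again a hypothetical helper that attaches the given commit-session identifiers to an outgoing (possibly zero-length) write:

```c
#include <stddef.h>

extern int minuet_write_annotated(uint32_t res, const void *buf, size_t len,
                                  csid_t verify_csid, csid_t update_csid);

void flush_resource(uint32_t res, const void *buf, size_t len,
                    uint32_t clnt_id, uint64_t last_xact)
{
    csid_t mine = { .clnt_id = clnt_id, .xact_id = last_xact };

    /* Data writes keep R.ownerCSID pointing at the latest transaction. */
    minuet_write_annotated(res, buf, len, mine, mine);

    /* A zero-length write resets R.ownerCSID to NIL: R is now "clean". */
    minuet_write_annotated(res, NULL, 0, mine, NIL_CSID);

    /* Record in the WAL that R's disk image reflects last_xact. */
    log_append("Flushed", last_xact);
}
```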
Lazy transaction recovery: A client c2 can initiate transaction recovery when its disk command on some resource R fails with EBADSESSION and a non-NIL value of R.ownerCSID is specified in the response annotation. This response indicates that the disk image of R may be missing updates from a transaction committed earlier by another client c1. If c2 suspects that c1 has failed, it invokes a local recovery procedure that tries to repair the disk image of R. First, c2 acquires exclusive locks on R and on c1's log and reads the log from disk. Next, c2 searches the log for the most recent transaction that has successfully flushed its updates to R, from which it determines the list of subsequent committed updates that may be missing from the disk image. The client then proceeds to repairing the state of R on disk by reapplying these updates, and all Write requests sent to the owner during this phase specify verifyCSID = R.ownerCSID in the annotation. Finally, after reapplying all missing updates, c2 completes recovery by issuing a zero-length Write annotated with verifyCSID = R.ownerCSID and updateCSID = NIL. A more detailed discussion of transaction recovery in Minuet can be found in [22].
To support manager replication, we extend the basic locking protocol presented in Section 3.3 as follows: When acquiring or upgrading a lock, a client selects a subset of managers, which we call its voter set, and sends an UpgradeLock request to all members of this set. The lock is considered granted once UpgradeGranted votes are collected from all members. If any of the voters respond with UpgradeDenied due to an outdated timestamp, the client downgrades the lock on all members that have responded with UpgradeGranted, updates its c.maxTS[R].s and c.maxTS[R].x values, and resubmits the upgrade request with a new timestamp proposal. As a performance optimization, we allow UpgradeLock requests to specify an implicit downgrade for an earlier timestamp.
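The client-side voting loop is sketched below, reusing the earlier types; the per-replica RPC stubs are hypothetical:

```c
extern upgrade_status_t send_upgrade_to(uint32_t mgr, uint32_t res_id,
                                        stype_t mode, ts_pair_t proposal,
                                        ts_pair_t *observed);
extern void send_downgrade_to(uint32_t mgr, uint32_t res_id, ts_pair_t prop);

int acquire_replicated(uint32_t res_id, stype_t mode, client_res_state_t *st,
                       const uint32_t *voters, int n_voters)
{
    for (;;) {
        ts_pair_t prop = {
            .ts_s = (mode == STYPE_SHARED) ? st->max_ts_s + 1 : st->max_ts_s,
            .ts_x = (mode == STYPE_EXCL)   ? st->max_ts_x + 1 : st->max_ts_x,
        };
        int denied = 0;
        for (int i = 0; i < n_voters; i++) {
            ts_pair_t obs;
            if (send_upgrade_to(voters[i], res_id, mode, prop, &obs)
                    == UPGRADE_DENIED) {
                denied = 1;   /* adopt the replica's larger estimates */
                if (obs.ts_s > st->max_ts_s) st->max_ts_s = obs.ts_s;
                if (obs.ts_x > st->max_ts_x) st->max_ts_x = obs.ts_x;
            }
        }
        if (!denied)
            return 0;         /* UpgradeGranted votes from all members */
        /* Some replica denied: release partial grants and re-propose.
         * (Downgrading every member is a simplification; only those
         * that granted hold state for this proposal.) */
        for (int i = 0; i < n_voters; i++)
            send_downgrade_to(voters[i], res_id, prop);
    }
}
```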
The bottom-level driver implements a TCP encapsulation of SCSI, and our current prototype builds on the Open-iSCSI Initiator driver [6], v2.0-869.2. We used the additional header segment (AHS) feature of iSCSI to attach Minuet annotations to command PDUs and defined a new AHS type for this purpose.
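An AHS is a type-length-value extension carried in the header of an iSCSI command PDU. The sketch below illustrates one possible wire layout for the annotation segment; the type code and payload fields are illustrative, not the prototype's actual encoding:

```c
#include <stdint.h>

#define AHS_TYPE_MINUET_ANNOTATION 0x40   /* hypothetical type code */

/* An iSCSI Additional Header Segment: a 2-byte length and a 1-byte
 * type code, followed by type-specific data (big-endian on the wire).
 * The Minuet annotation travels as the type-specific payload. */
struct minuet_ahs {
    uint16_t ahs_length;                 /* length of the payload      */
    uint8_t  ahs_type;                   /* AHS_TYPE_MINUET_ANNOTATION */
    uint8_t  reserved;
    uint64_t verify_ts_s, verify_ts_x;   /* session verifier           */
    uint64_t update_ts_s, update_ts_x;   /* session update             */
    uint32_t csid_clnt_id;               /* commit session identifier  */
    uint64_t csid_xact_id;
} __attribute__((packed));
```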
Our storage backend is based on the iSCSI Enterprise Target driver [5], v0.4.16, which exposes a local block device to remote initiators via iSCSI. We extended it with the guard logic, which examines incoming PDUs and makes an accept/reject decision based on the annotation. Command rejection is signaled to the initiator via the Reject PDU defined by the iSCSI standard.
The addition of guard logic represents the most substantial extension to the SAN protocol stack, but incurs only a modest increase in the overall complexity. The initial implementation of the Enterprise Target driver contained 14,341 lines of code and augmenting it with Minuet guard logic required adding 348 lines.
In each iteration, we measure the aggregate goodput (the number of successful application-level operations per second) from all nodes and the rate of disk command rejection under the following locking configurations:
strong(k): We deploy a total of 2k-1 lock managers and require clients to obtain permissions from a majority (k of the 2k-1). Note that strong(1) represents a traditional locking protocol with a single central lock manager, while strong(2) requires 3 lock manager replicas and masks one failure.
weak-own: An extreme form of weakly-consistent locking. Each client obtains permissions only from the local lock manager (co-located on the same machine) and does not attempt to coordinate with the other clients.
In all of our experiments, applications rely on Minuet to provide both modes of locking and do not make use of any other synchronization facilities.
uniform: In each operation, a chunk to be modified is selected uniformly at random.
hotspot(h): h% of operations touch a hotspot region of the chunkmap constituting 0.1% of the entire dataset.
Table 1 reports the aggregate goodput under the uniform workload, which represents a low-contention scenario. The goodput exhibits linear scaling with the number of storage servers, and there is no measurable difference in performance between the three locking configurations. These results suggest that the optimistic method of coordination enabled by Minuet does not adversely affect application performance, while still providing safety, in scenarios where the overall I/O load is high but contention for an individual resource is relatively rare.
The rate of I/O rejection increases when the workload has hotspots and, as expected, weak-own suffers a performance hit proportional to the popularity of the hotspot (Figure 5). We note that the hotspot workload represents a very stressful case (the hotspot size is 0.1%) and our results demonstrate that weakly-consistent locking degrades gracefully and can still provide reasonable performance in such scenarios.
We also ran experiments in a partitioned network scenario, where each client can communicate with only a subset of replicas. A strongly-consistent locking protocol demands a well-connected primary component containing at least a majority of manager replicas - a condition that our partitioned scenario fails to satisfy. As a result, no client can make progress with traditional strong locking and the overall application goodput is zero. In contrast, under Minuet's weak locking, clients can still make good progress. This demonstrates the availability benefits that Minuet gains over a traditional DLM design.
5.3 Distributed B+ tree

The B+ tree application demonstrates Minuet's support for serializable transactions. In this experiment, we start with a pre-populated tree and run the application for 5 minutes on 32 client nodes. Each client inserts a series of randomly-generated keys, and we measure the aggregate goodput, defined as the total rate of successful insertions per second from all clients. To test Minuet's behavior under different transaction complexity and contention scenarios, we used two different pre-populated B+ trees, whose parameters are given in Figure 6.
Figure 7(left) compares the performance of strong(1) and weak-own. Under both locking schemes, the throughput exhibits near-linear scaling with the number of storage targets. As expected, tree-large demonstrates a lower aggregate transaction rate because each transaction requires accessing a longer chain of nodes. Moreover, since the number of leaf nodes is large, read-write or write-write contention is relatively infrequent and hence, the performance penalty due to I/O rejection incurred under weak-own is negligible. By contrast, tree-small represents a high-contention workload and our results suggest that even in this stressful scenario, Minuet's weak locking incurs only a modest performance penalty.
Further investigation revealed that the primary cause of the performance degradation was an outdated estimate of the maximum timestamps c.maxTS[R], causing some of the commands to carry outdated session identifiers (e.g., with a stale exclusive timestamp). Under weak-own, clients select session identifiers without coordinating with other clients and hence, a client may not know the up-to-date timestamp values that may have been established by an earlier transaction from another client. A simple optimization that alleviates this issue is to let clients lazily synchronize their knowledge of the maximum timestamps. More specifically, each client can broadcast its local updates on c.maxTS[R].s and c.maxTS[R].x to other clients at some fixed broadcast rate (r_bcast), and other clients can update their local estimates accordingly. We implemented this optimization and measured its effects on the tree-small workload with 4 storage targets. Figure 7(right) shows the results, which suggest that broadcasting at a modest rate substantially reduces the rate of rejection, and the resulting goodput closely approaches the maximum value achievable under strong locking. Note that this optimization affects only performance and is not required for safety. Conceptually, the broadcast rate parameterizes the continuum between traditional locking and the fully optimistic case of weak-own; other methods of parameterizing this continuum may be possible.
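In sketch form, the optimization amounts to a periodic broadcast and a monotonic merge of timestamp estimates; broadcast_to_peers is a hypothetical transport stub:

```c
#include <stddef.h>

extern void broadcast_to_peers(const void *msg, size_t len);

typedef struct { uint32_t res_id; uint64_t max_ts_s, max_ts_x; } ts_gossip_t;

/* Called r_bcast times per second for each actively-used resource. */
void gossip_tick(const client_res_state_t *st, uint32_t res_id)
{
    ts_gossip_t msg = { res_id, st->max_ts_s, st->max_ts_x };
    broadcast_to_peers(&msg, sizeof msg);
}

/* Monotonic merge: estimates only grow, so stale or lost gossip can
 * hurt performance (more rejections) but never safety. */
void gossip_recv(client_res_state_t *st, const ts_gossip_t *msg)
{
    if (msg->max_ts_s > st->max_ts_s) st->max_ts_s = msg->max_ts_s;
    if (msg->max_ts_x > st->max_ts_x) st->max_ts_x = msg->max_ts_x;
}
```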
In this section, we discuss several issues pertaining to the practical feasibility of our approach and the implications of Minuet's programming model.