1 Consensus Census Pallavi Srivastava, Kartik Palani

2 State Machine Replication A fault-tolerant service where servers are replicated to coordinate interactions with the client. Multiple copies of the same state machine begin in the same Start state; receiving the same inputs in the same order, they arrive at the same state, having generated the same outputs.
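The determinism property on this slide can be sketched in a few lines; the counter state machine and all names here are illustrative, not from any paper:

```python
# Minimal sketch of state machine replication: deterministic replicas
# fed the same inputs in the same order end in the same state and
# produce the same outputs.

class CounterStateMachine:
    """A trivial deterministic state machine: an integer counter."""
    def __init__(self):
        self.state = 0          # the common Start state

    def apply(self, op):
        """Apply one input and return the generated output."""
        if op == "inc":
            self.state += 1
        elif op == "dec":
            self.state -= 1
        return self.state       # output after applying the input

inputs = ["inc", "inc", "dec", "inc"]          # same inputs, same order
replicas = [CounterStateMachine() for _ in range(3)]
outputs = [[r.apply(op) for op in inputs] for r in replicas]

# All replicas arrive at the same state with the same outputs.
assert all(r.state == replicas[0].state for r in replicas)
assert all(o == outputs[0] for o in outputs)
```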

3 Types of Failures Crash-stop failures: a failed replica is guaranteed to stop without generating outputs. Byzantine failures: a replica sends different values in different directions (for instance, the correct output to some of its fellow replicas and incorrect outputs to others).

4 Liveness and safety Informally, a safety property stipulates that "bad things" do not happen during execution of a program and a liveness property stipulates that "good things" do happen (eventually) (Lamport 1977)

5 Linearizability and Total Order A history of events is linearizable if there is a linear order of the completed operations such that: its invocations and responses can be reordered to yield a sequential history; that sequential history is correct according to the sequential definition of the object; and if a response preceded an invocation in the original history, it must still precede it in the sequential reordering. Total order: messages are delivered to the replicas in the same relative order.

6 Just Say NO to Paxos Overhead

7 The Big Picture

8 Can We Eliminate Paxos Overhead? State machine replication is used for data availability and for masking server failures. We need strongly consistent data without extensive coordination. Data center networks are different from the Internet, so it is possible to exploit network ordering to ensure the above.

9 Replica Consistency Traditionally, application-level Paxos is used for linearizability: Paxos decides whether a client request should be accepted or dropped and agrees on the order in which requests are processed, incurring significant performance overhead on each request. An alternative: a totally ordered atomic broadcast primitive supported by the network is equivalent to solving consensus.

10 The Solution? An Ordered Unreliable Multicast (OUM) primitive provided by the network at low implementation cost + a (near) coordination-free replication protocol: Network-Ordered Paxos.

11 Splitting Responsibility We need a single totally ordered set of operations. Ordering: if some replica processes request a before b, no replica processes b before a. Reliable delivery: every request submitted by a client is either processed by all replicas or by none. The network fabric takes care of ordering, while NOPaxos is responsible for reliable delivery.

12 Ordered Unreliable Multicast Model Implemented using a combination of a network-layer sequencer and a communication library called libOUM. A sequencer is designated in the network, and messages are sent through it; the sequencer increments a counter and writes the value into the packet header. Receivers use the sequence number to detect missing packets and packet reordering.
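The sequencer's job on this slide is small enough to sketch directly; it keeps one counter per OUM group and stamps each packet. Field names ("group", "seq") are our illustrative assumptions, not the paper's wire format:

```python
# Hedged sketch of an OUM sequencer: one counter per group, each packet
# stamped with the next sequence number for its group.

from collections import defaultdict

class Sequencer:
    def __init__(self):
        self.counters = defaultdict(int)   # one counter per OUM group

    def stamp(self, packet):
        """Write the next sequence number into the packet header."""
        group = packet["group"]
        self.counters[group] += 1
        packet["seq"] = self.counters[group]
        return packet

seq = Sequencer()
a = seq.stamp({"group": "replicas-1", "payload": "op-a"})
b = seq.stamp({"group": "replicas-1", "payload": "op-b"})
assert (a["seq"], b["seq"]) == (1, 2)   # consecutive numbering per group
```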

13 The OUM Workflow A set of receivers is identified by a common IP address and called a group. The stream of messages sent to a group is divided into consecutive sessions. Within a session, even on an asynchronous and unreliable network, ordered multicast and multicast drop detection are guaranteed. The libOUM interface exists on both the sender and receiver ends to facilitate this. If a session terminates, it is the replicas' responsibility to restart in a mutually consistent state.
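Receiver-side drop detection within a session can be sketched as follows. This is in the spirit of libOUM but simplified; the method name and return values are our assumptions:

```python
# Sketch of receiver-side logic: within a session, the receiver tracks
# the next expected sequence number. A jump forward means packets were
# dropped; lower numbers are stale reorderings or duplicates.

class OumReceiver:
    def __init__(self):
        self.expected = 1

    def deliver(self, seq, msg):
        """Classify an incoming packet by its sequence number."""
        if seq == self.expected:
            self.expected += 1
            return ("deliver", msg)
        if seq > self.expected:
            missed = seq - self.expected   # how many packets were dropped
            self.expected = seq + 1
            return ("drop-notification", missed)
        return ("stale", None)             # duplicate or reordered packet

r = OumReceiver()
assert r.deliver(1, "a") == ("deliver", "a")
assert r.deliver(3, "c") == ("drop-notification", 1)   # packet 2 missing
assert r.deliver(2, "b") == ("stale", None)            # arrived too late
```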

14 Sequencer Implementation Sequencers maintain one counter per OUM group. The paper suggests three possible implementations: 1. programmable switches (latency ~300 ns/packet); 2. middlebox sequencing using a network processor and OpenFlow switches; 3. an end-host, software-based approach (latency ~µs/packet).

15 Fault Tolerance The sequencer is a common point of failure. If the sequencer is unreachable because of link or other switch failures, the central controller reroutes around them. If the sequencer itself fails, the ordering guarantee of the multicast must not be lost, so the controller introduces a monotonically increasing session number. Receivers see a <Session #, Sequence #> tuple and can discard out-of-order messages; libOUM detects the presence of a new session. Because a drop notification for missed packets cannot be delivered across a session change, a session-terminated notification is sent instead. NOPaxos resolves this by initiating a view change (discussed later).

16 NOPaxos Built on top of the OUM guarantees; uses a view-based approach. Assumes crash failures with 2f + 1 replicas (f can fail). Its only responsibility is to decide whether a request should be executed or not, which is equivalent to binary consensus, and it only needs to run that when drop notifications are received. The normal case requires no coordination among replicas.

17 Normal Case Each view has a single session number and leader. All replicas reply to a client request, but only the leader executes the request and sends the result. The client accepts the result only if f + 1 replicas send matching replies.
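The client-side check above can be sketched as a quorum test. The reply fields ("view", "slot", "is_leader", "result") are illustrative assumptions about what a matching reply carries:

```python
# Sketch of the NOPaxos normal-case client check: accept a result only
# if f+1 replicas sent matching replies, one of them the leader, which
# alone executed the request.

def accept_result(replies, f):
    """Return the leader's result if f+1 replies match, else None."""
    by_key = {}
    for rep in replies:
        key = (rep["view"], rep["slot"])       # replies must match on these
        by_key.setdefault(key, []).append(rep)
    for matching in by_key.values():
        leaders = [r for r in matching if r["is_leader"]]
        if len(matching) >= f + 1 and leaders:
            return leaders[0]["result"]         # only the leader executed it
    return None

replies = [
    {"view": 1, "slot": 7, "is_leader": True,  "result": "ok"},
    {"view": 1, "slot": 7, "is_leader": False, "result": None},
]
assert accept_result(replies, f=1) == "ok"       # f+1 = 2 matching replies
assert accept_result(replies[:1], f=1) is None   # one reply is not enough
```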

18 Gap Agreement If a replica receives a drop notification, it must either recover the missing request from the leader or prevent the request from succeeding. If the leader itself receives a drop notification, it initiates a NO-OP commit; the NO-OP is committed once the leader receives f replies.
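The leader's NO-OP path can be reduced to a small predicate, under the assumption (ours, for illustration) that the log is a dict of slots and that the leader counts acknowledgments from the other replicas:

```python
# Hedged sketch of gap agreement at the leader: fill the dropped slot
# with NO-OP, committed once f other replicas acknowledge (f+1 copies
# total, counting the leader's own).

def commit_noop(log, slot, acks, f):
    """Fill `slot` with NO-OP; committed iff f replicas acknowledged."""
    log[slot] = "NO-OP"                 # the slot will never hold a request
    return len(acks) >= f

log = {}
assert commit_noop(log, slot=4, acks={"r2"}, f=1) is True
assert log[4] == "NO-OP"
```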

19 View Change Needed on leader failure or on a session-terminated notification. All successful operations from the old view are carried over to the new view, so replicas start from a consistent state. The new leader merges the logs from the replicas to create a stable log of successful requests and NO-OPs, then executes all of its operations before processing new ones. As an optimization, a synchronization protocol is run periodically.
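The log merge the new leader performs can be sketched as a per-slot union, simplified from the real protocol (which also tracks view and session numbers):

```python
# Sketch of the new leader's merge during a view change: for every slot
# seen in any replica's log, keep a real request if some replica has
# one; otherwise record a NO-OP, yielding one stable log.

def merge_logs(logs):
    """Combine per-replica logs (dicts slot -> entry) into a stable log."""
    merged = {}
    for slot in sorted(set().union(*[set(l) for l in logs])):
        entries = [l[slot] for l in logs if slot in l]
        requests = [e for e in entries if e != "NO-OP"]
        merged[slot] = requests[0] if requests else "NO-OP"
    return merged

logs = [
    {1: "op-a", 2: "NO-OP"},
    {1: "op-a", 2: "op-b", 3: "NO-OP"},
]
assert merge_logs(logs) == {1: "op-a", 2: "op-b", 3: "NO-OP"}
```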

20 Performance: Latency and Throughput A 3-level fat-tree network testbed was used; all clients and replicas ran on servers with 2.5 GHz Intel Xeon E processors and 64 GB of RAM, with five replicas. NOPaxos attains throughput within 2% of an unreplicated system and latency within 16 microseconds of it, even without using switches as sequencers (in-switch sequencing would shave off roughly another 8 microseconds).

21 Network Anomalies and Scalability NOPaxos maintains throughput under packet drops because no expensive reconciliation protocol is needed. The lack of coordination messages between replicas in the normal case ensures that throughput stays largely unaffected as the number of replicas increases.

22 XFT: Practical Fault Tolerance beyond Crashes  

23 Reliability Guarantees: CFT Does not consider any non-crash faults, but does very well for crash faults.

24 Reliability Guarantees: BFT Does consider non-crash faults, but does poorly in most of what it does. For crash faults in particular, it does not even guarantee what CFT guarantees.

25 The XFT Model: The Big Picture Do what CFT does, and a little more: use the same number of replicas as asynchronous CFT, preserve all reliability guarantees of asynchronous CFT, and provide correct service even under Byzantine failures (as long as things aren't too crazy, i.e., a majority of replicas are correct and communicate with each other synchronously). Based on the observation that total craziness is a rare event.

26 The XFT Model: System Model Machines: clients and replicas may be correct, benign (crash-faulty), or suffer Byzantine faults. Network: asynchronous, so network faults are possible; a delay 𝚫 is known to all replicas. Assumption: an eventually synchronous system, i.e., eventually no network faults occur.

27 The XFT Model: Definitions Anarchy: a severe system condition in which there is at least one non-crash-faulty replica, and (# of non-crash-faulty replicas) + (# of crash-faulty replicas) + (# of correct but partitioned replicas) > t, where t ≤ (n-1)/2 for n replicas. Partitioned replica: one that cannot communicate synchronously with the others. XFT protocol: a protocol P is an XFT protocol if P satisfies safety in all executions in which the system is never in anarchy.
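The anarchy condition on this slide transcribes directly into a predicate (variable names are ours):

```python
# The anarchy condition as a predicate: anarchy requires at least one
# non-crash-faulty replica AND the total of non-crash-faulty, crash-
# faulty, and partitioned replicas to exceed t = floor((n-1)/2).

def in_anarchy(n, non_crash_faulty, crash_faulty, partitioned):
    """True iff the system of n replicas is in anarchy."""
    t = (n - 1) // 2                       # t <= (n-1)/2
    return (non_crash_faulty >= 1 and
            non_crash_faulty + crash_faulty + partitioned > t)

# n = 5 replicas tolerate t = 2 combined faults outside anarchy:
assert in_anarchy(5, non_crash_faulty=1, crash_faulty=1, partitioned=0) is False
assert in_anarchy(5, non_crash_faulty=1, crash_faulty=1, partitioned=1) is True
# Without any non-crash fault there is no anarchy, as in plain CFT:
assert in_anarchy(5, non_crash_faulty=0, crash_faulty=3, partitioned=0) is False
```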

28 XFT vs. CFT/BFT

29 The XFT Model: Use Cases Deployments where "extremely bad" system conditions are very rare and an adversary can't cause anarchy. Tolerating "accidental" non-crash faults when there is no active adversary. Wide-area networks and geo-replicated systems, where causing anarchy is too much work for an adversary. Blockchain, where the network is the bottleneck: high link latency and low bandwidth.

30 XPaxos Protocol: Overview Common-case protocol: replicates requests across replicas, with message patterns similar to Paxos plus digital signatures. View-change protocol: decentralized and leaderless. Fault-detection mechanism: its goal is to minimize the impact of long-lived non-crash faults.

31 XPaxos: Common Case Synchronous group of t + 1 active replicas: 1 primary + t followers; the remaining t replicas are passive. Commit logs maintain consistency across view changes; a retransmission timer guards each request.

32 XPaxos: Common Case, t = 1 The synchronous group has only 2 replicas, a case highly relevant in practice; passive replicas are brought up to date by lazy replication.

33 XPaxos: View Change A view is not changed unless there is a machine or network fault within the synchronous group, detected via a retransmission timer firing, an invalid signature, or a suspect message from another member of the synchronous group sg_i.

34 XPaxos: Fault Detection Optional; used against long-lived faults. Detects potential faults outside anarchy. What happens if a non-faulty replica loses some of its commit log? Replicas exchange prepare logs along with commit logs.

35 Evaluation: Setup XPaxos vs. Zyzzyva vs. PBFT vs. Paxos (WAN-optimized). Ping RTTs measured across 6 Amazon EC2 datacenters; globally distributed EC2 VMs with 8 vCPUs, 15 GB memory, and 2 x 80 GB SSDs. HMAC-SHA1 and RSA-1024 for authentication. A microbenchmark stresses the protocols with request/reply sizes of 1 kB/0 kB (denoted 1/0) and 4 kB/0 kB (denoted 4/0).

36 Performance: Fault-Free

37 Performance: With Faults 1/0 benchmark, t = 1, 2500 clients in CA; active replicas in CA and VA. At 180 s and 300 s, VA and CA are crashed, respectively. Timeout 2𝚫 = 2.5 s; view change completes in under 10 s.

38 Performance: Macro-benchmark (ZooKeeper) Modified ZooKeeper to replace Zab with XPaxos; t = 1, request size 1 kB, clients in CA. Better than Zab because the common case involves t + 1 rather than 2t + 1 replicas.

39 Discussion (NOPaxos) What is the cost of moving from a fixed replica group to a changing membership? What is the background cost of the synchronization protocol? Is the protocol scalable? Can the protocol be extended to span a bigger network than just a datacenter? Will the cost of replacing/programming hardware prove to be a deterring factor?

40 Discussion (XFT) Limited evaluation: is it scalable (only t = 1 was tested, within the US)? How does it fare on a less synchronous network? Is the estimate of 𝚫 reasonable? What is the security overhead? Questioning assumptions: it treats asynchronous networks as a fault case rather than as the normal case; is there evidence for the validity of weakening the network failure model? It assumes machine and network faults occur IID. Future work: comparison to hybrid fault models; does the view-change protocol improve the performance of other models; is it a good fit for blockchain?

41 Discussion Can XFT benefit from NOPaxos? Can the handling of non-crash failures be worked into a NOPaxos-style method, or is such a system required for handling adversarial cases?