15-440 Distributed Systems Review.

Author: Eunice Paul

1 Distributed Systems Review

2 Naming

3 Names
Names are associated with objects; they enable passing references to objects. Indirection defers the decision on a name's meaning/binding.
Examples: Registers -> R5; Memory -> 0xdeadbeef; Host names -> srini.com; User names -> sseshan; File name -> /usr/srini/foo.txt; URLs; Ethernet -> f8:e4:fb:bf:3d:a6

4 Domain Name System Goals
Basically a wide-area distributed database. Goals: scalability, decentralized maintenance, robustness, global scope (names mean the same thing everywhere). Doesn't need atomicity or strong consistency (ACID: Atomic, Consistent, Isolated, Durable).

5 DNS Records
RR format: (class, name, value, type, ttl). The DB contains tuples called resource records (RRs). Classes = Internet (IN), Chaosnet (CH), etc. Each class defines the value associated with a type. For the IN class:
Type=A: name is a hostname, value is its IP address.
Type=NS: name is a domain (e.g. foo.com), value is the name of the authoritative name server for this domain.
Type=CNAME: name is an alias for some "canonical" (the real) name, value is the canonical name.
Type=MX: value is the hostname of the mail server associated with name.

6 DNS Design: Zone Definitions
Zone = contiguous section of the name space, e.g. a complete tree, a single node, or a subtree. A zone has an associated set of name servers, which must store the list of names and tree links.
[Figure: name-space tree rooted at root with TLDs org, ca, net, edu, com, uk; the cmu.edu subtree (cs, ece, cmcl) illustrates a subtree zone, alongside single-node and complete-tree zones]

7 Physical Root Name Servers
Several root servers have multiple physical servers; packets are routed to the "nearest" server by the "anycast" protocol. 346 servers total.

8 Typical Resolution
[Figure: the client asks its local DNS server, which iteratively queries the root & edu DNS server (returns NS ns1.cmu.edu), the ns1.cmu.edu DNS server (returns NS ns1.cs.cmu.edu), and the ns1.cs.cmu.edu DNS server (returns A www=IPaddr)]

9 Subsequent Lookup Example
[Figure: for ftp.cs.cmu.edu, the local DNS server skips the root & edu and cmu.edu servers (their NS records are cached) and queries the cs.cmu.edu DNS server directly, which returns ftp=IPaddr]

10 Prefetching
Name servers can add additional data to a response; this is typically used for prefetching. CNAME/MX/NS records typically point to another host name, so responses include the address of the referred-to host in the "additional section".

11 Tracing Hierarchy (3 & 4)
Three servers handle CMU CS names; a server within CS is the "start of authority" (SOA) for this name.
unix> dig NS greatwhite.ics.cs.cmu.edu
;; AUTHORITY SECTION:
cs.cmu.edu IN NS AC-DDNS-2.NET.cs.cmu.edu.
cs.cmu.edu IN NS AC-DDNS-1.NET.cs.cmu.edu.
cs.cmu.edu IN NS AC-DDNS-3.NET.cs.cmu.edu.
unix> dig NS greatwhite.ics.cs.cmu.edu
;; AUTHORITY SECTION:
cs.cmu.edu IN SOA PLANISPHERE.FAC.cs.cmu.edu.

12 Hashing Two uses of hashing that are becoming wildly popular in distributed systems: Content-based naming Consistent Hashing of various forms

13 Consistent Hash
"View" = subset of all hash buckets that are visible. Desired features:
Balanced: in any one view, load is equal across buckets.
Smoothness: little impact on hash bucket contents when buckets are added/removed.
Spread: a small set of hash buckets may hold an object, regardless of views.
Load: across all views, the number of objects assigned to a hash bucket is small.

14 Consistent Hash – Example
Construction: assign each of C hash buckets to random points on a mod 2^n circle, where n = hash key size. Map each object to a random position on the circle; the hash of the object = the closest clockwise bucket.
Smoothness: addition of a bucket does not cause much movement of objects between existing buckets.
Spread & Load: only the small set of buckets that lie near the object may hold it.
Balance: no bucket is responsible for a large number of objects.
[Figure: circle with buckets at positions 4, 8, 12, 14]
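The construction above can be sketched in a few lines. This is a toy sketch: the bucket names, the 4-byte (mod 2^32) ring, and the SHA-256-based point function are illustrative assumptions, not from the slides.

```python
import hashlib

def point(key: str) -> int:
    """Map a string to a pseudo-random point on the mod 2^32 circle."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class ConsistentHash:
    def __init__(self, buckets):
        # each bucket is assigned a point on the circle
        self.ring = sorted((point("bucket:" + b), b) for b in buckets)

    def lookup(self, key: str) -> str:
        """Hash of object = closest clockwise bucket."""
        p = point(key)
        for pos, bucket in self.ring:
            if pos >= p:
                return bucket
        return self.ring[0][1]  # wrap around past the top of the circle

    def add(self, bucket: str):
        self.ring = sorted(self.ring + [(point("bucket:" + bucket), bucket)])
```

Smoothness falls out directly: adding a bucket only reassigns the keys that now land on the new bucket; everything else stays put.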

15 Name items by their hash
Imagine that your filesystem had a layer of indirection: pathname -> hash(data), and hash(data) -> list of blocks. For example:
/src/foo.c -> 0xfff32f2fa11d00f0
0xfff32f2fa11d00f0 -> [5623, 5624, 5625, 8993]
If there were two identical copies of foo.c on disk, we'd only have to store it once! The name of the second copy can be different.
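A toy sketch of this indirection (the class and method names are invented for illustration; a dict of bytes stands in for the block list):

```python
import hashlib

class DedupStore:
    """pathname -> hash(data) -> data: identical files are stored once."""
    def __init__(self):
        self.blocks = {}   # hash(data) -> data (stands in for the block list)
        self.names = {}    # pathname -> hash(data)

    def write(self, path: str, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        self.blocks[digest] = data   # identical content lands on the same key
        self.names[path] = digest    # only the name->hash mapping is new

    def read(self, path: str) -> bytes:
        return self.blocks[self.names[path]]
```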

16 Self-Certifying Names
Several p2p systems operate something like this: search for "national anthem" and find a particular file name (starspangled.mp3); identify files by the hash of their content (0x2fab4f001...); request to download a file whose hash matches the one you want. The advantage? You can verify what you got, even if you got it from an untrusted source (like some dude on a p2p network).

17 Hash functions
Given a universe of possible objects U, map N objects from U to an M-bit hash. Typically |U| >> 2^M, which means there can be collisions: multiple objects map to the same M-bit representation. The likelihood of collision depends on the hash function, M, and N. Birthday paradox: roughly 50% chance of a collision with 2^(M/2) objects for a well-designed hash function.
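The birthday bound can be checked with the standard approximation p ≈ 1 - e^(-N(N-1)/2^(M+1)); the helper name is ours.

```python
import math

def collision_prob(m_bits: int, n_objects: int) -> float:
    """Approximate probability that n objects hashed to m bits collide."""
    return 1.0 - math.exp(-n_objects * (n_objects - 1) / 2 ** (m_bits + 1))
```

With N = 2^(M/2) objects this evaluates to 1 - e^(-1/2) ≈ 0.39, i.e. "roughly 50%" in the slide's sense, while far fewer objects make collisions negligible.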

18 Desirable Properties (Cryptographic Hashes)
Compression: maps a variable-length input to a fixed-length output.
Ease of computation: a relative metric...
Pre-image resistance: given a hash value h, it should be difficult to find any message m such that h = hash(m).
2nd pre-image resistance: given an input m1, it should be difficult to find a different input m2 such that hash(m1) = hash(m2).
Collision resistance: difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2).

19 Popular Systems

20 Content Distribution Networks (CDNs)
The content providers are the CDN customers. Content replication: the CDN company installs hundreds of CDN servers throughout the Internet, close to users. The CDN replicates its customers' content in CDN servers; when a provider updates content, the CDN updates its servers.
[Figure: origin server in North America, a CDN distribution node, and CDN servers in S. America, Asia, and Europe]

21 Server Selection
Which server?
Lowest load: to balance load on servers.
Best performance: to improve client performance. Based on geography? RTT? Throughput? Load?
Any alive node: to provide fault tolerance.
How to direct clients to a particular server?
As part of routing: anycast, cluster load balancing.
As part of application (not covered): HTTP redirect.
As part of naming: DNS.

22 Naming Based
Client does a name lookup for the service; the name server chooses an appropriate server address, and the A-record returned is the "best" one for the client. What information can the name server base its decision on?
Server load/location: must be collected.
Information in the name lookup request: the name service client is typically the local name server for the client.

23 How Akamai Works
Clients delegate a domain to Akamai: ibm.com IN NS usw2.akam.net. CNAME records eventually lead to something like e2874.x.akamaiedge.net. for IBM, or a1686.q.akamai.net for IKEA. The client is thus forced to resolve an eXYZ.x.akamaiedge.net. hostname.

24 How Akamai Works
The root server gives an NS record for akamai.net. The akamai.net name server returns an NS record for x.akamaiedge.net; this name server is chosen to be in the region of the client's name server, and the TTL is large. The x.akamaiedge.net name server then chooses a server in the region. It should try to choose a server that has the file in cache. How? It uses the eXYZ name and consistent hashing. The TTL is small: why?

25 How Akamai Works
[Figure, steps 1-10: the end-user requests foo.jpg from cnn.com (the content provider); resolution goes through the DNS root server, an Akamai high-level DNS server, and an Akamai low-level DNS server, which returns a nearby matching Akamai server; the end-user then issues "Get /cnn.com/foo.jpg" to that Akamai server]

26 Akamai – Subsequent Requests
[Figure: assuming no timeout on the NS record, subsequent requests skip the root and high-level DNS servers; the end-user's resolver goes straight to the Akamai low-level DNS server (steps 7, 8), then fetches /cnn.com/foo.jpg from the nearby matching Akamai server (steps 9, 10)]

27 Consistent Hashing Reminder…
Finding a nearby server for an object in a CDN uses centralized knowledge. Consistent hashing can also be used in a distributed setting. Consistent hashing to the rescue.

28 Peer-to-Peer Networks
Typically each member stores/provides access to content; basically a replication system for files. There is always a tradeoff between the possible locations of files and searching difficulty. Peer-to-peer allows files to be anywhere, so searching is the challenge, and a dynamic member list makes it more difficult.

29 The Lookup Problem
[Figure: a publisher inserts (key="title", value=MP3 data…) somewhere among 1000s of nodes N1…N6, whose membership may change; a client must resolve lookup("title") across the Internet]

30 Searching
Needles vs. haystacks: searching for a top-40 hit, or an obscure punk track from 1981 that nobody's heard of?
Search expressiveness: whole word? Regular expressions? File names? Attributes? Whole-text search? (e.g., p2p Gnutella vs. p2p Google?)

31 Framework
Common primitives:
Join: how do I begin participating?
Publish: how do I advertise my file?
Search: how do I find a file?
Fetch: how do I retrieve a file?

32 Napster: Overview
Centralized database:
Join: on startup, the client contacts the central server.
Publish: the client reports its list of files to the central server.
Search: query the server => it returns someone that stores the requested file.
Fetch: get the file directly from that peer.

33 “Old” Gnutella: Overview
Query flooding:
Join: on startup, the client contacts a few other nodes; these become its “neighbors”.
Publish: no need.
Search: ask neighbors, who ask their neighbors, and so on... when/if found, reply to the sender. A TTL limits propagation.
Fetch: get the file directly from the peer.

34 Gnutella: Search
[Figure: the query "Where is file A?" floods from neighbor to neighbor; the node that has file A sends a reply back]

35 BitTorrent: Overview
Swarming:
Join: contact a centralized “tracker” server, get a list of peers.
Publish: run a tracker server.
Search: out-of-band. E.g., use Google to find a tracker for the file you want.
Fetch: download chunks of the file from your peers; upload chunks you have to them.
Big differences from Napster: chunk-based downloading, a “few large files” focus, and anti-freeloading mechanisms.

36 BitTorrent: Fetch

37 BitTorrent: Sharing Strategy
Employ a “tit-for-tat” sharing strategy: A is downloading from some other people, and A will let the fastest N of those download from him. Be optimistic: occasionally let freeloaders download; otherwise no one would ever start! This also lets you discover better peers to download from when they reciprocate.
Goal: Pareto efficiency. Game theory: “no change can make anyone better off without making others worse off.” Does it work? (Not perfectly, but perhaps good enough?)

38 DHT: Overview (1)
Goal: make sure that an identified item (file) is always found in a reasonable number of steps.
Abstraction: a distributed hash table (DHT) data structure: insert(id, item); item = query(id). Note: the item can be anything: a data object, document, file, pointer to a file…
Implementation: nodes in the system form a distributed data structure: a ring, tree, hypercube, skip list, butterfly network, ...

39 Routing: Chord Examples
Nodes: n1 (id 1), n2 (id 3), n3 (id 0), n4 (id 6). Items: f1 (id 7), f2 (id 2).
[Figure: 8-position identifier circle with each node's successor table (entries id+2^i -> succ) and the items stored at their successor nodes]

40 Routing: Query
Upon receiving a query for item id, a node checks whether it stores the item locally. If not, it forwards the query to the largest node in its successor table that does not exceed id.
[Figure: query(7) issued at node 1 is routed around the circle via successor-table entries to node 0, which holds item 7]
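The routing rule can be sketched on the slide's example ring (m = 3, nodes at ids 0, 1, 3, 6, items at 7 and 2). This is a simplified model of the lookup step, not the full Chord protocol:

```python
# Minimal Chord-style lookup on a 2^3-position identifier circle.
M = 3
RING = 2 ** M
NODES = [0, 1, 3, 6]          # must be sorted

def successor(ident: int) -> int:
    """The node that stores ident: the first node at or clockwise after it."""
    for n in NODES:
        if n >= ident % RING:
            return n
    return NODES[0]            # wrap around the circle

def fingers(node: int):
    """Successor table: entry i points at successor(node + 2^i)."""
    return [successor(node + 2 ** i) for i in range(M)]

def lookup(node: int, ident: int):
    """Route a query toward ident; returns (owning node, hop count)."""
    hops = 0
    while True:
        owner = successor(ident)
        if node == owner:
            return node, hops
        table = fingers(node)
        if owner in table:
            return owner, hops + 1      # final hop straight to the owner
        # otherwise forward to the farthest table entry not passing ident
        dist = lambda f: (f - node) % RING
        node = max((f for f in table if 0 < dist(f) <= (ident - node) % RING),
                   key=dist)
        hops += 1
```

On this ring, query(7) issued at node 1 hops via node 6 to node 0, which stores item 7.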

41 DHT: Discussion
Pros: guaranteed lookup; O(log N) per-node state and search scope.
Cons: supporting non-exact-match search is hard.


43 Data Intensive Computing + MapReduce/Hadoop
Lecture 19, November 8th 2016
Topics: large-scale computing; traditional high-performance computing (HPC); cluster computing; MapReduce (definition, examples, implementation); alternatives to MapReduce; properties.


50 MapReduce Example
Map: generate (word, count) pairs for all words in a document. Reduce: sum word counts across documents.
[Figure: the documents "Come, Dick", "Come, come.", "Come and see Spot.", "Come and see.", "Come and see." are mapped to word-count pairs (dick 1, come 2, see 1, ...) which are then reduced to totals such as and 3 and come 6]
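The word-count example fits in a few lines of Python. This is a sequential sketch of the Map and Reduce phases; the real system shards both across machines:

```python
from collections import defaultdict

DOCS = ["Come, Dick", "Come, come.", "Come and see Spot.",
        "Come and see.", "Come and see."]

def map_doc(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(w.strip(".,").lower(), 1) for w in doc.split()]

def reduce_pairs(pairs):
    """Shuffle pairs by key, then Reduce: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

counts = reduce_pairs(p for doc in DOCS for p in map_doc(doc))
```

This reproduces the slide's totals: come appears 6 times, and 3 times, dick and spot once each.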


54 MapReduce Execution (authors' view)
Task manager == the one "primary" node in the MapReduce paper. Only one is needed, since if it crashes the system simply chooses another primary node (this primary node can of course checkpoint, etc., for failure recovery). From the paper: in MapReduce the file chunks/blocks are usually 16MB to 64MB.


61 Distributed Systems Lecture 20: Virtual Machines
Tuesday Nov 16th, 2015
Some slides based on material from: Ken (Cornell, CS6410); Eyal (UToronto, ECE1799); JP Singh (Princeton, COS 318); Alex (UCSD, CSE 120); OS Concepts (Silberschatz, Galvin, Gagne 2013)

62 What is a Virtual Machine?
Software abstraction: behaves like hardware; encapsulates all OS and application state.
Virtualization layer: extra level of indirection; decouples hardware and OS; enforces isolation; multiplexes physical hardware across VMs.
Virtualization can be defined many ways. Loosely, virtualization is the addition of a software layer (the virtual machine monitor) between the hardware and the existing software that exports an interface at the same level as the underlying hardware. In the strictest case, the exported interface is exactly the same as the underlying hardware, and the virtual machine monitor provides no functionality except multiplexing the hardware among multiple VMs; this was largely the case in the old IBM VM/360 systems. However, the layer really can export a different hardware interface, as in cross-ISA emulators, and it can provide additional functionality not present in the operating system. You can think of virtualization as the addition of a layer of software that can run the original software with little or no changes. Adapted from: Eyal DeLara

63 Virtualization Properties, Features
Isolation: fault isolation, performance isolation (+ software isolation, …).
Encapsulation: cleanly captures all VM state; enables VM snapshots and clones.
Portability: independent of physical hardware; enables migration of live, running VMs (freeze, suspend, …); clone VMs easily, make copies.
Interposition: transformations on instructions, memory, I/O; enables transparent resource overcommitment, encryption, compression, replication, …
Virtualization has three main properties that give rise to all its applications.
Isolation. Isolation is key for many applications and comes in several flavors. Fault isolation: if one virtual machine contains a buggy operating system, that OS can start scribbling all over physical memory; these wild writes must be contained within the VM boundaries. Performance isolation: ideally a VM's performance would be independent of the activity going on on the hardware; this must be accomplished by smart scheduling and resource-allocation policies in the monitor. Software isolation: many of the issues with computers today are complex software configurations (DLL hell on PCs, operating system and library versions, viruses, and other security threats); VMs are naturally isolated from each other by running in separate software environments.
Encapsulation. Encapsulation is the property that all VM state can be described and recorded simply. The VM state is basically the dynamic memory, the static memory, and the register state of the CPU and devices. These items typically have a simple layout and are easy to describe. We can checkpoint a VM by writing out these items to a few files; the VM can then be moved or copied by moving those files around. You can think of this as similar to doing a backup at the block level vs. doing a backup by recording all the packages, configuration, and data files that make up a file system.
Interposition. At some level, all access to the hardware passes through the virtual machine monitor first. This gives the monitor a chance to operate on these accesses. The best example is encrypting all data written to a disk; the advantage is that this happens without the knowledge of the OS.
Why not in the OS? This brings up a good point: why not do these things in the OS? By splitting up the system this way, the OS functions more like a large application library, and the VMM functions more like a smart set of device drivers. This is a nice split and can simplify overall system design; it also provides a natural administration boundary. However, the monitor is often at a disadvantage because it does not have the same insight into what's happening as the OS does. For example, the OS knows the distinction between data and metadata when implementing an encrypted file system. So there is a tradeoff there.
Expanded capabilities: use one machine to support multiple OSes simultaneously. E.g., the Parallels program makes it possible to run Windows applications on a Macintosh. It does this by creating a VMM with MacOS as host and Windows (or Linux) as guest. This can be done efficiently partly because both MacOS and Windows run on x86, so most of the Windows and application code can be executed directly.
Resource sharing: schedule many different programs onto a collection of machines while providing a high degree of isolation between them. Example: Amazon Web Services. Companies can rent virtualized machines by the hour and use them to run large data centers. The advantage of renting vs. buying: the number of machines can change dynamically with load, and others deal with system administration. This is much cheaper than buying and operating enough machines to meet peak load: rely on averages, and the cost per CPU is less than the purchase-and-operate cost. (This was the original motivation for virtualization: IBM had big mainframe computers and wanted to have multiple clients, each using the OS of its choice, on a single machine.)
Security. Example: the Autolab system used by multiple courses for student grading runs each user program within a virtual machine that allows incoming network connections but no outgoing ones. Virtualization makes it impossible for a student program to modify any files or initiate network connections.
Flexibility: can migrate a complete "system" from one machine to another, including the OS and all applications running on it.
Testing/debugging. E.g., Jiri Simsa's dBug program runs many tests on a threaded program to look for concurrency bugs. Each time, it reloads the image of the machine's starting point and then traces through a new test sequence. The VM provides a simple way to checkpoint the entire machine state and restart from the saved state.
Adapted from: Eyal DeLara

64 Types of Virtualization
Process virtualization (Figure [a]): language-level (Java, .NET, Smalltalk); OS-level (processes, Solaris Zones, BSD Jails, Docker containers); cross-ISA emulation (Apple 68K-PPC-x86).
System virtualization (Figure [b]): VMware Workstation, Microsoft VPC, Parallels; VMware ESX, Xen, Microsoft Hyper-V.
Virtual systems abstract physical components/details using logical objects and dynamically bind logical objects to physical configurations.
We will mostly be focusing on "system virtualization", but it is useful to mention the other forms of virtualization you are familiar with. Adapted from: Eyal DeLara

65 Types of System Virtualization
Native/bare metal (Type 1): higher performance; ESX, Xen, Hyper-V.
Hosted (Type 2): easier to install; leverages the host's device drivers; VMware Workstation, Parallels.
Adapted from: Eyal DeLara

66 Types of Virtualization
Full virtualization (e.g. VMware ESX): unmodified OS; virtualization is transparent to the OS; the VM looks exactly like a physical machine.
Paravirtualization (e.g. Xen): the OS is modified to be virtualized; better performance at the cost of transparency.
Adapted from: Eyal DeLara

67 VMM Implementation Goals
Should efficiently virtualize the hardware: provide the illusion of multiple machines while retaining control of the physical machine. Which subsystems should be virtualized? Processor => processor virtualization; memory => memory virtualization; I/O devices => I/O virtualization.

68 Processor Virtualization
Direct execution. Popek and Goldberg (1974):
- Sensitive instructions: only executed in kernel mode
- Privileged instructions: trap when run in user mode
- A CPU architecture is virtualizable only if the sensitive instructions are a subset of the privileged instructions
When the guest OS runs a sensitive instruction, it must trap to the VMM so the VMM maintains control. An architecture is classically/strictly virtualizable if all its sensitive instructions (those that violate safety and encapsulation) are a subset of the privileged instructions: all instructions either trap or execute identically, and instructions that access privileged state trap.
Adapted from: Eyal DeLara

69 System Call Example
Privilege levels: generally VMM > guest OS > application. Run the guest operating system deprivileged: all privileged instructions trap into the VMM, which emulates the instructions against virtual state (e.g. disables virtual interrupts, not physical interrupts) and then resumes direct execution from the next guest instruction.
Adapted from: JP Singh, Princeton

70 Virtualizing Memory
The OS assumes it has full control over memory. Management: it assumes it owns it all. Mapping: it assumes it can map any virtual page to any physical page. However, the VMM partitions memory among VMs: the VMM needs to assign hardware pages to VMs and to control mappings for isolation; it cannot allow the OS to map any virtual page to any hardware page. Hardware-managed TLBs make this difficult: on TLB misses, the hardware walks the page tables in memory, so the VMM needs to control access by the OS to the page tables. Thus the OS can only map to hardware pages given by the VMM.
Adapted from: Alex Snoeren

71 x86 Memory Management Primer
The processor operates with virtual addresses; physical memory operates with physical addresses. x86 includes a hardware translation lookaside buffer (TLB) that maps virtual to physical page addresses. x86 handles TLB misses in hardware: the hardware walks the page tables and inserts the virtual-to-physical mapping; CR3 points to the page table root.

72 Shadow Page Tables
Three abstractions of memory:
Machine: actual hardware memory (e.g. 2GB of DRAM).
Physical: abstraction of hardware memory, OS-managed. E.g. the VMM allocates 512 MB to a VM, so the OS thinks the computer has 512 MB of contiguous physical memory (the underlying machine memory may be discontiguous).
Virtual: the virtual address space, a standard 2^32-byte address space.
In each VM, the OS creates and manages page tables for its virtual address spaces without modification, but these page tables are not used by the MMU (the hardware memory management unit).

73 Three Abstractions of Memory
[Figure: native vs. virtualized address translation]
In a traditional system there are typically two address spaces: the virtual address space (VAS) and the physical address space (PAS). The OS and user processes run in the VAS. The OS manages the mapping from VAS to PAS through the memory management unit (MMU) provided in the processor: it maintains a page table that maps each page in the current VAS to a page in the PAS, typically one page table per user-level process.

74 Memory Management / Allocation
VMMs tend to have simple memory management. Static policy: a VM gets 8GB at start; dynamic adjustment is hard since the OS cannot handle it, and there is no swapping to disk.
More sophistication: overcommit with ballooning. A balloon driver runs inside the OS and consumes hardware pages; the balloon grows or shrinks (giving memory back to other VMs).
Even more sophistication: memory de-duplication; identify pages that are shared across VMs!

75 I/O Virtualization
Challenge: lots of I/O devices. Problem: writing device drivers for all I/O devices in the VMM layer is not a feasible option. Insight: device drivers are already written for popular operating systems. Solution: present virtual I/O devices to guest VMs and channel I/O requests to a trusted host VM (running a popular OS).

76 VM Storage Management
The VMM provides both the boot disk and other storage. A Type 1 VMM stores guest root disks and config information within a file system provided by the VMM, as a disk image; a Type 2 VMM stores them as files in the host file system/OS. Examples of supported operations: duplicate the file (clone) -> create a new guest; move the file to another system -> move the guest; convert formats: physical-to-virtual (P-to-V, convert native disk blocks into the VMM format) and virtual-to-physical (V-to-P, convert from the virtual format back to the native disk format). The VMM also usually provides access to network-attached storage (just networking) => live migration.

77 Summary
Introduction to virtualization: why do it? Different types of virtualization: container-based, language-based, Type I/II, paravirtualized. VMM techniques to virtualize the CPU, memory, I/O devices, etc. Example: Xen, an open-source VMM.

78 Reliability

79 Measuring Availability
Mean time to failure (MTTF); mean time to repair (MTTR); MTBF = MTTF + MTTR. Availability = MTTF / (MTTF + MTTR).
Suppose the OS crashes once per month and takes 10 min to reboot. MTTF = 720 hours = 43,200 minutes; MTTR = 10 minutes. Availability = 43,200 / 43,210 ≈ 99.98% (~"3 nines").
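The arithmetic above, as a tiny sanity-check sketch:

```python
def availability(mttf_min: float, mttr_min: float) -> float:
    """Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf_min / (mttf_min + mttr_min)

# OS crashes once per month (MTTF = 30 days = 43,200 min), 10-minute reboot
a = availability(30 * 24 * 60, 10)
```

This gives about 0.99977, i.e. roughly "three nines" of availability.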

80 Disk failure conditional probability distribution: the bathtub curve
[Figure: failure rate over time: high "infant mortality" early, a flat region at 1/(reported MTTF) during the expected operating lifetime, then rising "burn out"]

81 What are our options?
Silently return the wrong answer. Detect the failure. Correct / mask the failure.

82 Parity Checking Single Bit Parity: Detect single bit errors

83 Error Detection - Checksum
Used by TCP, UDP, IP, etc. The ones'-complement sum of all words/shorts/bytes in the packet. Simple to implement, but relatively weak detection: easily tricked by typical error patterns, e.g. offsetting bit flips.
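A sketch of the 16-bit ones'-complement (Internet-style) checksum, including the weakness noted above: swapping two 16-bit words leaves the sum, and hence the checksum, unchanged. The function name is ours.

```python
def internet_checksum(data: bytes) -> int:
    """Ones'-complement sum of 16-bit words, as used by TCP/UDP/IP."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)     # end-around carry
    return ~total & 0xFFFF                           # ones' complement of the sum
```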

84 Error Detection – CRC
View the data bits, D, as a binary number. Choose an (r+1)-bit pattern (the generator), G. Goal: choose r CRC bits, R, such that <D,R> is exactly divisible by G (modulo 2). The receiver knows G and divides <D,R> by G; if there is a non-zero remainder, an error is detected! CRCs can detect all burst errors of fewer than r+1 bits and are widely used in practice.
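Modulo-2 long division is short enough to sketch directly; the function names and the 4-bit generator used in the example are illustrative, not from the slides.

```python
def crc_bits(data: str, generator: str) -> str:
    """Compute the r CRC bits R for bit-string D so that <D,R> is exactly
    divisible by the (r+1)-bit generator G under modulo-2 arithmetic."""
    r = len(generator) - 1
    bits = [int(b) for b in data + "0" * r]      # append r zero bits to D
    gen = [int(b) for b in generator]
    for i in range(len(data)):
        if bits[i]:                              # leading bit set: XOR G in
            for j, g in enumerate(gen):
                bits[i + j] ^= g
    return "".join(map(str, bits[-r:]))          # the remainder is R

def crc_check(codeword: str, generator: str) -> bool:
    """Receiver side: a zero remainder when dividing <D,R> by G means no
    error was detected."""
    r = len(generator) - 1
    return crc_bits(codeword[:-r], generator) == codeword[-r:]
```

For example, with D = 101110 and G = 1001, the computed CRC is R = 011, and flipping any data bit of the codeword makes the check fail.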

85 Error Recovery
Two forms of error recovery: redundancy and retry. Redundancy: error correcting codes (ECC), replication/voting. ECC keeps encoded redundant data to help repair losses; forward error correction (FEC) sends these bits in advance, reducing the latency of recovery at the cost of bandwidth.

86 Error Recovery – Error Correcting Codes (ECC)
Two-dimensional bit parity: detect and correct single-bit errors.
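A sketch of two-dimensional even parity over a small bit matrix (the helper names are ours): the row and column whose parity checks fail pinpoint a single flipped bit, which can then be corrected by flipping it back.

```python
def with_parity(rows):
    """Add an even-parity bit per row and per column (plus the corner bit)."""
    out = [r + [sum(r) % 2] for r in rows]
    out.append([sum(col) % 2 for col in zip(*out)])
    return out

def locate_error(block):
    """Return (row, col) of a single flipped bit, or None if parity holds."""
    bad_rows = [i for i, r in enumerate(block) if sum(r) % 2]
    bad_cols = [j for j, c in enumerate(zip(*block)) if sum(c) % 2]
    if bad_rows and bad_cols:
        return bad_rows[0], bad_cols[0]
    return None
```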

87 Retry – Network Example
Sometimes errors are transient. We need an error detection mechanism (e.g., timeout, parity, checksum), but no majority vote is needed.
[Figure: client sends an RPC to the server, times out, and retries until an ACK arrives]

88 One key question
How correlated are failures? Can you assume independence? If the failure probability of a computer in a rack is p, what is P(computer 2 failing | computer 1 failed)? Maybe it's p... or maybe they're both plugged into the same UPS... Why is this important? Correlation reduces the value of redundancy.

89 Replacement Rates
[Table: percentage of replacements by component for three clusters (HPC1, COM1, COM2); the columns are flattened in the source, but hard drives (~30-50%), power supplies, and memory dominate, with motherboards, CPUs, fans, RAID cards and controllers, SCSI cables and boards, NICs, and other components making up the rest]

90 Motivation: Why use multiple disks?
Capacity: more disks allow us to store more data.
Performance: access multiple disks in parallel; each disk can be working on an independent read or write; overlap seek and rotational positioning time for all.
Reliability: recover from disk (or single-sector) failures; we will need to store multiple copies of data to recover.
So, what is the simplest arrangement?

91 Just a bunch of disks (JBOD)
Yes, it's a goofy name, and industry really does sell "JBOD enclosures".

92 Disk Striping
Interleave data across multiple disks. Large file streaming can enjoy parallel transfers, and high-throughput request loads can enjoy thorough load balancing, if blocks of hot files are equally likely on all disks (really?).

93 Redundancy via replicas
Two (or more) copies: mirroring, shadowing, duplexing, etc. Write both, read either.

94 Simplest approach: Parity Disk
Capacity: one extra disk needed per stripe.

95 Updating and using the parity

96 Solution: Striping the Parity
Removes the parity disk bottleneck.

97 RAID-5: Rotated Parity
Rotate the location of parity across all disks. Still requires 4 I/Os per small write, but they are not always to the same parity disk.
[Figure: data blocks 0-11 and parity blocks P0-P3 rotated across five disks; capacity, reliability, and read/write performance are compared in lecture]
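The four I/Os of a small write are: read old data, read old parity, write new data, write new parity. The parity update itself is just XOR; this is a sketch and the helper name is ours.

```python
def new_parity(old_data: int, old_parity: int, new_data: int) -> int:
    """RAID small-write parity update: P' = P xor D_old xor D_new."""
    return old_parity ^ old_data ^ new_data
```

After the update, the parity block still equals the XOR of the whole stripe, so any single lost block remains reconstructable.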

98 Comparison
Key takeaways: writes are expensive, and small writes are really expensive! File systems may help (see LFS).
[Table: capacity/performance comparison across schemes elided]

99 How often are failures?
MTBF (Mean Time Between Failures): MTBF_disk ~ 1,200,000 hours (~136 years, <1% per year). For a multi-disk system, MTBF = mean time to the first disk failure, which is MTBF_disk / (number of disks). For a striped array of 200 drives: MTBF_array = 136 years / 200 drives = 0.68 years.

100 Reliability without rebuild
200 data drives with MTBF_drive: MTTDL_array = MTBF_drive / 200.
Add 200 drives and do mirroring: MTBF_pair = (MTBF_drive / 2) + MTBF_drive = 1.5 * MTBF_drive; MTTDL_array = MTBF_pair / 200 = MTBF_drive / 133.
Add 50 drives, each with parity across 4 data disks: MTBF_set = (MTBF_drive / 5) + (MTBF_drive / 4) = 0.45 * MTBF_drive; MTTDL_array = MTBF_set / 50 = MTBF_drive / 111 (approximate; see note).
Note: the last step approximates an array as a single reliable "virtual drive". This doesn't quite work in some cases. The key issue (from the original RAID paper, http://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf): "The second step is the reliability of the whole system, which is approximately (since the MTTF_group is not distributed exponentially) MTTF_group/num_groups". The problem is that this estimate breaks down dramatically at large numbers. For example: what is the mean time to data loss for a system with 100,000 disks, organized into 10,000 10-disk arrays using data striping and striped parity (i.e., RAID 5)? Assume each disk has an MTBF of 100 years. MTBF_array = (100 years / 10) + (100 years / 9) = 21 years; MTTDL = MTBF_array / 10,000 = 19 hrs. But the MTBF for a single drive anywhere in the entire system is 100 years / 100,000 drives ≈ 8.9 hours, so the MTBF for three drives is 3 x 8.9 hrs = 26.7 hrs, which is more than MTBF_array. Yet the system shouldn't lose data before three drives fail. Why? Because MTBF_array / num_arrays actually calculates E[time first drive fails in any array] + E[time second drive fails in any array | first drive failed in every array], whereas what you really want is E[time first drive fails in any array] + E[time second drive fails in any array | first drive failed in a single array], and the second terms aren't the same.

101 Reliability consequences of adding rebuild
No data loss, if rebuild is fast enough, i.e. if the first failure is fixed before the second one happens. The new math is: MTTDL_array = MTBF_firstdrive * (1 / prob. of a 2nd failure before repair), where the probability of a 2nd failure before repair is MTTR_drive / MTBF_seconddrive.
For mirroring: MTBF_pair = (MTBF_drive / 2) * (MTBF_drive / MTTR_drive).
For 5-disk parity-protected arrays: MTBF_set = (MTBF_drive / 5) * ((MTBF_drive / 4) / MTTR_drive).
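Plugging in numbers (a sketch: the 1,200,000-hour drive MTBF is from the earlier slide, while the 24-hour MTTR is an assumed value for illustration):

```python
def mttdl_mirrored_pair(mtbf_drive_h: float, mttr_h: float) -> float:
    """MTBF_pair = (MTBF_drive / 2) * (MTBF_drive / MTTR): expected time to a
    first failure, divided by the chance the partner fails before repair."""
    return (mtbf_drive_h / 2) * (mtbf_drive_h / mttr_h)

def mttdl_parity_set(mtbf_drive_h: float, mttr_h: float, disks: int = 5) -> float:
    """MTBF_set = (MTBF_drive / disks) * ((MTBF_drive / (disks - 1)) / MTTR)."""
    return (mtbf_drive_h / disks) * ((mtbf_drive_h / (disks - 1)) / mttr_h)
```

With MTBF = 1,200,000 h and MTTR = 24 h, a mirrored pair survives about 3e10 hours before data loss, an order of magnitude longer than a 5-disk parity set.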

102 Three modes of operation
Normal mode: everything working; maximum efficiency.
Degraded mode: some disk unavailable; must use degraded-mode operations.
Rebuild mode: reconstructing the lost disk's contents onto a spare; degraded-mode operations plus competition with the rebuild.

103 Byzantine Fault Tolerance
Terminology & background; Byzantine fault tolerance (Lamport); asynchronous BFT (Liskov).

104 Agreement in Faulty Systems - 5
A system of N processes, where each process i will provide a value vi to each other process. Some number of these processes may be incorrect (or malicious). Goal: each process learns the true values sent by each of the correct processes.
[Figure: the Byzantine agreement problem for three nonfaulty processes and one faulty process]

105 Paxos under Byzantine faults
[Figure: a faulty node sends "accept vid=1, myn=N1:1, val=abc" and receives an OK while N0 decides Vid1=xyz and N1 decides Vid1=abc (nh=N0:1 vs nh=N1:1): an agreement conflict!]

106 BFT requires a 2f+1 quorum out of 3f+1 nodes
[Figure: four server states; clients issue "write A" and a quorum of servers (one crashed, marked X) accepts it]
For liveness, the quorum size must be at most N - f.

107 BFT Quorums
[Figure: a second set of clients issues "write B"; with one faulty server, the write-A and write-B quorums overlap]
For correctness, any two quorums must intersect in at least one honest node: (N-f) + (N-f) - N >= f+1, so N >= 3f+1.
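Both quorum conditions can be checked mechanically (a sketch; the function name is ours):

```python
def quorums_ok(n: int, f: int) -> bool:
    """With quorum size q = n - f:
    liveness: a quorum exists among the n - f honest/responsive nodes, and
    safety: any two quorums intersect in at least f + 1 nodes, guaranteeing
    at least one honest node in the intersection."""
    q = n - f
    overlap = 2 * q - n          # minimum intersection of two quorums
    return q <= n - f and overlap >= f + 1
```

The check succeeds exactly when n >= 3f + 1, matching the slide's bound.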

108 Distributed systems: FLP impossibility
Asynchronous consensus may not terminate. Sketch of proof: the system starts in a "bivalent" state (it may decide 0 or 1). At some point, the system is one message away from deciding on 0 or 1; if that message is delayed, another message may move the system away from deciding. This holds even when servers can only crash (not Byzantine)! Hence, the protocol cannot always be live (but there exist randomized BFT variants that are probabilistically live). [See Fischer, M. J., Lynch, N. A., and Paterson, M. S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985).] In the system Fischer, Lynch, and Paterson studied, messages were unordered, communication was unbounded, and processors were asynchronous.

109 Synchronous, Byzantine world
[Figure: protocols classified along two axes: synchronous vs. asynchronous, and fail-stop vs. Byzantine]

110 Byzantine Generals Problem Example - 1
Phase 1: generals announce their troop strengths to each other.
[Figure: P1 sends its value (1) to P2, P3, and P4]

111 Byzantine Generals Problem Example - 4
Phase 2: each general constructs a vector with all troop strengths.
[Figure: P1 holds (1, 2, x, 4), P2 holds (1, 2, y, 4), and P4 holds (1, 2, z, 4), where x, y, z are the inconsistent values received from the faulty P3]

112 Byzantine Generals Problem Example - 5
Phases 3 and 4: generals send their vectors to each other and compute a majority vote.
[Figure: combining the received vectors (a, b, c, d), (e, f, g, h), and (h, i, j, k) with its own, each correct general majority-votes each entry and arrives at (1, 2, ?, 4): agreement on the correct values, with the faulty P3's entry left undetermined]

113 Practical Byzantine Fault Tolerance
[Figure: PBFT targets the asynchronous, Byzantine quadrant, in contrast to fail-stop settings]

114 PBFT ideas
PBFT, "Practical Byzantine Fault Tolerance", M. Castro and B. Liskov, SOSP 1999. Replicate the service across many nodes. Assumption: only a small fraction of the nodes are Byzantine. Rely on a super-majority of votes to decide on the correct computation. PBFT makes some weak synchrony (message delay) assumptions to ensure liveness; it would violate the FLP impossibility otherwise. PBFT property: tolerates <= f failures using an RSM with 3f+1 replicas.

115 PBFT main ideas
Static configuration (the same 3f+1 nodes). To deal with a malicious primary: use a 3-phase protocol to agree on sequence numbers. To deal with loss of agreement: use a bigger quorum (2f+1 out of 3f+1 nodes). Communications need to be authenticated.

116 PBFT
[Figure: message flow from client through primary and replicas across the Request, Pre-Prepare, Prepare, Commit, and Reply phases; a replica waits for 2f+1 matching Prepare messages and 2f+1 matching Commit messages, and the client waits for f+1 matching replies]

117 Security