1 Distributed Systems http://net.pku.edu.cn/~course/cs501/2013 Zhi Yang, School of EECS, Peking University
4 Load balance
5 Exchange server load. Everest: write off-loading for I/O peaks
6 Write off-loading [diagram: with no off-loading, reads and writes go to the volume; when off-loading, the Everest client redirects writes to Everest stores while reads still hit the volume; later the client reclaims the off-loaded writes]
7 Reclaiming in the background [diagram: the Everest client issues "read any" to the Everest stores, writes the reclaimed data back to the volume, then calls delete(block range, version) on the stores]
8 Scaling with #stores [diagram: an OLTP client drives the SQL Server binary; the Everest client is interposed via Detours DLL redirection and spreads data and log over the LAN across multiple stores] Dushyanth Narayanan
9 Circular on-disk log [diagram: a circular on-disk log with HEAD and TAIL pointers, showing how write, reclaim and spin-up steps move the head and tail]
10 Why not always off-load? [diagram: two OLTP clients drive SQL Server 1 and SQL Server 2 (with an Everest client); reads and writes from both servers go to the same shared stores]
12 Architectural styles: Layered style. A key observation is that control generally flows from layer to layer: requests go down the hierarchy, whereas the results flow upward. Observation: the layered style is used for client-server systems. 2.1 Architectural styles
13 Another Description of 3-Tier Architecture. 2.2 System architecture
14 Physical 3-Tiered architecture
15 [diagram: alternative client-server organizations (a)-(e)]
16 Horizontally distributed servers may talk to each other.
17 E.g., Internet Search Engine. As a first example, consider an Internet search engine. Ignoring all the animated banners, images, and other fancy window dressing, the user interface of a search engine is very simple: …
18 3-Tier Example: Clients Invoke Individual Servers [diagram: a client process invokes a Web server, which in turn invokes storage servers, each running on its own computer]
19 An example of horizontal distribution of a Web service. Round-Robin Scheduling: requests are dispatched to the servers in turn; each dispatch computes i = (i + 1) mod n and selects server i. The algorithm's strength is its simplicity: it keeps no per-connection state, so it is a stateless scheduling policy.
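The round-robin rule above can be sketched in a few lines (the server names are hypothetical placeholders):

```python
class RoundRobinScheduler:
    """Round-robin dispatch: i = (i + 1) mod n selects server i."""
    def __init__(self, servers):
        self.servers = servers
        self.i = -1                  # so the first dispatch selects server 0

    def next_server(self):
        self.i = (self.i + 1) % len(self.servers)
        return self.servers[self.i]

sched = RoundRobinScheduler(["web1", "web2", "web3"])   # hypothetical names
picks = [sched.next_server() for _ in range(5)]
# cycles web1, web2, web3, web1, web2 with no per-connection state
```

Only the single counter `i` is kept; nothing per connection, which is why the slide calls the policy stateless.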
20 Stateless Servers Never keep accurate information about the status of a client after having handled a request: Don’t record whether a file has been opened (simply close it again after access) Don’t promise to invalidate a client’s cache Don’t keep track of your clients Consequences: Clients and servers are completely independent State inconsistencies due to client or server crashes are reduced Possible loss of performance because, e.g., a server cannot anticipate client behavior (think of prefetching file blocks)
21 Stateful Servers. Keeps track of the status of its clients: Record that a file has been opened, so that prefetching can be done. Knows which data a client has cached, and allows clients to keep local copies of shared data. Observation: The performance of stateful servers can be extremely high, provided clients are allowed to keep local copies. Session state vs. permanent state. As it turns out, reliability is not a major problem.
22 Thin Client
23 The X Window System. X uses a client-server model: an X server communicates with various client programs. The server accepts requests for graphical output (windows) and sends back user input (from keyboard, mouse, or touchscreen). This client-server terminology, with the user's terminal as the "server" and the remote or local applications as the "clients", often confuses new X users, because the terms appear reversed. But X takes the perspective of the program rather than that of the end-user or of the hardware: the local X display provides display services to programs, so it acts as a server; any remote program uses these services, thus it acts as a client. Practical examples of remote clients include: administering a remote machine graphically; running a computationally intensive simulation on a remote Unix machine and displaying the results on a local Windows desktop machine; running graphical software on several machines at once, controlled by a single display, keyboard and mouse. The protocol tends to be heavyweight. Other examples of similar systems? VNC, Remote Desktop.
24
25 Why the hype now? Desktops are out of control: users load and run whatever they like on "their" workstations. Hard to support, no standard software. Hard to manage configuration, no telling what's installed. Risk of data loss, company data not safe. Businesses need to "prove" compliance with IT security policies.
26 Shared Services. Many users share one machine. A user can "run away" with RAM or CPU. Different users may need different apps that conflict with each other. Can be relatively simple to deploy: Sun Global Desktop, Citrix XenApp, Windows Terminal Services.
27 Shared Services (à la Terminal Services, XenApp)
28 Desktop Virtualization. Similar to shared services but: Each user gets "their own" virtual machine. Machines can be spawned on demand from a golden image. A desktop controller server manages user connections, VM power states, load balancing. Users CAN share a machine if appropriate. Citrix XenDesktop, VMware View.
29 Desktop Virtualization
30 Production Example – VMware View
31 P2P Systems. Use the vast resources of machines at the edge of the Internet to build a network that allows resource sharing without any central authority. More than a system for sharing pirated music/movies.
32 Characteristics of P2P SystemsExploit edge resources. Storage, content, CPU, Human presence. Significant autonomy from any centralized authority. Each node can act as a Client as well as a Server. Resources at edge have intermittent connectivity, constantly being added & removed. Infrastructure is untrusted and the components are unreliable.
33 Client/Server Architecture. A well-known, powerful, reliable server is a data source; clients request data from the server. Very successful model: WWW (HTTP), FTP, Web services, etc. * Figure from
34 P2P Architecture. All nodes are both clients and servers: they provide and consume data. Any node can initiate a connection. No centralized data source. "The ultimate form of democracy on the Internet"; "the ultimate threat to copyright protection on the Internet". * Content from
35 2.2 Summary of P2P. Centralized P2P (Napster): the server is an index for addressing. Pure P2P (Gnutella 0.4): no central entities; any terminal can be removed. Hybrid P2P (Gnutella 0.6): dynamic central entities; any terminal can be removed.
36 Napster. Further services: chat program, instant messaging service, tracking program, … Centralized system: single point of failure => limited fault tolerance; limited scalability (server farms with load balancing); queries are fast and an upper bound on their duration can be given. Oct. 4 Anh Le + Tuong Nguyen
37 Napster [diagram: a peer sends (1) a query to the central DB and gets (2) a response; it then sends (3) a download request directly to another peer and receives (4) the file]
38 BitTorrent: Overview. File.torrent contains: URL of tracker, file name, file length, chunk length, checksum for each chunk (SHA1 hash). Seeder: a peer having the entire file. Leecher: a peer downloading the file.
39 Overlay Network A P2P network is an overlay network. Each link between peers consists of one or more IP links.
40 Overlays: All in the application layer. Tremendous design flexibility: topology, maintenance, message types, protocol, messaging over TCP or UDP. The underlying physical network is transparent to the developer, but some overlays exploit proximity.
41 Overlay Graph. Virtual edge: a TCP connection or simply a pointer to an IP address. Overlay maintenance: periodically ping to make sure a neighbor is still alive, or verify aliveness while messaging. If a neighbor goes down, may want to establish a new edge. A new incoming node needs to bootstrap; this can be a challenge under a high rate of churn. Churn: dynamic topology and intermittent access due to node arrival and failure.
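The maintenance loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: `ping_fn` stands in for whatever transport-level liveness probe the overlay uses, and the 30-second timeout is an arbitrary choice.

```python
import time

class OverlayNode:
    """Keeps the neighbor set of an overlay node alive by periodic pings.
    ping_fn stands in for the real liveness probe; now_fn for the clock."""
    def __init__(self, neighbors, ping_fn, now_fn=time.time):
        self.last_seen = {n: now_fn() for n in neighbors}
        self.ping, self.now = ping_fn, now_fn

    def maintain(self, timeout=30.0):
        """Ping every neighbor; drop edges that stay silent past the timeout
        (after which the node would establish a new edge to cope with churn)."""
        for n in list(self.last_seen):
            if self.ping(n):
                self.last_seen[n] = self.now()   # neighbor answered: refresh
            elif self.now() - self.last_seen[n] > timeout:
                del self.last_seen[n]            # edge lost
        return set(self.last_seen)
```

Calling `maintain()` on a timer gives the periodic-ping behavior; the bootstrap problem (finding the first neighbors) is deliberately left out.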
42 Overlay Graph Unstructured overlays e.g., new node randomly chooses existing nodes as neighbors Structured overlays e.g., edges arranged in restrictive structure
43 Gnutella. Pure peer-to-peer: very simple protocol, no routing "intelligence". Constrained broadcast: the lifetime of packets is limited by a TTL (typically set to 7); packets have unique ids to detect loops.
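The constrained broadcast can be sketched as follows; the graph is a hypothetical overlay, and the `seen` set plays the role of the unique packet ids that let nodes detect and drop looping duplicates.

```python
def flood(graph, start, ttl=7):
    """Constrained broadcast, Gnutella-style: forward to all neighbors while
    decrementing the TTL; nodes that already saw the packet id drop it."""
    seen = {start}
    pending = [(start, ttl)]
    while pending:
        node, t = pending.pop()
        if t == 0:
            continue                   # packet lifetime exhausted
        for nbr in graph[node]:
            if nbr not in seen:        # duplicate id -> drop, breaks loops
                seen.add(nbr)
                pending.append((nbr, t - 1))
    return seen

# a small hypothetical overlay
g = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
```

With a small TTL the flood covers only nearby nodes; with the default TTL of 7 it reaches the whole (small) example graph.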
44 Gnutella PING/PONG [diagram: node 1 floods Ping messages through the overlay; each node answers with a Pong for itself and relays the Pongs of the nodes behind it, so node 1 learns known hosts 2-8]. Query/Response works analogously.
45 Free riding. File sharing networks rely on users sharing data. Two types of free riding: downloading but not sharing any data; not sharing any interesting data. On Gnutella, 15% of users contribute 94% of the content, and 63% of users never responded to a query (they didn't have "interesting" data).
46 KaZaA. A hierarchical approach between Gnutella and Napster: a two-layered architecture. Powerful nodes (supernodes) act as local index servers, and client queries are propagated to other supernodes. Each supernode manages a number of children and connects to other supernodes. More efficient lookup than Gnutella and more scalable than Napster.
47 Structured Overlay Networks / DHTs: Chord, Pastry, Tapestry, CAN, Kademlia, P-Grid, Viceroy. [diagram: node identifiers and keys of values are hashed into a common identifier space, and the nodes are connected "smartly"]
48 Hash Tables. Store arbitrary keys and satellite data (values): put(key, value); value = get(key). Lookup must be fast. Calculate a hash function h() on the key that returns a storage cell; in a chained hash table, store the key (and optional value) there.
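The chained hash table described here can be sketched directly:

```python
class ChainedHashTable:
    """put/get with separate chaining: h(key) picks a bucket, and the
    bucket stores its (key, value) pairs in a list (the chain)."""
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # h(key) -> storage cell
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        b = self._bucket(key)
        for i, (k, _) in enumerate(b):
            if k == key:
                b[i] = (key, value)   # overwrite an existing key
                return
        b.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```

Lookup stays fast as long as chains stay short, which is why real implementations resize the bucket array as the table fills.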
49 Distributed Hash Table
50 The Principle of Distributed Hash Tables: a dynamic distribution of a hash table onto a set of cooperating nodes (node A … node D). Example key/value pairs: 1 Algorithms, 9 Routing, 11 DS, 12 Peer-to-Peer, 21 Networks, 22 Grids. Each node has a routing table with pointers to some other nodes, typically a constant or a logarithmic number of pointers. Basic service: the lookup operation, i.e. key resolution from any node, e.g. node D: lookup(9).
51 Chord [MIT]. Consistent hashing (SHA-1) assigns each node and object an m-bit ID. IDs are ordered in an ID circle ranging from 0 to 2^m − 1. New nodes assume slots in the ID circle according to their ID. Key k is assigned to the first node whose ID ≥ k, called successor(k).
52 Consistent Hashing: successor nodes [diagram: a 3-bit identifier circle with nodes 0, 1 and 3 and keys 1, 2 and 6; successor(1) = 1, successor(2) = 3, successor(6) = 0]
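The successor rule can be sketched for the slide's 3-bit example:

```python
M = 3                              # identifier bits, as in the slide's example

def successor(node_ids, k):
    """First node whose ID >= k; wraps to the smallest ID when k lies
    past the last node on the identifier circle."""
    k %= 2 ** M
    ids = sorted(node_ids)
    return next((n for n in ids if n >= k), ids[0])

nodes = {0, 1, 3}                  # nodes on the 3-bit identifier circle
assert successor(nodes, 1) == 1    # key 1 is stored at node 1
assert successor(nodes, 2) == 3    # key 2 is stored at node 3
assert successor(nodes, 6) == 0    # key 6 wraps around to node 0
```

The three assertions reproduce exactly the successor values shown on the slide's identifier circle.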
53 Consistent Hashing – Join and Departure. When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n. When node n leaves the network, all of its assigned keys are reassigned to n’s successor.
54 Consistent Hashing: node join [diagram: when a node joins the identifier circle, the keys between its predecessor and its own ID move from its successor to the new node]
55 Consistent Hashing: node departure [diagram: when a node leaves the circle, all of its keys are reassigned to its successor]
56 Scalable Key Location – Finger Tables. Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table. The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i−1) on the identifier circle: s = successor(n + 2^(i−1)). s is called the ith finger of node n, denoted by n.finger(i).
57 Scalable Key Location: finger tables [diagram: finger tables for nodes 0, 1 and 3 on the 3-bit circle; each entry i lists its start n + 2^(i−1) and that start's successor, e.g. node 0 has starts 1, 2, 4; node 1 has starts 2, 3, 5; node 3 has starts 4, 5, 7]
58 Chord key location. Look up in the finger table the furthest node that precedes the key -> O(log n) hops.
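Finger tables and the lookup rule can be sketched for the slides' 3-bit example; the circular interval arithmetic is the fiddly part, and this is a sketch rather than the full Chord protocol (no joins, no stabilization).

```python
M = 3                      # identifier bits (the slides' example: nodes 0, 1, 3)
SPACE = 2 ** M

def succ_of(node_ids, k):
    """First node whose ID >= k mod 2**M, wrapping around the circle."""
    ids = sorted(node_ids)
    return next((n for n in ids if n >= k % SPACE), ids[0])

def fingers(node_ids, n):
    """The ith finger of n is successor(n + 2**(i-1)), for i = 1..M."""
    return [succ_of(node_ids, (n + 2 ** (i - 1)) % SPACE) for i in range(1, M + 1)]

def _in(x, a, b, closed_right):
    """x in the circular interval (a, b) or (a, b]."""
    if a < b:
        return a < x < b or (closed_right and x == b)
    if a > b:
        return x > a or x < b or (closed_right and x == b)
    return x != a          # a == b: the whole circle except a

def lookup(node_ids, start, key):
    """Hop via the furthest finger that precedes the key: O(log N) hops."""
    n, hops = start, 0
    while True:
        s = succ_of(node_ids, (n + 1) % SPACE)   # n's immediate successor
        if _in(key, n, s, closed_right=True):
            return s, hops                        # s = successor(key)
        nxt = next((f for f in reversed(fingers(node_ids, n))
                    if f != n and _in(f, n, key, closed_right=False)), s)
        n, hops = nxt, hops + 1

nodes = {0, 1, 3}
# matches the finger tables on the previous slide:
assert fingers(nodes, 0) == [1, 3, 0]   # starts 1, 2, 4
assert fingers(nodes, 1) == [3, 3, 0]   # starts 2, 3, 5
```

For instance, `lookup(nodes, 3, 1)` resolves key 1 to node 1, and `lookup(nodes, 0, 6)` resolves key 6 to node 0, as on the consistent-hashing slide.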
59 An example of CAN [diagram sequence, slides 59-64: node 1 initially owns the entire 2-d coordinate space; as nodes 2, 3 and 4 join, the space is repeatedly split so that each node owns one zone; finally a participating node I is highlighted]
65 An example of CAN: insert (slides 65-69). node I::insert(K,V): (1) a = hx(K), b = hy(K); (2) route (K,V) to the point (a,b); (3) the node owning the zone containing (a,b) stores (K,V).
70 An example of CAN: retrieve. node J::retrieve(K): (1) a = hx(K), b = hy(K); (2) route "retrieve(K)" to (a,b).
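The insert/retrieve steps can be sketched as follows. Instantiating hx and hy with salted SHA-1 is my assumption (the slides leave the hash functions abstract), and routing is collapsed into a direct scan over zones here; the real system forwards the message hop by hop, as the next slides show.

```python
import hashlib

def point(key):
    """Map a key K to a point (a, b) = (hx(K), hy(K)) in the unit square.
    Two salted SHA-1 hashes are an assumption, not CAN's specification."""
    hx = int.from_bytes(hashlib.sha1(b"x" + key.encode()).digest()[:4], "big")
    hy = int.from_bytes(hashlib.sha1(b"y" + key.encode()).digest()[:4], "big")
    return (hx / 2 ** 32, hy / 2 ** 32)

class Zone:
    """A node's rectangular zone [x0, x1) x [y0, y1) plus its local store."""
    def __init__(self, x0, x1, y0, y1):
        self.box, self.data = (x0, x1, y0, y1), {}

    def owns(self, p):
        x0, x1, y0, y1 = self.box
        return x0 <= p[0] < x1 and y0 <= p[1] < y1

def insert(zones, key, value):
    """node I::insert(K,V): hash K to (a,b) and store at the owning zone."""
    owner = next(z for z in zones if z.owns(point(key)))
    owner.data[key] = value
    return owner

def retrieve(zones, key):
    """node J::retrieve(K): same hashing, then read at the owning zone."""
    return next(z for z in zones if z.owns(point(key))).data[key]
```

Because any node computes the same (a, b) for a given K, a retrieve always lands on the zone where the insert stored the pair.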
71 Routing in CAN
72 Routing in CAN (cont) [diagram: a message addressed to point (a,b) is forwarded zone by zone from the node at (x,y) toward the zone that contains (a,b)]
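The greedy forwarding can be sketched on a hypothetical 2 x 2 zone layout. Picking the neighbor whose zone center is closest to the target is one simple way to realize "forward toward (a,b)"; real CAN nodes forward to the neighbor with coordinates closest to the destination point.

```python
def center(zone):
    (x0, x1, y0, y1) = zone
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def owns(zone, p):
    (x0, x1, y0, y1) = zone
    return x0 <= p[0] < x1 and y0 <= p[1] < y1

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def route(neighbors, start, target):
    """Greedy CAN routing sketch: at each hop hand the message to the
    neighbor zone closest to the target point, until a zone owns it."""
    path, zone = [start], start
    while not owns(zone, target):
        zone = min(neighbors[zone], key=lambda z: dist2(center(z), target))
        path.append(zone)
    return path

# a hypothetical 2 x 2 zone layout: A | B on the bottom row, C | D on top
A, B = (0, .5, 0, .5), (.5, 1, 0, .5)
C, D = (0, .5, .5, 1), (.5, 1, .5, 1)
neighbors = {A: [B, C], B: [A, D], C: [A, D], D: [B, C]}
path = route(neighbors, A, (0.9, 0.9))   # reaches zone D in two hops
```

On a d-dimensional grid of n equal zones this greedy walk takes O(d * n^(1/d)) hops, which is CAN's routing cost.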
75 Motivation. Online storage services are getting increasingly popular: Amazon's S3, EMC's Mozy, … They rely on data centers. Challenges: threatened by the single point of failure (Amazon has suffered outages 3 times; Gmail has been down 4 times, …); social networks make downtime harder to hide; high hardware, network and cooling costs. P2P storage: uses the idle resources of users to avoid these costs, but provides low availability because of churn.
76 Motivation: best of both worlds. Combine a data center (stability) with a P2P storage layer (low cost).
77 AmazingStore Design. Combine the data center and a P2P storage system: master nodes in the data center keep a DHT mapping each object to the locations of its replicas, so data stays reachable even during a data-center outage. To handle master node failures, backup servers sit just behind the master nodes, implementing a primary-backup strategy.
78 Availability Improvement. Overall availability jumps from 93.22% to 99.13%; the availability gained at the peer side is 83.8%. [chart: the P2P layer compensates during a power failure]
79 Erasure codes
80 AmazingStore Design: protecting data security. Data is encrypted with a key specified by the user. The system periodically checks for peer eviction.
81 Bloom filter mechanics. Start with a bit vector initialized to all 0s. Each element of S is hashed k times, and each hash location is set to 1. To check whether y is in S, check the k hash locations: if a 0 appears, y is not in S; if only 1s appear, conclude that y is in S. This may yield a false positive.
82 Bloom Filter [diagram: element x is hashed by h1(x), h2(x), …, hk(x) into the bit vector V0 … Vm−1, setting the corresponding bits to 1]
83 Bloom Errors [diagram: elements a, b, c, d have set bits in V0 … Vm−1; x never appeared, yet all of its k bits are already set, so the filter reports a false positive]
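The mechanics of the last three slides can be sketched as follows; deriving the k positions from slices of a single SHA-1 digest is an implementation shortcut, not part of the definition.

```python
import hashlib

class BloomFilter:
    """An m-bit vector with k hash functions: add() sets k bits and a
    lookup checks them. A 0 proves absence; all 1s may be a false positive."""
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m          # initialized with all 0s

    def _positions(self, item):
        # k positions carved out of one SHA-1 digest (an implementation choice)
        d = hashlib.sha1(item.encode()).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1       # each hash location is set to 1

    def __contains__(self, item):
        # if any of the k locations is 0, the item is definitely not in S
        return all(self.bits[pos] for pos in self._positions(item))
```

The false-positive case of the "Bloom Errors" slide is exactly a query whose k positions were all set by *other* elements.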
84 ICP with Summary Cache [diagram: a client's proxy cache consults cooperating proxy caches before going to the Internet; with Summary Cache, each proxy keeps a compact Bloom-filter summary of the other caches' contents]
85 Outline 2.1 Architectural styles 2.2 System architectures 2.3 Architectures versus middleware 2.3.1 Interceptor 2.3.2 General Approaches to adaptive software 2.3.3 Discussion 2.4 Self-management in distributed systems
86 We have talked about the physical architecture. Does middleware also have an architectural style? If it does, how does it affect flexibility and extensibility? Sometimes the "native" style may not be optimal: can we build messaging over RPC? Can we build RPC over messaging? Problem: In many cases, distributed systems/applications are developed according to a specific architectural style. The chosen style may not be optimal in all cases ⇒ there is a need to (dynamically) adapt the behavior of the middleware when needed. Interceptors: intercept the usual flow of control when invoking a remote object. 2.3 Architectures vs. middleware
87 Interceptors. Conceptually, an interceptor is nothing but a software construct that will break the usual flow of control and allow other (application-specific) code to be executed. Request-level interceptors could handle replication; message-level interceptors could handle fragmentation.
88 Adaptive Middleware. Separation of concerns: Try to separate extra functionalities and later weave them together into a single implementation ⇒ only toy examples so far. Computational reflection: Let a program inspect itself at runtime and adapt/change its settings dynamically if necessary ⇒ mostly at language level and applicability unclear. Component-based design: Organize a distributed application through components that can be dynamically replaced when needed ⇒ highly complex, also many intercomponent dependencies. Observation: Do we need adaptive software at all, or is the issue adaptive systems? What interceptors actually offer is a means to adapt the middleware.
89 Outline 2.1 Architectural styles 2.2 System architectures2.3 Architectures versus middleware 2.4 Self-management in distributed systems
90 Self-managing Distributed Systems. Observation: The distinction between system and software architectures blurs when automatic adaptivity needs to be taken into account: self-configuration, self-managing, self-healing, self-optimizing, self-*. Note: There is a lot of hype going on in this field of autonomic computing. (hype: n., slang: a hypodermic injection; a swindle; extravagant promotion or publicity; a drug addict.) 2.4 Self-management in distributed systems
91 Feedback Control Model. Observation: In many cases, self-* systems are organized as a feedback control system.
92 Example: Differentiating Replication Strategies in Globule. Globule: a collaborative CDN that analyzes traces to decide where replicas of Web content should be placed. Decisions are driven by a general cost model: cost = (w1 × m1) + (w2 × m2) + · · · + (wn × mn). The Globule origin server collects traces and does what-if analysis by checking what would have happened if page P had been placed at edge server S. Many strategies are evaluated, and the best one is chosen. (what-if: n., a speculative "what would happen if" analysis; globule: n., a small sphere or droplet.)
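The cost model is just a weighted sum, and the what-if step is a minimum over candidate strategies; the weights, strategy names and metric values below are made-up numbers purely for illustration.

```python
def cost(weights, metrics):
    """Globule's linear cost model: cost = (w1 x m1) + ... + (wn x mn)."""
    return sum(w * m for w, m in zip(weights, metrics))

def best_strategy(weights, candidates):
    """What-if analysis: evaluate each candidate replication strategy's
    metric vector on the trace and pick the one with minimal cost."""
    return min(candidates, key=lambda name: cost(weights, candidates[name]))

# made-up weights and per-strategy metrics, purely for illustration
weights = (1.0, 0.5)
candidates = {"no-replication": (10, 2),     # e.g. (client latency, bandwidth)
              "edge-replication": (4, 6)}
```

Here `cost` for "no-replication" is 1.0*10 + 0.5*2 = 11.0 versus 7.0 for "edge-replication", so the what-if step would choose to replicate at the edge; tuning the weights shifts which metric dominates the decision.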
93 The dependency between prediction accuracy and trace length.
94 Summary Architectural styles System architecturesArchitectures versus middleware Self-management in distributed systems