1 Why does the Cloud stop computing? Lessons from hundreds of service outages
  Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar
2 SoCC '16 Oct '16
3 Bugs, then outages
  Bugs (2 years ago @ SoCC ’14): study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)
  Outages (SoCC '16)
4 Public reports!
  Headline news and post-mortem reports
  Providers’ transparency
  Untapped information
  Pros/cons
  + Detailed root causes
  + Detailed chain of failures
  + Downtime durations
  + Zero false positives
  -- (Very) incomplete
  -- (High) variance
5 COS: Cloud Outage Study
  32 services
  597 outages between …
  ~70% report downtimes
  ~60% report root causes
7 Downtime/year
  On average
    6% of services do not reach 99% availability (>88 hours of downtime)
    78% do not reach 99.9% (>8.8 hours)
  Worst year
    31% do not reach 99%
    81% do not reach 99.9%
  5-nine availability? It’s just a dream?
  [Chart: downtime per year, in hours]
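The hour thresholds on this slide follow from simple arithmetic on a year of 8,760 hours. A minimal sketch, assuming a non-leap year; the function names are illustrative and do not come from the talk:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def downtime_budget_hours(availability_pct: float) -> float:
    """Maximum yearly downtime (in hours) that still meets the target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float) -> float:
    """Availability (in percent) implied by a given yearly downtime."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    # 99% and 99.9% are the thresholds on the slide; 99.999% is "5-nine".
    for target in (99.0, 99.9, 99.999):
        hours = downtime_budget_hours(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

The 99% and 99.9% budgets come out at about 87.6 and 8.76 hours, which the slide rounds to 88 and 8.8. A 5-nine target leaves only about five minutes of downtime per year.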
8 Root causes (sorted by count)
9 Interesting Root Causes
  Upgrade
    Involves multiple layers
    “a code push behaved differently in widespread use than it had during testing”
    To understand/reproduce, need the full ecosystem
10 Interesting Root Causes
  Human mistakes
    Rare now (vs. 10 years ago)
    Config/Upgrade software bugs
    Bugs in automation process
  Similar issues? But root cause origins are different
11 Config vs. Upgrade Research
  Upgrade is #1; need more research?
  Challenges:
    Multi-layer
    Full ecosystem needed
    Multi-year?
    Reproducible bugs from industry (benchmarks)?
  Paper count in last few years (columns: Conference, Config papers, Upgrade):
    ASPLOS 1; ATC 6 2; DSN 8; EuroSys 3; NSDI; OSDI 4; SOSP …; Total 27
12 Interesting Root Causes
  Bugs
    What types of bugs lead to outages?
    Why are they not masked? (please see the paper)
    “Cascading” bugs
13 “DynamoDB Storage servers query the metadata service for their membership”
  “But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
  “As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
  [Diagram labels: Storage servers; Metadata service; Busy; Timeout; Remove self]
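The cascade quoted on this slide can be illustrated with a toy model. This is only a sketch under stated assumptions, not DynamoDB's actual implementation: the class `StorageServer`, the linear load model in `metadata_response_time`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageServer:
    name: str
    timeout_s: float        # hypothetical retrieval time allowed per membership query
    in_service: bool = True

    def refresh_membership(self, response_time_s: float) -> None:
        # A reply slower than the timeout means the server cannot confirm
        # its membership, so it removes itself from taking requests.
        if response_time_s > self.timeout_s:
            self.in_service = False


def metadata_response_time(base_s: float, per_query_s: float, queries: int) -> float:
    """Toy load model (an assumption): response time grows linearly with load."""
    return base_s + per_query_s * queries


def run_round(servers: List[StorageServer], base_s: float, per_query_s: float) -> int:
    """All in-service servers query at once; return how many remain in service."""
    active = [s for s in servers if s.in_service]
    response = metadata_response_time(base_s, per_query_s, len(active))
    for server in active:
        server.refresh_membership(response)
    return sum(s.in_service for s in servers)


if __name__ == "__main__":
    fleet = [StorageServer(f"s{i}", timeout_s=1.0) for i in range(100)]
    print(run_round(fleet, base_s=0.2, per_query_s=0.005))  # replies arrive within the timeout
    print(run_round(fleet, base_s=0.2, per_query_s=0.01))   # replies now exceed the timeout
```

In this model every server shares one timeout and one dependency, so a modest slowdown in the metadata service removes the whole fleet at once rather than degrading it gradually.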
14 “Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
  “data collection servers … had a failure”
  “this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
  “EBS servers continued trying in a way that slowly consumed system memory”
  [Diagram labels: EBS storage servers; Data collection servers; Failure; Memory leak]
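The failure mode quoted here, retrying in a way that slowly consumes memory, can be sketched with a toy reporter. This is an illustration under stated assumptions, not the EBS code: the `Reporter` class and its `max_pending` parameter are hypothetical. With no cap, undelivered reports pile up while the collector is down; with a cap, the oldest reports are dropped instead.

```python
from collections import deque
from typing import Optional


class Reporter:
    """Toy client that buffers reports until a collector accepts them."""

    def __init__(self, max_pending: Optional[int] = None) -> None:
        # max_pending=None models the latent leak (unbounded buffer);
        # an integer bounds memory by discarding the oldest reports.
        self.pending = deque(maxlen=max_pending)

    def report(self, record: int, collector_up: bool) -> None:
        self.pending.append(record)
        if collector_up:
            # Pretend everything buffered so far is delivered.
            self.pending.clear()


if __name__ == "__main__":
    leaky, bounded = Reporter(), Reporter(max_pending=100)
    for i in range(10_000):          # collector is down for 10,000 attempts
        leaky.report(i, collector_up=False)
        bounded.report(i, collector_up=False)
    print(len(leaky.pending), len(bounded.pending))
```

The bounded variant trades completeness of maintenance data for a fixed memory footprint, which is usually acceptable when the data is not on the critical path of serving requests.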
15 (more in the paper)
16 Where is the SPOF?
  Redundancies, redundancies, redundancies!
  Yes, we did that
  So, why do outages still happen?
17 Failure recovery chain
  Failure Detection
  Failover
  Backups
18 Imperfect failure recovery chain
  Incomplete Failure Detection
  Failover that Fails
  Backups that also Fail
19 Imperfect failure recovery chain: incomplete error/failure detection
  Undetected (specific type of) memory leaks
  Load spikes of authentication requests
  “an unexpected hardware behavior”
20 Imperfect failure recovery chain: failover/recovery that fails
  Bad PLC fails to activate backup power generators
  Failed network switch failover
  DC failover fails due to cold cache problems
  Recovery/re-mirroring storm
21 Imperfect failure recovery chain: multiple failures!
  Double failures of power, network, storage, or server components
  Diverse failures: network+server; storage+fibre cut
  Cascading bugs …
  … that caused many/all redundancies to fail
22 COS Database: email us / check our website
  More correlations between …
    Root cause & downtime
    Service maturity & downtime
    Root cause & impacts
    Root cause & fixes
    Etc.
23 Conclusion
  Features and failures are racing with each other
  “Biggest/worst cloud outages of 20YY”, a new year’s tradition
  Hope COS tells the cause
  Many more examples/details in the papers
24 Thank you! Questions?
  ucare.cs.uchicago.edu
  ceres.cs.uchicago.edu
25 EXTRA
26 Manually extract outage “metadata”
  Classifications:
27 A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered outages.
28 #Outages/year
  On average: 1/3 of the services have at least 3 unplanned outages per year
  Worst year (between ’09-’14): ½ of the services have at least 4 unplanned outages per year
29 Downtime by root cause (sorted by median downtime)
30 Maturity helps?
  Does service maturity help?
  Based on outage count: in 2014, 24 outages occurred in 9-year-old services
31 Maturity helps?
  Based on downtime: in 2014, 267 hours of downtime came from 17-year-old services
  More mature, more popular, more users, more complex
32 Interesting Root Causes
  Load
    Spikes of non-monitored requests
      User requests (monitored)
      Database index accesses
      Authentication requests (cryptographic consumption)
  Misconfiguration
    Ex: traffic redirection
    Take-away: be careful with traffic-related code/configs
  Recovery feedback loop
33 Interesting Root Causes
  Cross (dependencies)
    Amazon Web Services: Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
    Azure: Xbox Live and “52 other services”
    Google DC (co-location): Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)
35 Studies of failures, enough?
  Not all report downtimes
  Most study only a few services (data behind company walls)