1 CMG imPACt 2016 Retrospective
Jonathan Gladstone, Bank of Montreal
John Slobodnik, Cogeco Connexion
2 Conference Summary
Nov. 7-10, 2016, in La Jolla, CA
132 sessions in 13 tracks + annual meeting + vendor sessions + BOFs + evenings
Cloud (13), Big Data (6), Mobile (2), Cyber Security (7), Virtualization (7), System z (26), Network Cap & Perf (5), Storage (1), Application Performance (23), ITSM (7), Analysis & Reporting (24), Server Computing (2), Organization & Culture (9)
Over 200 of your favourite performance & capacity analysts, planners, architects and scientists!
3 John’s Experience
Networking with peers
Attending vendor presentations for tools we are using, considering, or just want to learn more about
Going since 1999
Presentations on all the hot topics of the day
Lots of new ideas to bring back to your shop and implement right away
4 Windows System Performance, Measurement and Analysis (1 of 3)
Jeffrey Schwartz, presenter.
Lots of perfmon explanations are meaningless.
Processor Queue Depth – limited usefulness.
Processor Busy – time not spent in the Idle task.
Kernel mode (internal VM) – memory management, supersedes all else (e.g. drivers) vs. User mode – more is moving here all the time.
Sum of process processor times can be >100% (multi-threading) – most Win processes are multi-threaded, e.g. MS SQL DB.
Run on the same processor if you can; there are benefits.
Scheduling tweaks: 1. Priority boosts – after I/O completion. 2. Quantum stretching – to allow I/O to complete.
HT (Hyper-Threading) – 2 feeds into a processor; back in favor; can kill, help, or land in between.
  Benefits OLTP – don’t use for batch, mixed batch/OLTP or reporting.
  You want a process that gets in and out quickly (<200ms bus txn).
  Max 30% improvement in a shop with 1,600 txns/sec.
  Make sure the DBMS is aware of HT – SQL Server 2008 doesn’t know about HT, Win 2008 knows, SQL Server 2012 knows.
  Jeff’s HT blog – CPU-Z from CPUID.com will tell you if HT is turned on, e.g. with 4 cores on a socket (as on most machines) the software will see 8 if HT is on.
  HT is on every new machine. Wait metrics will tell you where HT is not working.
  Do not run HT on Win VMWare (will remove instability inherent in HT, very workload dependent). If the workload changes from OLTP to batch you will be in trouble.
Multi-core – describe it as physical vs. logical. Chips/sockets are interchangeable terms.
HT is set at the BIOS level, the OS is on top of that and will show you the max # of processors in TaskMgr/PerfMon.
Processor Interrupts – physical machine flying, virtual machine running slowly. PIs can be HW or SW generated: Immediate (e.g. for IO) or Deferred (DPC). More and more SW interrupts (DPC) on newer Win versions.
5 Windows System Performance, Measurement and Analysis (2 of 3)
Working set, i.e. how much memory are you using? An upward trend line or upward zigzag indicates a memory leak.
Task Manager shows Standby & Modified page frame state metrics.
Sys Cache + Avail can be > 100% of memory on the box (double counting).
MS Office Word cleans out “memory mapped” after 5 mins of inactivity.
Cache ops: want to see lazy writes.
IO subsystem: obscure, important for DBMS (disk perf metrics are low level, slide 57); poor disk performance gets blamed on the OS.
Win disk numbers are reliable (not perfect). RAID 5 is good but not if >10% writes; use RAID 10 in that case.
Disk IO times are Service + Queue.
Slide 68: disk controller issues, not a lot of metrics available here.
Network – don’t turn on promiscuous mode.
If VMWare hosts are > 50% busy, metrics become unreliable (utilizations, rates, elapsed times). If metric timestamps are 30 seconds apart you’re good; if they are 35 or 37 seconds apart the metrics are unreliable. Fix: run a script that sleeps for 1 minute, or check agent timestamps.
Sample every 30 seconds on Win, 60 seconds on DBMS.
Be careful turning on process info in Win; use in short bursts on PROD (slide 80, very important).
PerfMon – totally rewritten since Win 2008 – use the highlighter to highlight metrics you are interested in.
In Resource Monitor, ignore the disk utilization time counter. Highest Active Time is useless for Disk in ResMon.
Use SysInternals Process Explorer for troubleshooting.
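The timestamp check mentioned on this slide (samples expected 30 seconds apart, with 35 or 37 second gaps signalling unreliable metrics) could be sketched roughly as follows. This is only an illustration: the function name `unreliable_samples` and the 2-second tolerance are assumptions, not something given in the session.

```python
def unreliable_samples(timestamps, expected=30.0, tolerance=2.0):
    """Flag samples whose gap from the previous sample drifts from the expected interval.

    timestamps: sample times in seconds, in collection order.
    expected:   the configured sampling interval (30 s on Windows per the notes).
    tolerance:  allowed drift in seconds (an assumed value, tune for your agent).
    Returns a list of (index, gap) pairs for the suspect samples.
    """
    flagged = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if abs(gap - expected) > tolerance:
            flagged.append((i, gap))
    return flagged


if __name__ == "__main__":
    # Gaps are 30, 30, 35, 37, 30 seconds: the 35 s and 37 s samples are flagged.
    print(unreliable_samples([0, 30, 60, 95, 132, 162]))
```

Utilizations, rates and elapsed times computed from flagged intervals would then be treated with suspicion or dropped from reports.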
6 Windows System Performance, Measurement and Analysis (3 of 3)
DBMS IO issues, see slide 93.
PerfMon GUI default interval is 1 second, change it to 5. Use relog with PerfMon. Good metrics on slide 102.
Context switching can be 60-70K today; don’t worry unless > 100K.
Page reads/sec is not only page file activity (has kernel and DPC).
Demand Zero faults/second occur when allocating memory on DBMS.
Slide 114: memory usage; slide 117 has the page faults formula.
Use ResMon to see Page File utilization in real time.
Hard faults calculation (118) is very important for paging.
Disk counters to use (123, 128, 137): transfer times, %idle, disk queue depth (queue depth is not affected by VM skewing since it’s a snapshot).
% disk time is useless, use %idle.
File Control Bytes: useful for CapPlan, not a lot of good doc on it, good for before and after snapshots.
Scenario: Shared Disk: SQL Server took a hit when something else (Exchange backup) was running even though the SQL Server workload did not change.
For shared mapped drives “Redirector” is a good metric.
Useful metrics on 168. If lots of DPCs then troubleshoot.
7 Achieving Scalability & Performance
Lukas Sliwka presentation (Grindr)
App connects you in real time with people around you
PHP, Ruby, Scala & Java-based, considering Golang, using Amazon cloud now
Took a small team and grew it to 150 people. Asian expansion.
Designed scalable cloud architecture with vendor assistance.
Multi-DC (76 DCs in the world) global cloud topology.
Role is now OPS and culture (70% of time). Put in process, otherwise the best tools are useless.
Today’s top consumer apps expect 0.1s response time
CDN acceleration: use CloudFlare CDN, dynamic RESTful APIs
Railgun service allows your browser to connect to the closest local optimized network
Scala-based framework – Akka and Java
Infrastructure as Code automation
Google Cloud Pub/Sub, Redis, Elasticsearch, Amazon Aurora
Hires must be collaborative.
8 Is your Capacity Available?
Igor Trubin, presenter
How many 9’s of availability do you want to achieve? e.g. I want to run service Foo at 5 nines of reliability.
Paper describes the statistical formulas to use to determine the answer.
Can be used for cluster right-sizing. Builds in redundancy for horizontal scaling.
Cheaper commodity nodes can be used to achieve the same availability.
The less redundant configuration has more available individual components/nodes.
HP tool for Incident Mgmt has an MTTR metric available.
Amazon Cloud says they provide 11 9’s.
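To make the "how many 9's" question concrete, here is a minimal sketch using the standard binomial k-of-n availability model. It assumes node failures are independent and that every node has the same availability; it is not taken from the paper, whose exact formulas may differ, and the function names are hypothetical.

```python
from math import comb


def nines_to_availability(nines):
    """Convert a number of nines to an availability fraction, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def cluster_availability(node_availability, nodes, required):
    """Probability that at least `required` of `nodes` independent nodes are up."""
    a = node_availability
    return sum(
        comb(nodes, k) * a ** k * (1.0 - a) ** (nodes - k)
        for k in range(required, nodes + 1)
    )


def chain_availability(stage_availabilities):
    """Availability of clusters in sequence: every stage must be up."""
    result = 1.0
    for stage in stage_availabilities:
        result *= stage
    return result


def min_nodes(node_availability, required, target, limit=100):
    """Smallest cluster size that meets the target availability (right-sizing)."""
    for n in range(required, limit + 1):
        if cluster_availability(node_availability, n, required) >= target:
            return n
    raise ValueError("target not reachable within limit")


if __name__ == "__main__":
    # Commodity nodes at two nines each, one node's worth of capacity needed,
    # target of five nines for the cluster as a whole.
    print(min_nodes(0.99, required=1, target=nines_to_availability(5)))
```

Under these assumptions, redundancy lets relatively unreliable nodes reach a high cluster-level target, which is the point the slide makes about cheaper commodity nodes.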
9 Performance Monitoring and Capacity Planning for Cloud & SaaS
Ellen Friedman / Priyanka Arora, speakers
VMWare/AWS have partnered for IaaS (Oct. 2016)
Use HP Business Availability Centre – synthetic transactions
dynaTrace UEM
dT Digital Experience, synthetic
OpNet appliance for network monitoring
CapPlanning and Perf reporting best practices
Key Volume Indicators (KVIs)
Use R for non-linear regression of business transaction volumes
ShinyDashboard, runs an R script in the background after the user selects metrics.
10 CyberSecurity
2016 is the year of Ransomware.
nist.gov/cyberframework
Imagine if one day cameras and fridges are doing DDoS attacks
Use a CyberSecurity self-analysis tool
Over 80% of successful breaches in the last 5 years have been by
STRIDE and CAPEC are good frameworks to use
Run vulnerability scanners
CyberSecurity is a rapidly growing career field, you cannot be bored.
Have your kids grow up to be…
11 VMWare Training (1 of 3)
ESX hypervisor runs processes (Worlds)
Vmdk – logical disks
“esxtop” shows all running Worlds. vCenter is easier.
A VM is a group of Worlds: 1 for each vCPU, 1 for VMM, 1 for MKS.
The VM kernel has its own Worlds running the hypervisor.
Not many people use Resource Pools to manage user expectations. Could waste resources.
CPU & Mem have Limit, Reservation, Shares (raise priority)
Expandable Reservations – mitigate risk across resource pools
vCenter can manage multiple Datacentres, Hosts, DataStores
DataStore – VMFS (iSCSI/FC), NAS, RDM (direct access to LUN)
You can mix fast & slow disk in the same DataStore Cluster. Why do it? Need to use Affinity so migration will not put you on slower disk.
vSwitches are created in vCenter
Monitoring “# of vMotion migrations” is a key Capacity metric; if high, something is wrong with the Cluster, i.e. imbalanced load on the Cluster
DRS – constantly monitors CPU & Mem on Host (root resource pool), does dynamic load balancing of running VMs
FT (Fault Tolerance) – VMWare will create a separate secondary VM (lock-step technology) on a different host & datastore. You can see it from vSphere. Designed for HW failure. From a capacity viewpoint it uses disk space, CPU & memory. If you have bad SW it gets replicated.
Affinity – CPU Affinity – can pin to physical CPUs on host. DRS Affinity – to keep VMs together or to keep VMs apart.
12 VMWare Training (2 of 3)
CPU Scheduling – schedules vCPUs on physical CPUs on host. Checks CPU utilization every 20ms.
Monitor host, guest & CPU Ready time (in ms in vCenter).
Memory Reclamation Challenges – hypervisor cannot reclaim released VM memory even though the memory is on the “free list” of the VM.
Host Level Swapping is the last resort, should always be 0.
Transparent Page Sharing – enabled by default, need to turn it off due to security issue.
vmmemctl is the balloon driver name on the VM guest, should have 0 activity here.
Linux is greedy with memory.
Memory compression, there may be a vCenter metric for it.
Cap Planning – need AVG and PEAK memory demands of each VM & resource pool.
Use multiple memory performance metrics, not just host memory usage.
Storage monitor – Provisioned (physical / logical) vs. Used Storage & Storage vMotion activity.
Thin Provisioning – do not use for DBMS (Phys->DataStore->vmdk), be aware of fragmentation, thin provisioned disks expand by 1 MB increments.
vSAN – allows you to take advantage of local disk.
Performance metrics from vCenter:
CPU – 1. %util vCPU. 2. vCPU ready time % (ROT >5% bad, VMWare says 10%), is workload dependent. Myth: more vCPUs always help – adding vCPUs arbitrarily may make things worse, check CPU Ready. 3. Co-stop – should be less than CPU Ready Time; VM cannot run due to scheduling constraints.
Memory – “zero” memory – memory over-provisioned.
Headroom chart – Host Phys Mem GB vs. Mem Consumed by VMs Avg GB.
Host Mem Ballooned Avg MB vs. Swap Space in use Avg MB.
If Consumed Host Mem > Active Mem it is normal, otherwise you will have a performance problem.
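Because vCenter reports CPU Ready in milliseconds while the rules of thumb on this slide are percentages, a conversion is needed. The sketch below uses the commonly cited conversion (ready ms divided by the chart interval in ms), assuming the 20-second real-time chart interval. Dividing by the vCPU count to get a per-vCPU figure is an added assumption, and the function names and labels are hypothetical.

```python
def cpu_ready_percent(ready_ms, interval_s=20.0, vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage of the sample interval.

    ready_ms:   CPU Ready summation value for the interval, in milliseconds.
    interval_s: chart sample interval in seconds (20 s assumed for real-time charts).
    vcpus:      number of vCPUs, to normalize a VM-level total to a per-vCPU value.
    """
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


def classify_ready(pct, rule_of_thumb=5.0, vendor_limit=10.0):
    """Label a CPU Ready % against the two thresholds quoted in the notes."""
    if pct >= vendor_limit:
        return "high"   # at or above the 10% figure attributed to VMWare
    if pct >= rule_of_thumb:
        return "watch"  # at or above the 5% rule of thumb
    return "ok"


if __name__ == "__main__":
    pct = cpu_ready_percent(1000)  # 1000 ms ready in a 20 s interval
    print(pct, classify_ready(pct))
```

Since the thresholds are workload dependent, as the slide notes, they are parameters rather than constants.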
13 VMWare Training (3 of 3)
Disk IO
Latency in Kernel Avg ms
Latency in Queue Avg ms
Latency of Device, should be <10ms
“Number of Aborted Disk Commands” metric must be 0, otherwise storage is overloaded on the LUN
“# of Active Commands Queued” – significant double-digit values point to a shortage
Physical DataStores, Thin Provision Report: DataStore Cap vs. Log Disk Cap vs. Log Disk Used
Network – dropped packets, dropped Tx, dropped Rx, adjust VM shares
KPIs: 1. Ratio of Virt to Phys Systems (going up, good) 2. Overall CPU Utilization (going up, good)
Cap Planning – stack Host CPU Usage for each Cluster
vROPS has an upper bound of 50K VMs
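The thin provision report on this slide compares three numbers: datastore capacity, logical disk capacity, and logical disk used. A rough sketch of how those could be combined follows. The function and the derived field names (`overcommit_ratio`, `used_pct`, `headroom_gb`) are illustrative assumptions, not vCenter metric names.

```python
def thin_provision_report(datastore_gb, provisioned_gb, used_gb):
    """Summarize thin provisioning for one datastore.

    datastore_gb:   physical capacity of the datastore.
    provisioned_gb: sum of logical (thin) disk capacities promised to VMs.
    used_gb:        space the logical disks actually consume today.
    """
    return {
        # Above 1.0 means more space has been promised than physically exists.
        "overcommit_ratio": provisioned_gb / datastore_gb,
        # Share of the physical datastore already consumed.
        "used_pct": used_gb * 100.0 / datastore_gb,
        # Physical space left before thin disks can no longer grow.
        "headroom_gb": datastore_gb - used_gb,
    }


if __name__ == "__main__":
    print(thin_provision_report(datastore_gb=1000, provisioned_gb=1500, used_gb=600))
```

Trending `headroom_gb` over time is one way to see how soon an overcommitted datastore would run out of physical space.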
14 Yes, And: What Performing Improv Has Taught Me About Collaborative IT
Glenn Anderson, presenter
Create something out of nothing
One of the actors has focus, someone else takes focus
T-Shaped pros have Social + Technical skills
At a meeting bring a brick, not a cathedral. Think “Yes, and” on top of the brick.
Trust others to build upon your idea and you do the same for them
But “Yes, and” does not replace quality or common sense
Listening – your response has to start with the work the previous person ended with
A book called “Yes, And”
15 Performance Management in the Virtual Datacentre: An Assessment (1 of 2)
Mark Friedman, presenter
Virt overheads are the same for VMWare and Hyper-V
Perf Stretch Factor – right-sizing guests, overcommitted hosts, under-provisioned guests
Guest will suffer if under-provisioned or if the host is over-provisioned
Guest VM measurement anomalies
Paravirtualization brings app perf on a VM guest closer to a native physical Win server, through VMTools
IO intensive workloads will suffer in VMWare due to overhead
CPUID command tells you the Win server is virtualized
Cannot do WLM in VMWare
Virt host doesn’t know when a guest frees up a page (see Win Mem Mgmt paper from a year ago)
vMotion for important servers in PROD: turn it off, it is potentially disruptive
15% overhead cost to moving a workload from a physical to a virtual server
Sometimes using 2 virt guests to run 1 physical system workload works well
16 Performance Management in the Virtual Datacentre: An Assessment (2 of 2)
Host & Guest Pending Interrupt Time – good metrics
Device response time IO – rdtsc cmd is initiated, 2 per IO on guest (instrumenting emulation has a cost). See his blog for details.
Win/Sys/Processor Queue Length
Mem Avail Bytes
CPU Ready
Memory: VMWare takes mem pages from all guests on host when ballooning since the host doesn’t know what pages guests are using. Some guests will be adversely affected, some won’t (see Mark’s paper from last year)
Aggregate guest CPU usage / # of guest machine processors
%Processor Time is reliable, esxtop uses the same sampling (on blog entry)
Disk bytes/sec is a good metric
See “Computer Performance By Design” blog
17 Other Presentations
Andreas Grabner – DEV/OPS dT
Elisabeth Stahl – Evolving your DC in a Cloud World
Panel Discussion – IoT
Jeff Schultz – TeamQuest Predictive Analytics and Advanced Reporting
Keynote each day
18 Jonathan’s Experience (1 of 4)
Attended 27 presentations
  2 Cloud, 3 Big Data, 1 Cyber Security, 11 System z, 1 Application Performance, 1 ITSM, 5 Analysis & Reporting, 2 Organizational & Cultural
First time as Director & International Officer
  Extra meetings & co-ordination work
Now, just a few examples…
19 Jonathan’s Experience (2 of 4)
Using Hadoop, MapReduce & Spark to Process Big Data (Big Data)
  Three-part “training” session
  Introduction to concepts, tools and methods used for big data processing
  Odysseas Pantakalos, SYSNET International
Is Your Capacity Available? (A&R)
  Incorporate availability requirements into capacity planning
  Equations for availability of sequences of clusters of servers in complex systems
  Igor Trubin, CapitalOne Bank
Hadoop: gave good book references; described topologies, NameNodes & DataNodes, MapReduce v1 & v2, using HDFS or YARN for storage; Flume aggregates data inputs to NameNodes; Spark runs data transformations
20 Jonathan’s Experience (3 of 4)
Blockchain: What, Why, How (C.Sec.)
  Introduction to blockchain concepts
  What workloads are suited to it?
  Standards, tools, use cases
  Elisabeth Stahl, IBM
z/OS various sessions (System z)
  Performance topics, z13 updates, best practices, dos and don’ts, use of sub-capacity GCPs, SMT measurements, tuning basics, Spark & SMF…
  Assorted high-profile experts included Scott Chapman, Glenn Anderson, Kathy Walsh, Frank Kyne, Ivan Gelb…
21 Jonathan’s Experience (4 of 4)
Risky Business: Modelling & Predicting System Performance (A&R)
  New work being done on defining and quantifying types of risk inherent to predictions from models
  Ways to mitigate risk
  Jeff Buzen, independent consultant
Achieving Scalability & Performance >1M Concurrent Users at a Time (Cloud)
  Ground-up development of an Agile, DevOps application development & delivery pipeline residing in cloud
  Small corporation, but multi-national presence, 24/7 operation with fast response and legal & moral consequences for some kinds of failures of delivery
  Lukas Sliwka, CTO, Grindr
Risky Business: forecasting risk, modelling risk, workload characterization risk, solution risk, applicability risk; risk increases with the level of data summarization and with the assumptions required in the model. Dr. Jeff Buzen was one of the developers of BEST/1 (now BMC Perform & Predict); he’s a university professor of statistics, author and significant contributor to the field.
22 imPACt 2017
Nov. 6-9, 2017, in New Orleans, LA
Lots of sessions in lots of tracks again, plus hackathon and other new features
Excellent deal available until March 1st: US$999 for conference + annual membership