SelectHealth Big Data Presentation - BYU

1 SelectHealth Big Data Presentation - BYUMarch 15, 2017 ...
Author: Opal Powell
0 downloads 0 Views

1 SelectHealth Big Data Presentation - BYUMarch 15, 2017

2 Where Does the Data Come From?

3 Tableau Dashboard

4 Tableau Dashboard

5 John’s Opinions Disclaimer: The following presentation is presented independently by John Kerley-Weeks. Content has not been approved or verified by Intermountain Healthcare, SelectHealth or affiliated organizations for distribution. Opinions presented are solely those of the presenter. I am not a statistician, doctor or a lawyer. I am a Data Architect. Facts and figures presented are intended as general examples and have not been fully vetted for accuracy. Sources are typically not provided and are not intended to be shared or published outside of today’s class. Do not record, share or publish contents

6 Who is John? SelectHealth Sr. Data Architect (Data Sherpa)Masters in Information Technology - Carnegie Mellon - Summa Cum Laude Utah State University – 12 years Adjunct Faculty – Hadoop, Data Warehouse Data Systems Manager – Utah Climate Center and Remote Sensing/GIS Lab Silicon Valley – 20 years Sun Microsystems Oracle Openwave (Phone.com) Quintus (grew from 90 to 500 employees, public IPO 1999)

7 Intermountain Healthcare Profile22 hospitals 37,000 employees 600,000 members 25% market share 200 clinics 1,000 employed physicians 1975 1983 1994

8 Intermountain – MissionMission - Helping people live the healthiest lives possible Vision – Be a model health system by providing extraordinary care and superior service at an affordable cost Values – Integrity, Trust, Excellence, Accountability, Mutual Respect SelectHealth Mission – Helping people live the healthies lives possible

9 Intermountain – Not-for-profit Commitmenthttps://intermountainhealthcare.org/-/media/files/trustee-resource-center/topical-information-pdfs/not-for-profit.pdf?la=en As a not-for-profit health system, Intermountain Healthcare makes decisions differently. • We’re committed to going beyond treating illness and injury by encouraging healthy behaviors that can lead to longer, more fulfilling lives. • Helping those in need We donate hundreds of millions of dollars in financial assistance to those unable to pay. • Setting community-based priorities. We operate without shareholders. Instead, we ask: What will benefit our communities and patients? • Affordable Healthcare We’ve been recognized for our achievements in providing excellent outcomes to patients and keeping costs more affordable.

10 Avoiding Overspending

11 Emergency Department Visits 65% could be treated elsewhere

12 Intermountain Analytics Supporting Clinical and Cost Improvements(Illustrative Examples) Heart Failure Mortality Rates Less than half the national average Sepsis Mortality Rates Less than half the national average. The “unknown killer” 50% of patients dying in US hospitals had sepsis Reduction of Elective Inductions in Pregnancy Elimination of elective inductions prior to 39 weeks. NICU utilization reduced by nearly 50%. Projected $5.3B annual savings if adopted nationwide. Colon Surgery $1.2 million annual savings, Length of Stay decreased from 8.44 to 6.75 days, while maintaining or improving clinical quality. - Computerworld Business Intelligence Award – Driving Process Change with BI Surgical Price Reduction Nearly $60M cost reduction for knee and hip replacement over 3 years while improving clinical outcomes Other Clinical Quality Improvements Diabetes, asthma, community acquired pneumonia (CAP), blood utilization, 50+ standardized care processes Healthcare Operations Improvements Lab operations, supply chain, operating room (OR), hospital operations, patient satisfaction, core measures, meaningful use, population health, shared accountability

13 Seemingly Unlimited Potential for Data Driven Cost Savings and HealthMost Expensive Conditions treated in U.S. Hospitals Condition Costs Patients Sepsis $20,298M 1,094K Osteoarthritis (“wear and tear” arthritis: knee, hip, back, neck) $14,810M 964K Complications of device, implant or graft $12,881M 699K Liveborn (pregnancy) $12,390M 3,818K Acute Myocardial Infraction (heart attack) $11,504M 612K Spondylosis and back problems $11,218M 667K Pneumonia $10,570M 1,114K Congestive heart failure 970K Coronary atherosclerosis (blockage of heart arteries) $10,400M 605K

14 Predicting Hospital Readmissions    Comparison of LACE to Logistic Regression (neither are very good) LACE AUC = 0.628, Logistic Regression AUC = 0.650 (Area Under the Curve of is bad, is perfect) LACE Length of stay Acuity of admission Charlson Comorbidity Number of visits to Emergency Department                                                            Calculation Algorithm:

15 Predicting Hospital Readmissions    Excellent Paper comparing LASSO and GBM Data Models LASSO AUC = 0.69 Least Absolute Shrinkage and Selection Operator GBM AUC = 0.70 Gradient Boosted Machine 14 million in patient discharges from 2013 Current Models using R XGBoost are achieving AUC > 0.75                                                            Calculation Algorithm: Charlson Comorbidity

16 Decision Trees Income vs. Age

17 What is Big Data? INTERMOUNTAIN’S DEFINTIONUsing additional data sources and new analytic tools to produce superior, actionable analytic insights (not previously possible or cost effective) leading to Improved healthcare outcomes Reduced cost Healthier people

18 Interview Questions You Should be able to AnswerEnterprise Data Warehouse (EDW) Describe the difference between OLAP and OLTP Describe the difference between Kimball and Inmon’s Philosophies Big Data What are the three or five V’s of Big Data Describe three challenges of big data deployments Provide two examples of the importance of Value to Big Data SQL Analytics Describe the differences between the three types of RANK’ing functions Describe the difference between partitioning and windowing functions

19 New Data Sources and Analytic ToolsPOSSIBILITIES FOR NEW, SUPERIOR INSIGHTS Additional Data Sources Genomics data High volume, streaming clinical device data Unstructured physician notes, discharge summaries and clinical documentation. Population health data New ACA, CMS and NIH data Consumer data . . . Potential New Tools Low cost, high volume storage (Commodity hardware and Hadoop) Natural Language Processing NLP (Convert unstructured data to structured data with clinical significance) Machine learning correlation and causation discovery (accelerating time to insight and increasing analytic capacity)

20 What are the Five V’s of Big Data?

21 Five V’s of Big Data

22 Value WHAT HAS BIG DATA ACCOMPLISHED IN HEALTHCARE?Reduction in readmission rates - achieving rates less than half the national average. 90% echocardiogram diagnosis accuracy through advanced machine learning and correlation analysis 40% – 50% reduction in cost of care for some high-cost patients through “hotspotting” Personalized medicine through genetic matching of patients to cord blood for cancer treatment “Next top 5%” - predicting the emerging high-cost patient population Early prediction and intervention of neo-natal sepsis through real-time device monitoring Suicide prevention using predictive analytics to pinpoint mental health risk factors and identify potentially harmful behavior before it occurs Reduction in first-year attrition of nurses, achieving outcomes less than half the national average. Speed to actionable insight using advanced machine learning to achieve analytic insight 10 to 100 times faster than current methods Accounts Receivable - 23% increase in patient payments

23 Volume THE AMOUNT OF DATA VOLUME VARIETYClinical device data – Ventilators, Respirators, Monitors, etc. Genomic Data CDR Clinical Documentation Personal Device Data External Population Health Data VALUE VELOCITY VERACITY

24 Variety STRUCTURED VERSUS UNSTRUCTURED DATA VOLUME VARIETYStructured: Electronic health records, medical claims, actuarial data, clinical data, health information exchange, genomics, biometrics. Unstructured: digitized medical records, handwritten clinicians notes, radiology films, MRI, CT, and other images. VALUE VELOCITY VERACITY

25 Velocity HOW FAST DATA IS ACQUIRED AND ANALYZED VOLUME VARIETYReal-time monitoring data such as blood pressure, anesthesia, bedside heart monitors, Medium Velocity (diabetic glucose and insulin pumps, blood pressure readings, EKGs, personal wellness devices.) VALUE VELOCITY VERACITY

26 Veracity THE INTEGRITY OF THE DATA VOLUME VARIETYAlignment of data from disparate EHR, EMR, and HIE systems Diagnostic, procedure, and claims coding Unique patient, provider, and organization identification Privacy, security, and compliance Semantic data interoperability for high-confidence analytics VALUE VELOCITY VERACITY

27 Star Schemas Enrollment, Claims, Prescriptions, Care Management

28 Data Warehouse StructuresWhat is a Data Warehouse?

29 Intermountain Healthcare Enterprise Information ArchitectureInsights, Algorithms, Decision Support, Embedded Analytics Data Sources (Operational and clinical systems, external systems, research, medical devices, etc.) Relational EDW Business Intelligence (BI), Reporting, Analytics and Applications Data Lake Data Integration Layer Semantic Management Master Data Management Federated Search Cloud Data Quality, Metadata and Data Lineage Management Processes: Data Governance, Data Quality, Master Data Management, Semantic and Metadata Management, Security, Compliance, Legal

30 Intermountain Healthcare Enterprise Information ArchitectureAggregated Data Repositories Relational Data Warehouse Data Lake Federated Search Cloud Centralized persistence of high volume data ELT Repository ODS for Relational Data Warehouse Schema on Read Repository Data Discovery Pre-processing for Relational Data Warehouse Data Archive Genomic Data, High volume Device Data, OLTP transaction Logs, Security transaction logs. Relational Data Store Sematic Data Store Data marts, cubes, etc. Inpatient Clinical, Ambulatory Clinical, Financial, Supply Chain, Health Plan, Research Index of external data Index of unstructured data Data discovery Clinical Documents, Enterprise Documents, Research subscriptions, Government sources, etc. Alternate high volume storage Alternate high capacity computing Oracle RDBMS, HDD, SQL Server Hortonworks Hadoop, Spark Solr Cerner Healthe Intent, AWS, Azure, etc.

31 Intermountain Healthcare Enterprise Information ArchitectureData Lake – Hadoop Ecosystem

32 Intermountain Healthcare Enterprise Information ArchitectureAnalytic Maturity

33 Intermountain Healthcare Enterprise Information ArchitectureAnalytic Maturity

34 Intermountain Healthcare Enterprise Information ArchitectureBI, Reporting, Analytics and Applications Analytic Types Methods and Tools Descriptive (What has happened) Financial and operational reporting, cost analysis, quality and compliance, meaningful use, etc. Diagnostic (Why things happened) Outcomes analysis, gaps in care, fraud detection, etc. Predictive (What will happen) Population health risk stratification, contract forecasting and modeling, diagnostic clinical decision support, etc. Prescriptive (What should happen) Care process models, prescriptive clinical decision support, precision medicine, etc. Delivered Reports ed, scheduled, etc. (Cognos) Self-Service Reports User executed on demand (Cognos) Self-Service Dashboards Analyst configured and user queried on demand (Tableau) Self Discovery Analysts and data managers exploring cubes and indexes (Power BI, Solr search) Predictive and Algorithms Statistical analyst developed (R, SAS, SPSS, etc.) Analytic Applications Purpose built analytics (Optum, Archimedes, etc.), custom alerts and embedded analytics (Cerner Healthe Intent) Machine Learning / Cognitive Learning Advanced analytics (Ayasdi, GNS Healthcare)

35 Day in the Life Videos A Day in the Life of a Data Analyst: https://youtu.be/xbTqJiZ8nhA Intermountain Using Data in Healthcare: https://youtu.be/TzWkzwuY9Xo Why I Love Working at Intermountain: https://youtu.be/07ugt1-3a90

36 Protected Health Information (PHI)Social Security Number What is PII? Names What is PCI? Telephone Number Fax Number What percent of people in U.S. can be uniquely identified with just a ZIP code, birthdate and gender? Address Web URL IP Address https://www.forbes.com/sites/adamtanner/2013/07/22/how-just-a-zip-code-can-tell-a-marketer-exactly-who-you-are/#649ced6e426a Member ID# Account # Gender Age Birth Date Death Date City State

37 SOME EXAMPLES 590 ASK – Analysts Sharing KnowledgeASWG (Analytical Services Work Group) - BSWG (Better SQL Work Group) - iCog (Intermountain Cognos User Group) - iTab (Intermountain Tableau User Group) - iStats (Intermountain Statistics Special Interest Group) - All+Da+Tryx (Intermountain Alteryx User Group) - Intermountain Ayasdi Implementation Team - Intermountain Women in Analytics Community (WIA)

38 Data and Table Size

39 SOME EXAMPLES

40 Nursing Analysis

41 BEFORE Data PreparationGraphical approach to reducing SQL and R coding PERSON DATA SQL: WITH DIM_DATE as ( SELECT trunc(date_dt, 'month')- interval '11' month as period_start_date, date_dt as period_end_date, 'year' as time_period FROM hrp.effective_DATE_DIM WHERE date_dt between trunc((trunc(sysdate, 'year' ))- interval '2' year) and trunc(sysdate + interval '1' year, 'year')- interval '1' day AND end_of_qtr_flg = 'Y' and qtr_no = 4 UNION -- quarter SELECT trunc(date_dt, 'month')- interval '2' month as period_start_date, date_dt as period_end_date, 'quarter' as time_period AND end_of_qtr_flg = 'Y' --month SELECT trunc(date_dt, 'MM') as period_start_date, date_dt as period_end_date, 'month' as time_period -- WHERE date_dt between trunc(trunc(sysdate, 'year' )) and trunc(sysdate + interval '1' year, 'year')- interval '1' day AND end_of_mnth_flg = 'Y' ) , ORG_ASSIGNMENT as ( SELECT measure_type, period_start_date, period_end_date, time_period, org_short_nm , job_family_txt ,job_cd, job_title_txt , assignment_id, person_id , range_penetration, working_hrs, dept_tenure_yrs, job_family_tenure_yrs, job_cd_tenure_yrs, hourly_rate_amt , benefits_elig , grade_level_txt, supervisor_flg, compa_ratio_val, eeo_group_nm, eeo_group_cd FROM ( SELECT start_dt, end_dt, extended_end_dt , 'assigment_data' as measure_type , CASE WHEN regexp_substr(org_dept_nm, '([0-9])+-([0-9])+') is not null then trim(regexp_substr(org_dept_nm, '([0-9])+-([0-9])+')) WHEN regexp_substr(org_dept_nm, '([A-Z])+\s([0-9])+ ') is not null then trim(regexp_substr(org_dept_nm, '([A-Z])+\s([0-9])+ ')) END AS org_short_nm , job_family_txt, job_title_txt, job_cd , assignment_status_txt, prmry_assignment_flg, assignment_id , person_id -- facts , range_penetration , working_hrs , dept_tenure_yrs , job_family_tenure_yrs , job_cd_tenure_yrs , hourly_rate_amt , benefits_elig_flg as benefits_elig , grade_level_txt , supervisor_flg , compa_ratio_val , eeo_group_nm, eeo_group_cd FROM hrp.assignment_dim_hr WHERE end_dt > (sysdate - interval '3' year) AND prmry_assignment_flg = 'Y' AND assignment_status_txt in ('Active Assignment', 'Additional Assignment', 'Paid LOA Assignment', 'Transfer, Active Assignment')) asg JOIN DIM_DATE ON period_end_date BETWEEN start_dt AND end_dt AND period_end_date between start_dt and extended_end_dt , ORG_PERSON AS ( SELECT 'person_data' as measure_type , 'http://foo_bar' as url_link --date dim , period_start_date, period_end_date, time_period -- org dim , org_short_nm , job_family_txt , job_title_txt, job_cd , ihc_tenure_yrs , perf_rating_txt , perf_rating_short_txt , age_yrs , ethnic_origin_txt , sex_cd , vet_status_txt FROM hrp.person_dim_hr per JOIN ORG_ASSIGNMENT asg on per.person_id = asg.person_id and period_end_date BETWEEN per.start_dt AND per.end_dt WHERE per.end_dt > (sysdate - interval '3' year) and person_typ_active_flg = 'Y' , DENORMALIZED_CT_PERSON as ( -- dim period_start_date, period_end_date, time_period , org_short_nm , url_link , measure_type , ' IHC Tenure Yrs' as measure_name , 'AVG' as calc_type , 'The average number of years employees have worked for Intermountain Healthcare.' as measure_definition , 4000 as sort_order -- measures , ihc_tenure_yrs as fact_number_format , null as fact_decimal_format , null as fact_currency_format , null as fact_percent_format , null as age_yrs , null as ethnicity_white , null as ethnicity_other , null as gender_female , null as gender_male , null as veteran , null as retirement_eligible , null as retirement_not_eligible , null as female_manager , null as ethnic_manager , null as female_executive , null as ethnic_executive , null as head_count FROM ORG_PERSON UNION ALL , ' Individuals Age' as measure_name , 'The average age in years of employees .' as measure_definition , null as fact_number_format , age_yrs as fact_decimal_format , null as ihc_tenure_yrs , ' Ethnicity White' as measure_name , 'SUM' as calc_type , 'The percent of employees who have identified their ethnic origin as white .' as measure_definition --measures , 1 as fact_percent_format , 1 AS ethnicity_white WHERE ethnic_origin_txt = 'White (Not Hispanic or Latino)' , ' Ethnicity Other' as measure_name , 'The percent of employees who have identified their ethnic origin as one of the following: Black or African American, Two or More Races , Native Hawaiian/Other Pacific Islander, Asian, Hispanic or Latino, or American Indian or Alaska Native.' as measure_definition -- use in tool tip , 1 AS ethnicity_other WHERE ethnic_origin_txt != 'White (Not Hispanic or Latino)' , ' Gender Female' as measure_name , 'The percent of employees who have identified their gender as female.' as measure_definition , 1 AS gender_female WHERE sex_cd = 'F' , ' Gender Male' as measure_name , 'The percent of employees who have identified their gender as male.' as measure_definition , 1 AS gender_male WHERE sex_cd = 'M' , ' Veteran' as measure_name , 'The number of employees who have identified themselves as a veteran.' as measure_definition , 1 as fact_number_format , 1 as veteran WHERE vet_status_txt is not null AND vet_status_txt != 'Not a Veteran' , ' Retirement Eligible' as measure_name , 'The percent of employees who are eligible for retirement. Tenure is 5 or more years and age is 57 and older.' as measure_definition , 1 as retirement_eligible WHERE ihc_tenure_yrs >= 5 and age_yrs >= 57 , ' Retirement Not Eligible' as measure_name , 'The percent of employees who are not eligible for retirement. Tenure is 5 or more years and age is less than 57.' as measure_definition , 1 as retirement_not_eligible WHERE ihc_tenure_yrs >= 5 and age_yrs < 57 --dim , measure_type , ' Managers Female' as measure_name , 'The percent of managers who identified as female . (Manager = EEO Groups: First/Mid Level Officials and Managers and Executive/Senior Level Officials and Managers)' as measure_definition , 1 as female_manager WHERE eeo_group_cd in (1,10) and sex_cd = 'F' , ' Managers Ethnic' as measure_name , 'The percent of managers who identified as ethnic . (Manager = EEO Groups: First/Mid Level Officials and Managers and Executive/Senior Level Officials and Managers)' as measure_definition , 1 as ethnic_manager WHERE eeo_group_cd in (1,10) and ethnic_origin_txt != 'White (Not Hispanic or Latino)' , ' Executives Female' as measure_name , 'The percent of executives who identified as female . (Executive = EEO Group: Executive/Senior Level Officials and Managers)' as measure_definition , 1 as female_executive WHERE eeo_group_cd in (10) and sex_cd = 'F' , ' Executives Ethnic' as measure_name , 'The percent of executives who identified as ethnic . (Executive = EEO Group: Executive/Senior Level Officials and Managers)' as measure_definition , null fact_number_format , 1 as ethnic_executive WHERE eeo_group_cd in (10) and ethnic_origin_txt != 'White (Not Hispanic or Latino)' , ' Total Persons' as measure_name , 'The total number of people . (Person Type: Active Employee, Active LOA Employee, Employee & LTD Recipient, Employee & Retiree; Primary Assignments : Active Assignment, Additional Assignment, Paid LOA Assignment, Transfer, Active Assignment)' as measure_definition , 1 as head_count -- create the measure subsets to use in creating sets for denominators --subset for all measures but supervisor and leaders , MEASURES_SUBSET_1 as (SELECT distinct measure_type, measure_name, measure_definition from DENORMALIZED_CT_PERSON WHERE measure_name not in (' Managers Female', ' Managers Ethnic', ' Executives Female', ' Executives Ethnic')) --subset for supervisors , MEASURES_SUBSET_2 as (SELECT distinct measure_type, measure_name, measure_definition WHERE measure_name in (' Managers Female', ' Managers Ethnic')) -- subset for leaders , MEASURES_SUBSET_3 as (SELECT distinct measure_type, measure_name, measure_definition WHERE measure_name in (' Executives Female', ' Executives Ethnic')) -- denominator for entire data set , SUM_DENOMINATOR_1 as (SELECT period_start_date, period_end_date, time_period, org_short_nm --, job_family_txt --, job_title_txt, job_cd , SUM(1) as DENOMINATOR GROUP BY period_start_date, period_end_date, time_period, org_short_nm ) -- denominators for managers , SUM_DENOMINATOR_2 as (SELECT WHERE eeo_group_cd in (1,10) -- denominator for executives , SUM_DENOMINATOR_3 as (SELECT WHERE eeo_group_cd in (10) -- Cartesian join of appropriate denominators with the corresponding measures and union all to single set , FINAL_DENOMINATOR as ( SELECT period_start_date, period_end_date, time_period, org_short_nm, measure_type, measure_name, measure_definition, denominator FROM SUM_DENOMINATOR_1 JOIN MEASURES_SUBSET_1 ON 1=1 FROM SUM_DENOMINATOR_2 JOIN MEASURES_SUBSET_2 ON 1=1 FROM SUM_DENOMINATOR_3 JOIN MEASURES_SUBSET_3 ON 1=1 ORDER by period_end_date, org_short_nm) -- summarize the data , SUM_ORG_PERSON as (SELECT period_start_date, period_end_date, time_period , measure_name , measure_definition , calc_type , sort_order , SUM(fact_number_format) as fact_number_format , SUM(fact_decimal_format) as fact_decimal_format , SUM(fact_currency_format) as fact_currency_format , SUM(fact_percent_format) as fact_percent_format , SUM(ihc_tenure_yrs) as ihc_tenure_yrs , SUM(age_yrs) as age_yrs , SUM(ethnicity_white) as ethnicity_white , SUM(ethnicity_other) as ethnicity_other , SUM(gender_female) as gender_female , SUM(gender_male) as gender_male , SUM(veteran) as veteran , SUM(retirement_eligible) as retirement_eligible , SUM(retirement_not_eligible) as retirement_not_eligible , SUM(female_manager) as female_manager , SUM(ethnic_manager) as ethnic_manager , SUM(female_executive) as female_executive , SUM(ethnic_executive) as ethnic_executive , SUM(head_count) as head_count , SUM(1) as number_observations FROM DENORMALIZED_CT_PERSON GROUP BY period_start_date, period_end_date, time_period , org_short_nm -- , job_family_txt, job_title_txt, job_cd, , sort_order) --- left join the denominator with the summarized data SELECT d.period_start_date, d.period_end_date, d.time_period, d.org_short_nm --, job_family_txt --, job_title_txt, job_cd , url_link, d.measure_type, d.measure_name, d.measure_definition, calc_type, sort_order , fact_number_format, fact_decimal_format , fact_currency_format, fact_percent_format, ihc_tenure_yrs, age_yrs, ethnicity_white, ethnicity_other, gender_female , gender_male, veteran, retirement_eligible, retirement_not_eligible, female_manager, ethnic_manager, female_executive , ethnic_executive, head_count, number_observations, denominator FROM FINAL_DENOMINATOR d LEFT JOIN SUM_ORG_PERSON p on p.period_end_date = d.period_end_date and p.time_period = d.time_period and p.org_short_nm = d.org_short_nm and p.measure_name = d.measure_name BEFORE

42 Data Preparation Graphical approach to reducing SQL and R coding AFTER

43 Data and Analytic Jobs at Intermountain

44 Data and Analytic Careers at Intermountain

45 Data Analytic Roles at Intermountain

46 Healthcare Insurance Use CasesIndividualized Recommendations for specific conditions Individualized recommendations for motivation and responsiveness Medication Adherence Fuzzy Matching across disparate databases Fraud Detection Reducing Customer Churn Evaluation of Genomic Data New Customer Identification Social Media Sentiment Analysis (Townhall Meetings) Comparisons to other providers nationwide Hospital Readmission Reductions Anonymizing data for broader analysis and data modeling

47 Big Data Use Cases

48 Supplemental Slides BYU March 15, 2017Trending and Upcoming Technologies Becoming a Data Scientist Skills and Salary Surveys Diabetes Dataset Additional Information: Data Warehouse Cubes SQL Analytics

49 Spark In-Memory Data Analysis at ScaleUp to 100x, 20x, 2x faster than Hadoop Keeps data in memory rather than disk DAG replaces MapReduce There are limits to DRAM storage size

50 IM Flash 3D Xpoint Non-volatile Memory (data preserved while power is off) 7x more IOPS (Input/Output Operations per second) 1000x lower latency than NAND flash memory Half the cost of DRAM Based in Lehi, UT “3D Cross Point”, not “3D X Point” Intel = Optane, Micron = QuantX

51 IoT 71% -> 76% Adherence improvements in first week of use of CPAPDrug Adherence https://www.slideshare.net/TimothyAungstPharmD/using-technology-to-increase-medication-adherence

52 Computer Aided Data Model Selection and TuningImages: https://www.knime.org/blog/seven-techniques-for-data-dimensionality-reduction

53 Assisted Self-Service Data Analytics

54 Black Boxes and Unicorns – DataRobot CEO Jeremy Achinhttps://www.youtube.com/watch?v=2iaOSny5EeU

55 Individually We Know NothingTogether We Know Everything

56

57 Intermountain Healthcare and SelectHealth Common ToolsDatabases: Oracle Sybase MS SQL Access SQL Query Tools: Oracle SQL Developer TOAD SQL Dbx SQL Server Management Studio Reporting: Tableau Cognos Power BI Analytics: R SAS SPSS Python Pilots: Alteryx Ayasdi Common skills that can give you a competitive edge at Intermountain Healthcare

58 Intermountain Healthcare Data Architect In-demand SkillsBest Practices, Standards and Deployment Strategies Oracle: Indexing Partitioning Query Tuning PL/SQL DB Roles SQL Analytics DataStage: Parallel Design Parameter Sets Error Handling ETL Regular Expressions Git / Subversion Jira Confluence Common skills that can give you a competitive edge at Intermountain Healthcare

59 SQL Analytic FunctionsAggregate: A single value derived from multiple values SUM, COUNT, AVG, MIN, MAX Partition, Group, Window Rank: Sorted order of values RANK, DENSE_RANK, ROW_NUMBER TOP, BOTTOM NTILE, RATIO_TO_REPORT Related Columns: Values of nearby rows and columns FIRST, LAST (FIRST_VALUE, LAST_VALUE) LAG, LEAD Grouping: Dividing large groups into associated small groups ROLLUP, CUBE, LISTAGG

60 Salary and Skills SurveysDownload from: Dice Tech Salary Survey 2016 O’Reilly data science salary survey 2016 Packt Data Scientist job skills 2016 Updated surveys are frequently posted in April each year

61 O’Reilly Data Science Salary Survey2016 Data Science Salary Survey.pdf Survey of 983 data professionals

62

63 O’Reilly Data Science Salary Survey2015 Data Science Salary Survey.pdf Survey of 600 data professionals

64

65

66 Skill Clusters Proximity indicates higher correlation

67 Regression Model of Total SalaryR-squared = .43 Location Title 43% of variation is explained + $13,200 California + $31,280 level: Principal Variable (unit) Coefficient in USD + $10,097 Northeast US + $15,642 title: Architect $30,572 intercept - $3,695 UK/Ireland + $3,340 title: Data Scientist + $1,395 age (per year of age above 18) - $18,353 Europe (except UK/I) + $2,819 title: Engineer - $23,140 Latin America - $3,272 title: Developer + $5,911 bargaining skills (times 1 for “poor” skills to 5 for “excellent” skills) - $30,139 Asia - $4,566 title: Analyst Skills Time Allocation + $382 work week (times # hours in week) + $2,287 cloud computing amount: Most or all cloud computing + $7,819 Meetings: hours / day + $9,036 Meetings: 4+ hours / day - $2,007 gender=Female - $2,710 cloud computing amount: Not using cloud computing + $2,679 Basic exploratory data analysis: hours / week + $1,759 industry=Software (incl. security, cloud services) + $9,747 Spark - $4,615 Basic exploratory data analysis: 4+ hours / day - $891 industry=Retail / E-Commerce + $6,758 D3 + $4,878 Amazon Elastic MapReduce (EMR) + $352 Data cleaning::1 - 4 hrs / week - $6,336 industry=Education + $718 company size: 2500+ + $3,371 Scala - $448 company size: <500 + $2,309 C++ + $8,606 PhD + $1,173 Teradata + $851 master’s degree (but no PhD) + $625 Hive - $1,931 Visual Basic/VBA

68 O’Reilly Data Science Salary Survey2014-data-science-salary-survey.pdf Survey of 800 data analysts and engineers

69 Regression Model of Total SalaryVariable (unit) Coefficient in USD (constant) - + $30,694 Europe - – $24,104 Asia - – $30,906 California - + $25,785 Mid-Atlantic - + $21,750 Northeast - + $17,703 Industry: education - – $30,036 Industry: science and technology - – $17,294 Industry: government - – $16,616 Gender: female - – $13,167 Age per 1 year + $1,094 Years working in data per 1 year + $1,353 Doctorate degree - + $11,130 Position per level12 + $10,299 Portion of role as manager per 1% + $326 R-squared = .58 58% of variation is explained by 27 coefficients or 86 potential predictors Final model obtained through forward stepwise linear regression

70 Most Commonly Used ToolsHighest Demand as reflected by salary

71 Developer_Data_Scientist_Job_Skills_Packt_PublishersDeveloper_Data_Scientist_Job_Skills_Packt_Publishers.pdf Skill Up Data Science & BI Salary & Skills Report Page of 104 pages Developer_Data_Scientist_Job_Skills_Packt_Publishers.pdf 3800 Respondents

72 What tools do people use on a daily basis?

73 2016 Dice.com Salary Survey of 16,301 technology professionalsTop Differentiating Skills by Popularity $109,782 Python $126,249 R $103,027 Microsoft SQL Server $113,708 PostgreSQL $112,097 Oracle $129,400 Hive $113,214 Spark $130,290 NoSQL $142,835 Cloudera $119,207 MongoDB $104,606 Linux Less Popular Top Differentiating Skills $147,811 Cassandra $128,888 Hadoop $122,383 Hbase $112,327 Teradata

74 Common Skill Sets for: Data Visualizers1. JAVASCRIPT 2. HTML 3. CSS 4. PHP 5. HTML5 6. JQUERY 7. JAVA 8. XML

75 Common Skill Sets for: Programmatic Data Wranglers1. PYTHON 2. C++ 3. LINUX 4. BASH 5. PANDAS 6. MATLAB 7. MACHINE LEARNING 8. POSTGRESQL

76 Common Skill Sets for: Big Data Experts1. JAVA 2. HADOOP 3. MYSQL 4. JAVASCRIPT 5. SCALA 6. SPRING 7. SPARK 8. MAVEN

77 Common Skill Sets for: Data Architects1. SQL 2. MS 3. ORACLE 4. SERVER 5. SSIS 6. SSAS 7. DATABASE 8. SSRS

78 Local User Groups Utah Geek Events – Intermountain Data Conference Fall 2017 Utah Hadoop User Group https://www.meetup.com/BigDataUtah/ Big Data Utah OpenWest Conference – July 12-15, 2017 https://www.openwest.org/

79 Meetup.com Groups 728 Tech and Social Meetups in Utah

80 Diabetes Data Public Anonymized Dataset 101,766 Records130 U.S. Hospitals, https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years

81 Kaiser Family Foundation Healthcare Data

82 Hospital Inpatient Days per 1,000

83

84 Supplementary Introduction to Data ScienceJohn Hopkins/Coursera Data Scientist’s Toolbox https://www.coursera.org/specializations/jhu-data-science An excellent introduction to many helpful skills although not enough to be called a true data scientist Total Price to Enroll $423 USD Includes 1 The Data Scientist’s Toolbox $29 USD 2 R Programming $49 USD 3 Getting and Cleaning Data $49 USD 4 Exploratory Data Analysis $49 USD 5 Reproducible Research $49 USD 6 Statistical Inference $49 USD 7 Regression Models $49 USD 8 Practical Machine Learning $49 USD 9 Developing Data Products $49 USD 10 Data Science Capstone $49 USD

85 Data Warehouse Recommended ReadingThe Data Warehouse Toolkit by Ralph Kimball and Margy Ross

86 Data Warehouse StructuresWhat is a Data Warehouse?

87

88 OLTP vs. OLAP OLTP OLAP User Customer Service IT ProfessionalsDecision Makers Purpose Production Business Operations Report Generation Structure 3rd Normal Form Star Schema Timeline Current Historical Data Detailed Filtered Usage Short, Simple Transactions Large Complex Queries Access Read & Write Read Mostly Records per Query Tens Millions Database Size 100 MB-GB 100 GB-TB Measurement Metric # transactions/minute Query Response Time

89 Why separate OLAP and OLTPFast Report Generation Workload on Production Systems Centralized Repository for Departmental Data Maintains Historical Records Single set of Centralized data provides Single Answer

90 3NF = 3rd Normal Form What is the average GPA of Male and Female students in the College of Education each decade?

91 Star Schema (Dimensional Model)Fact and Dimension Table

92 Hospital Patients 3NF

93 Hospital Claims 3NF

94 Example Dimensional Model Design

95 Star Schemas Enrollment, Claims, Prescriptions, Care Management

96 Data Warehouse Bus MatrixBusiness Processes/ Dimension Date Product Promotion Website Customer Supplier Region Sales X Billing Inventory Shipping Market Analysis Partners

97 Kimball vs Inmon Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model. Bottom-Up Approach "...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997.

98 Kimball vs Inmon Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form. Top-Down Approach "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.

99 Kimball vs Inmon Whose methodology is more popular?Dice.com technical job listing board 85,000 job listings, average turn over 14 days Data Scientist postings job listings for Kimball 28-39 for Inmon 27-37 Inmon and Kimball Slides and Feedback at:

100 Cube

101 Cube

102 Trimmed Sample Data Day Weekend Zip State Region Division Group Quantity Thu No 95014 CA West Sporting Goods Skis 300 90210 400 Fri 19090 PA East Bikes 200 20500 DC 600 Sat Yes 100 Sun 800 Clothing Men’s Pants Note Hierarchies: Day > Weekend; Zip > State > Region

103 SQL Analytic StandardsAnalytic Function Oracle 10g SQL Server 2008 MySQL 5.0 DB2 PostgreSQL ANSI SQL:92 ANSI SQL:99 ANSI SQL:2003 ANSI SQL:2008 ANSI SQL:2011 CUBE Y No DENSE_RANK FIRST FIRST_VALUE LAST LAST_VALUE LAG LEAD NTILE RANK RATIO_TO_REPORT ? ROWS UNBOUNDED PRECEDING | CURRENT ROW | FOLLOWIN SQL 2012 NTH_VALUE Oracle Only LISTAGG STUFF(FOR XML) group_concat() STRING_AGG BOTTOM ROWNUM TOP LIMIT FETCH FIRST N AVG COUNT MAX MIN ORDER BY PARTITION BY ROLLUP ROW_NUMBER SUM WITH What are the Aggregate Functions? Define Aggregate Blank Fields and “?” have unknown values

104 SQL Analytic FunctionsSQL Analytics Aggregate: A single value derived from multiple values SUM, COUNT, AVG, MIN, MAX Partition, Group, Window Rank: Sorted order of values RANK, DENSE_RANK, ROW_NUMBER TOP, BOTTOM NTILE, RATIO_TO_REPORT Related Columns: Values of nearby rows and columns FIRST, LAST (FIRST_VALUE, LAST_VALUE) LAG, LEAD Grouping: Dividing large groups into associated small groups ROLLUP, CUBE, LISTAGG