1 Observational data: shifting the paradigm from randomized clinical trials to observational studies. Michal Rosen-Zvi, PhD, Director, Health Informatics, IBM Research. CIMPOD, February 2017
2 "Without the aid of statistics nothing like real medicine is possible." Pierre-Charles-Alexandre Louis (14 April 1787 – 22 August 1872) was a French physician, clinician and pathologist known for his studies on tuberculosis, typhoid fever, and pneumonia; but Louis's greatest contribution to medicine was the development of the "numerical method", a forerunner to epidemiology and the modern clinical trial. A COGNITIVE HEALTHCARE ASSISTANT is achievable when combining advanced statistics with computer technologies.
3 Paradigm shift. "If you find that [a] study was not randomized, we'd suggest that you stop reading it and go on to the next article." [Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine: how to practice and teach EBM. New York: Churchill Livingstone, 1997] Yet in a review of 136 articles in 19 treatment areas [published between 1985 and 1998], the estimates of the effects of treatment in observational studies and in randomized, controlled trials were similar in most areas. N Engl J Med 2000; 342:
4 Pharmaceutical companies' interest in RWE: pharmacovigilance, comparative effectiveness, cohort studies, clinical decision support systems, adherence, drug repurposing.
5 Hospitals' and insurers' top goals for analytics (INFORMATION WEEK, MARCH 2013, "HEALTHCARE ORGANIZATIONS GO BIG FOR ANALYTICS") were: identifying at-risk patients (66%), tracking clinical outcomes (64%), performance measurement and management (64%), and clinical decision making at the point of care (57%). Between 30% and 40% of the respondents also expressed interest in mining data from mobile devices, social networks and unstructured clinical data. Health plan providers focused more on these sources than doctors did.
6 The ingredients: Statistics (descriptive statistics, dimensionality reduction, clustering, causal inference, hypothesis testing); Machine Learning (similarity analytics, predictive analytics, deep learning, reinforcement learning, decision analytics); data sources (textual, image, omic, sensor, behavioral); medical knowledge; psychology, economics, game theory.
7 Machine Learning (Statistics; Data Mining): learning from data samples. Supervised vs. unsupervised/semi-supervised: samples are labeled. Classification (the labels represent association with one of a few classes) vs. regression/ranking. Passive learning (the learner cannot select samples to label) vs. active learning. Batch learning (training is performed independently of the testing) vs. online learning. [Machine learning: probabilistic graphical models and applications to clinical domain, Michal Rosen-Zvi, TLV Univ. 2011/12]
8 Classification Problem Definition. Input: a set X of samples; a set Y of labels, in binary classification usually {0,1} or {-1,1}; a training dataset S = {(x1,y1), (x2,y2), (x3,y3), ..., (xm,ym)}. Output: a hypothesis (prediction rule) h: X → Y that can be used for prediction on new samples from X. The learning algorithm selects a good hypothesis from a predefined hypotheses class H.
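As a toy instance of this definition, take X to be the real numbers, Y = {0,1}, and H the class of threshold classifiers; the names `h` and `train` below are illustrative, not from the slides.

```python
# Toy classification setup: X = reals, Y = {0,1},
# H = threshold classifiers h_t(x) = 1 if x >= t else 0.

def h(t, x):
    """Hypothesis h_t: X -> Y."""
    return 1 if x >= t else 0

def train(S):
    """Learning algorithm: pick the threshold with fewest training mistakes."""
    candidates = [x for x, _ in S]
    return min(candidates, key=lambda t: sum(h(t, x) != y for x, y in S))

S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]  # training set S = {(x_i, y_i)}
t = train(S)            # here t = 0.6, which makes zero training mistakes
```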
9 Risk. A loss function is a measure of classification quality; for example, the 0-1 loss: l(h(x), y) = 1 if h(x) ≠ y, and 0 otherwise. The risk is the expected loss: assuming a distribution D over the data X×Y, the risk L_D(h) = E_(x,y)~D[l(h(x), y)] is the probability of returning a wrong prediction on a sample drawn randomly from D. The learning algorithm aims to find a hypothesis with minimal risk: h* = argmin over h in H of L_D(h).
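A minimal sketch of the 0-1 loss and of the empirical risk (the average loss on a sample, which estimates the expected risk); the sample and hypothesis are made up for illustration.

```python
# 0-1 loss and empirical risk L_S(h) = average loss on the sample S.
def zero_one_loss(y_pred, y_true):
    return 0 if y_pred == y_true else 1

def empirical_risk(h, S):
    """Average 0-1 loss of hypothesis h on the labeled sample S."""
    return sum(zero_one_loss(h(x), y) for x, y in S) / len(S)

h = lambda x: 1 if x > 0 else 0
S = [(-1, 0), (2, 1), (3, 0), (-2, 0)]
risk = empirical_risk(h, S)   # one mistake out of four -> 0.25
```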
10 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.
11 Training vs. Test Error. The hypotheses class H should be rich enough to capture important properties of the data, but too complex a hypotheses class may cause overfitting.
12 Occam's Razor. William of Ockham was a 14th-century English logician, theologian and Franciscan friar. Occam's razor is a guiding principle for explaining phenomena: "Plurality must never be posited without necessity". When considering several explanations of the same phenomenon, choose the simplest one, having the fewest parameters.
13 Bias-Complexity (Bias-Variance) Tradeoff. Two components contribute to the generalization error: Approximation error, due to the finite size of our hypotheses class H; an inherent bias, since H does not necessarily contain the true hypothesis; it decreases as |H| grows. Estimation error, due to the finite training set; the variance increases with the size (complexity) of H and decreases with m, the training set size.
14 Loss Function T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.
15 Noise vs. Bias. Aiming at robustness means reducing the variance of answers. Kahneman, Rosenfield, Gandhi and Blaser showed that a learning algorithm can detect the noisy cases and clean them.
16 Designing a decision support system. Creating a system that recommends the best intervention from a finite set of potential interventions requires the following. Address all aspects of the PICOT format: patient population of interest (P), intervention or area of interest (I), comparison intervention or group (C), outcome (O), and time (T). First, define 'best': typically done by defining the Outcome as binary (good/bad), as a ranked list (different levels of achievement), or as a continuous variable that can be measured some Time after the Intervention of interest. Second, define the Population of interest and, if relevant, the Comparison groups, as well as the features to be used for making the decision, and clean outliers.
17 About AIDS/HIV
18 HIV. At the end of 2013, 35 million people were living with HIV. 70% of the people living with HIV live in Sub-Saharan Africa, and 90% of the children living with HIV live in Africa. AIDS is a disease that has undergone many changes: from a terminal illness to a chronic one, and from a disease associated mainly with specific communities, to the point that Gay-related immune deficiency (GRID) was the name first proposed in 1982 to describe an "unexpected cluster of cases" of what is now known as AIDS.
19 The life cycle of the virusRelevant drugs include Protease Inhibitors Reverse Transcriptase Inhibitors Integrase Inhibitors
20 HIV: EuResist. Data coming from 10 European centers covers the medical records of 65,000 patients over the past 20 years, information on 160,000 therapy regimens provided to the patients, and information on 200 million amino acids of the virus RT and PRO proteins (50,000 x 400). The data is a double-edged sword: on one edge, we can use data from European databases to predict the best treatment and help people; on the other edge, we can take a ground-up approach, combining demographic and geographic data with the medical literature, to prevent HIV, increase awareness, and reduce the number of patients.
21 Short-term model: 4-12 weeks. Standard datum definition (timeline): viral load and genotype measured 0-90 days before a treatment switch (reason for switch, CD4), together with patient demographics (age, gender, race, route of infection), past genotypes, past treatments, and past AIDS diagnosis; the outcome is assessed 4-12 weeks after the switch.
22 Three engines. The Evolutionary Engine uses mutagenetic trees to compute the genetic barrier to drug resistance. The Generative-Discriminative Engine employs a Bayesian network modeling interactions between current and past antiretroviral drugs. The Mixture of Effects Engine includes second- and third-order variable interactions between drugs and mutations.
23 Different prediction algorithms, different results. IBM joined Yale in a study of the status of mother-to-child HIV transmission.
24 Comparison of performances. A comparison of the three engines' predictions of therapy failure or success: where they fail or succeed together, and where there is a single winner. In the training (test) set, 350 (35) failing therapies are predicted to be successful by all three engines; 145 (16) of these achieve a VL measure below 500 copies per milliliter at least once during the course of therapy. Of the remaining 550 (64) failing cases in the training (test) set, 100 (13) have a VL measure below 500 copies per milliliter at least once during the course of therapy. A Fisher's exact test results in a p-value of 4.8x10^-14 (0.011) on the training (test) set. The engines tend to disagree on failing therapies; we found that these therapies are indeed noisier, as they might have been a success in the short term. "Happy families are all alike; every unhappy family is unhappy in its own way." Leo Tolstoy, Anna Karenina, Chapter 1, first line.
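The training-set comparison above is a 2x2 Fisher's exact test: 145 of 350 therapies (predicted successful by all three engines) reach VL < 500, versus 100 of the remaining 550. A self-contained sketch (the helper `fisher_exact_p` is written here from the hypergeometric distribution, not taken from the slides' code):

```python
# Two-sided Fisher's exact test for the table [[a, b], [c, d]],
# computed from the hypergeometric distribution.
from math import comb

def fisher_exact_p(a, b, c, d):
    n = a + b + c + d
    r1, c1 = a + b, a + c          # row-1 and column-1 margins
    def pr(x):                     # P(cell (1,1) = x) under fixed margins
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)
    p_obs = pr(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    # sum the probabilities of all tables at least as extreme as the observed one
    return sum(pr(x) for x in range(lo, hi + 1) if pr(x) <= p_obs * (1 + 1e-9))

p_value = fisher_exact_p(145, 205, 100, 450)
# p_value is on the order of 1e-14, matching the slide's 4.8e-14
```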
25 EuResist partners @ EHR meeting, 27/03/2007Thank You תודה Danke Grazie Köszönöm Tack
26 Designing a decision support system (cont.). The last step can be performed using one of the following approaches: Embed patients in a metric space and recommend an intervention based on similarity. Predict the outcome of each candidate intervention and use the prediction (e.g., likelihood of success in the binary case) to rank recommendations. Predict what the intervention would be, framed as a multi-label challenge; this requires cleansing the data based on outcome. In other words, to predict the physician's choice, one might want to learn only from past good choices, as defined by the outcome.
27 Selection bias Selection bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. Thirty-five percent of published reanalyses led to changes in findings that implied conclusions different from those of the original article about the types and number of patients who should be treated. Ebrahim S, Sohani ZN, Montoya L, Agarwal A, Thorlund K, Mills EJ, Ioannidis JPA. Reanalyses of Randomized Clinical Trial Data. JAMA. 2014;312(10):
28 Multinomial distribution/ Gamma Function
29 Naïve Bayes classifier: words and topics. A set of labeled documents is given: {Cd, wd : d=1,...,D}. Note: the classes are mutually exclusive. Example documents: one labeled "Pet" (Dog, Cat, Milk) and one labeled "Food" (Dry, Bread, Milk, Eat).
30 Simple model for topics. Given the topic, words are independent; the probability of a word w given a topic z is Φ_wz. The joint probability of the corpus (plate notation: class C and words W, Nd words per document, D documents): P({w,C}|Φ) = Π_d P(C_d) Π_n P(w_nd|C_d, Φ).
31 A classification algorithm
32 Evaluation of multi-class classification: the confusion matrix.
           Predicted C=1   Predicted C=2   Predicted C=3
True C=1        20               2               1
True C=2         3              15               -
True C=3         -               6              12
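Standard metrics can be read off such a matrix. In the sketch below the two cells missing from the slide are filled with hypothetical values (marked in comments) purely for illustration.

```python
# 3-class confusion matrix: rows = true class, columns = predicted class.
conf = [
    [20, 2, 1],   # true C=1 (values from the slide)
    [3, 15, 2],   # true C=2 (last cell hypothetical)
    [1, 6, 12],   # true C=3 (first cell hypothetical)
]
n = sum(sum(row) for row in conf)
accuracy = sum(conf[i][i] for i in range(3)) / n                 # diagonal mass
recall = [conf[i][i] / sum(conf[i]) for i in range(3)]           # per true class
precision = [conf[i][i] / sum(conf[j][i] for j in range(3)) for i in range(3)]
```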
33 LDA model (plate notation): hyperparameters α and β; a per-document topic distribution θd; per-topic word distributions Φz (K topics); a topic assignment z and a word w for each of the Nd words in each of the D documents.
34 Sampling in the LDA model. The collapsed Gibbs update rule, for fixed α and β and integrating out θ and Φ: P(z_i = k | z_-i, w) is proportional to (n_k^{w_i} + β)/(n_k + Wβ) times (n_k^{d_i} + α), where the counts n exclude the current token. Sampling provides point estimates of θ and Φ and distributions over the latent variables z.
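The update rule can be turned into a small collapsed Gibbs sampler. This is a didactic sketch (the function name `lda_gibbs` and the toy two-topic corpus are illustrative, not the code behind the slides).

```python
# Minimal collapsed Gibbs sampler for LDA: K topics, W vocabulary words,
# docs given as lists of word ids.
import random

def lda_gibbs(docs, K, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]       # topic counts per document
    nkw = [[0] * W for _ in range(K)]   # word counts per topic
    nk = [0] * K                        # total tokens per topic
    z = []                              # topic assignment of every token
    for d, doc in enumerate(docs):      # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k); ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]             # remove the token's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # unnormalized P(z_i = k | z_-i, w) from the update rule
                p = [(nkw[k][w] + beta) / (nk[k] + W * beta) * (ndk[d][k] + alpha)
                     for k in range(K)]
                r = rng.random() * sum(p)
                k = 0
                while k < K - 1 and r > p[k]:
                    r -= p[k]; k += 1
                z[d][i] = k; ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # point estimate of the topic-word distributions Phi
    return [[(nkw[k][w] + beta) / (nk[k] + W * beta) for w in range(W)]
            for k in range(K)]

# Toy corpus with two clearly separable "topics": words {0,1} vs words {2,3}
docs = [[0, 1, 0, 1]] * 5 + [[2, 3, 2, 3]] * 5
phi = lda_gibbs(docs, K=2, W=4)
```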
35 The generative process. Assume authors A1 and A2 collaborate and produce a paper. A1 has a multinomial topic distribution θ1; A2 has a multinomial topic distribution θ2. For each word in the paper: sample an author x (uniformly) from {A1, A2}; sample a topic z from θx; sample a word w from the multinomial topic distribution.
36 Inference in the author-topic model. Estimate x and z by Gibbs sampling (assignments of each word to an author and a topic). Estimation is efficient: linear in the data size. From each sample, infer by point estimation the author-topic distributions (Θ) and the topic-word distributions (Φ).
37 Data and Topic Models Author-topic-word model for 70k authors and 300 topics built from 162,489 Citeseer abstracts Each word in each document assigned to a topic For the subset of 131,602 documents that we know the year Group documents by year Calculate the fraction of words each year assigned to a topic Plot the resulting time-series, 1990 to 2002 Caveats Data set is incomplete (see next slide) Variability (noise) will be high for 2001 and 2002
38
39 Trends within Database Research
40 NLP and IR
41 Rise in Web/Mobile topics
42 (Not so) Hot Topics
43 Vision and Robotics
44 Decline in programming languages, OS, ….
45 Polya's Urn. George Pólya (Hungarian: Pólya György; December 13, 1887 – September 7, 1985) was a Hungarian mathematician. He was a professor of mathematics from 1914 to 1940 at ETH Zürich and from 1940 to 1953 at Stanford University. He made fundamental contributions to combinatorics, number theory, numerical analysis and probability theory. He is also noted for his work in heuristics and mathematics education.
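A quick simulation of the urn scheme named after Pólya: start with one ball of each color; each drawn ball is returned together with an extra ball of the same color. The fraction of one color converges, but to a random limit that differs from run to run (the function name `polya_urn` is illustrative).

```python
# Polya's urn: reinforcement sampling with two colors, 0 and 1.
import random

def polya_urn(draws, seed=0):
    rng = random.Random(seed)
    urn = [0, 1]                      # one ball of each color to start
    for _ in range(draws):
        urn.append(rng.choice(urn))   # draw, replace, and add a copy
    return urn.count(0) / len(urn)    # final fraction of color 0

fractions = [polya_urn(1000, seed=s) for s in range(5)]
# different runs converge to very different limiting fractions
```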
46 Binary Case
47 Metric (distance function): non-negativity, identity, symmetry, triangle inequality. The Kullback-Leibler divergence is not a metric: it is non-negative and satisfies identity, but it is asymmetric and violates the triangle inequality.
48 K-means. Pick an initial set of k means {m_1, ..., m_k}. Iterate until convergence over two steps: assignment (attach each point to its nearest mean) and update (move each mean to the centroid of the points assigned to it).
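The two-step iteration above, sketched in one dimension with pure Python (the data and initial means are made up for illustration):

```python
# k-means in 1-D: alternate assignment and update steps.
def kmeans(points, means, iters=100):
    for _ in range(iters):
        # assignment step: attach each point to its nearest mean
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[i].append(p)
        # update step: move each mean to the centroid of its cluster
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means

means = kmeans([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], means=[0.0, 5.0])
# converges to roughly [1.0, 10.0]
```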
49 Jensen-Shannon Divergence: a symmetric, smoothed version of the Kullback-Leibler divergence, comparing each distribution to their average M = (P+Q)/2: JS(P,Q) = KL(P||M)/2 + KL(Q||M)/2.
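A minimal sketch contrasting the two divergences on a pair of discrete distributions: KL is asymmetric, JS is symmetric by construction.

```python
# KL divergence (base 2) and the Jensen-Shannon divergence built from it.
from math import log2

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # the average distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.9, 0.1], [0.5, 0.5]
# kl(p, q) != kl(q, p), while js(p, q) == js(q, p)
```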
50 Retrospective study of the effectiveness of a treatment.
                 Z=1 (Old treatment)   Z=0 (New treatment)
Y=1 (Success)          210                   262
Y=0 (Failure)          201                   327
Success ratio         51.1%                 44.5%
The average treatment effect: E[Y(Z=1) - Y(Z=0)] = P(Y=1|Z=1)*1 + P(Y=0|Z=1)*0 - [P(Y=1|Z=0)*1 + P(Y=0|Z=0)*0]
51 Simpson's Paradox. Stratifying the same data by severity:
                Z=1 (Old): Y=1, Y=0, success ratio    Z=0 (New): Y=1, Y=0, success ratio
X1=1 (Severe)        46,  86,  34.9%                      136, 252,  35.1%
X1=0 (Mild)         164, 115,  58.8%                      126,  75,  62.7%
Pooled, as before:
                 Z=1 (Old treatment)   Z=0 (New treatment)
Y=1 (Success)          210                   262
Y=0 (Failure)          201                   327
Success ratio         51.1%                 44.5%
52 The average treatment effect. Ignoring the confounder: E[Yi(1) − Yi(0)] = P(Y=1|Z=1) − P(Y=1|Z=0) = 0.511 − 0.445 = 0.066. Knowing about the confounder X1, adjust by stratum: E[Yi(1) − Yi(0)] = Σ_x P(X1=x) [P(Y=1|Z=1, X1=x) − P(Y=1|Z=0, X1=x)] = 0.52*(0.349 − 0.351) + 0.48*(0.588 − 0.627) ≈ −0.020: the sign of the estimated effect flips.
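Both estimates can be computed directly from the stratified counts on the Simpson's-paradox slide; the dictionary layout below is just one convenient encoding of that table.

```python
# severity stratum -> {z: (successes, failures)} with z=1 old, z=0 new treatment
severe = {1: (46, 86), 0: (136, 252)}
mild = {1: (164, 115), 0: (126, 75)}

def rate(group, z):
    s, f = group[z]
    return s / (s + f)

# naive estimate: pool the strata and compare success ratios
pooled = {z: (severe[z][0] + mild[z][0], severe[z][1] + mild[z][1]) for z in (0, 1)}
naive = rate(pooled, 1) - rate(pooled, 0)          # about +0.066

# adjusted estimate: weight the per-stratum differences by P(X1)
n = sum(s + f for g in (severe, mild) for s, f in g.values())
adjusted = sum(
    (sum(s + f for s, f in g.values()) / n) * (rate(g, 1) - rate(g, 0))
    for g in (severe, mild)
)                                                  # about -0.020: sign flips
```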
53 Naive Bayes. Estimated conditional probabilities P(x_i=1|y) for the ten features, and P(Z=1|y):
y=1: 0.386, 0.498, 0.481, 0.520, 0.536, 0.468, 0.542, 0.487, 0.521, 0.528; P(Z=1|y=1) = 0.445
y=0: 0.640, 0.496, 0.519, 0.456, 0.460, 0.470, ...; P(Z=1|y=0) = 0.381
The model predicts P(Y=1|Z,{X}); graphical model: Y is the parent of X1, ..., XN and Z.
54 Naïve Bayes classifier. P(Y=1|Z=1) = P(Z=1,Y=1)/P(Z=1) = P(Z=1|Y=1)P(Y=1)/P(Z=1) = 0.445*0.5/(0.445*0.5 + 0.381*0.5) = 0.539. P(Z=1|X1=1) = P(Z=1,X1=1)/P(X1=1) = [P(Z=1,X1=1|Y=1)P(Y=1) + P(Z=1,X1=1|Y=0)P(Y=0)] / [P(X1=1|Y=1)P(Y=1) + P(X1=1|Y=0)P(Y=0)] = 0.503.
55 Sigmoid function: P(Y=1) = 1/(1+exp(-W·X)). Xi = 0/1: drug i was administered no/yes; Z = 0/1: obtained new/old treatment; Y = 0/1: failed/successful treatment. (Graphical model: Drug 1, Drug 2, Drug 3, ..., Drug N and the treatment feed into the outcome.)
56 Code generating the data (Matlab/Octave):
X = randi([0,1], 1000, 10);   % 1000 patients, 10 binary drug covariates
WZ = [ ];                     % treatment-model weights (values elided on the slide)
WY = [ ];                     % outcome-model weights (values elided on the slide)
tZ = mtimes(WZ, X');
tY = mtimes(WY, X');
pZ = 1 ./ (1 + exp(-1*tZ));   % P(Z=1|X), sigmoid
pY = 1 ./ (1 + exp(-1*tY));   % P(Y=1|X), sigmoid
Z = binornd(1, pZ);
Y = binornd(1, pY);
for i = 1:10                  % co-occurrence counts per drug
  old1 = X(:,i)';
  cooc(i,1) = length(find(old1==1 & Z==1));
  cooc(i,2) = length(find(old1==1 & Z==0));
  cooc(i,3) = length(find(old1==1 & Y==1));
  cooc(i,4) = length(find(old1==1 & Y==0));
  cooc(i,5) = length(find(old1==1 & Y==1 & Z==1));
  cooc(i,6) = length(find(old1==1 & Y==0 & Z==1));
  cooc(i,7) = length(find(old1==1 & Y==1 & Z==0));
  cooc(i,8) = length(find(old1==1 & Y==0 & Z==0));
end
57 True model probabilities. P(Y=1|Z=1) = Sum_x P(Y=1|x) P(Z=1|x) P(x) / Sum_x P(Z=1|x) P(x), with P(x) uniform over the 2^10 binary vectors. The average treatment effect for the true model is 0: the outcome does not depend on the value of Z. Computing the numerator and denominator by enumeration (Matlab/Octave):
NumSum = 0; DeNumSum = 0;
for n = 0:1023                          % enumerate all binary vectors x
  x = bitget(n, 1:10);
  pZ = 1/(1 + exp(-1*mtimes(WZ, x')));
  pY = 1/(1 + exp(-1*mtimes(WY, x')));
  DeNumSum = DeNumSum + pZ;             % sum_x P(Z=1|x)
  NumSum = NumSum + pY*pZ;              % sum_x P(Y=1|x) P(Z=1|x)
end
58 Propensity score: the probability of a unit being assigned to a particular treatment given a set of observed covariates, P(Z=1|X). If the treatment and control groups have identical propensity score distributions, then all the covariates will be balanced between the two groups. The "no unmeasured confounders" assumption: all variables that affect treatment assignment and outcome have been measured. In the example data there is a big difference between X1=1 and X1=0: P(Z=1|X1=1) = , P(Z=1|X1=0) = . Given two patients with Xi, i=2:10, identical and X1 different, the treated and untreated groups are unbalanced.
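One simple way to estimate a propensity score is stratification on a discrete confounder. The sketch below applies this to the counts from the Simpson's-paradox slide (so the numbers here are for that table, not the slide's own simulated dataset, whose values are elided above).

```python
# Empirical propensity score e(x1) = P(Z=1 | X1=x1), estimated from counts.
# X1=1 (severe): Z=1 in 46+86 of 520 units; X1=0 (mild): Z=1 in 164+115 of 480.
counts = {1: {"z1": 46 + 86, "total": 46 + 86 + 136 + 252},
          0: {"z1": 164 + 115, "total": 164 + 115 + 126 + 75}}

propensity = {x1: c["z1"] / c["total"] for x1, c in counts.items()}
# severe patients are much less likely to receive the old treatment,
# so the treated and untreated groups are unbalanced on X1
```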
59 Inverse Probability of Treatment Weighting using the propensity score: e_i = P(Z=1|X_i). The averaged treatment effect is estimated by weighting treated units by 1/e_i and untreated units by 1/(1-e_i). For the generating model above, each stratum x contributes P(Y=1|x)P(Z=1|x)/P(Z=1|x) - P(Y=1|x)[1 - P(Z=1|x)]/[1 - P(Z=1|x)] = 1/(1+exp(-WY·x)) - 1/(1+exp(-WY·x)) = 0, so the averaged treatment effect is 0, as expected.
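A sketch of the IPTW estimator on synthetic data in the spirit of the slides' simulation: the outcome depends only on a confounder, so the true average treatment effect is zero, and the weighted estimate lands near zero while the naive comparison is biased. All names and parameters here are illustrative.

```python
# IPTW on synthetic data with a single binary confounder and no true effect.
import random

rng = random.Random(1)
data = []
for _ in range(20000):
    x = rng.random() < 0.5                         # binary confounder
    e = 0.7 if x else 0.3                          # true propensity P(Z=1|X)
    z = rng.random() < e
    y = rng.random() < (0.6 if x else 0.4)         # outcome depends on X only
    data.append((int(z), int(y), e))

n = len(data)
# IPTW: weight treated by 1/e_i and untreated by 1/(1-e_i)
ate_iptw = (sum(z * y / e for z, y, e in data) / n
            - sum((1 - z) * y / (1 - e) for z, y, e in data) / n)
# naive comparison of raw group means, biased by the confounder
naive = (sum(z * y for z, y, e in data) / sum(z for z, y, e in data)
         - sum((1 - z) * y for z, y, e in data) / sum(1 - z for z, y, e in data))
```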
60 Propensity score matching. Calculate the propensity score per unit (patient). Find units in the treated/intervened and untreated/no-intervention groups that have similar scores. Generate a new dataset with two groups whose participants are selected by matched propensity scores; typically the final dataset is smaller than the original. Use the newly generated data to calculate the average treatment effect.
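The steps above can be sketched as 1:1 nearest-neighbor matching without replacement; the tiny dataset and the helper name `match_and_estimate` are made up for illustration.

```python
# Nearest-neighbor propensity-score matching, 1:1 without replacement.
def match_and_estimate(treated, controls):
    """treated/controls: lists of (propensity_score, outcome) pairs."""
    matched_diffs = []
    available = list(controls)
    for e_t, y_t in treated:
        # pick the control unit with the closest propensity score
        e_c, y_c = min(available, key=lambda c: abs(c[0] - e_t))
        available.remove((e_c, y_c))        # match without replacement
        matched_diffs.append(y_t - y_c)
    # average outcome difference over the matched pairs
    return sum(matched_diffs) / len(matched_diffs)

treated = [(0.8, 1), (0.6, 1), (0.3, 0)]
controls = [(0.79, 1), (0.58, 0), (0.35, 0), (0.1, 0)]
ate = match_and_estimate(treated, controls)
```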
61 Causal concepts. The causal effect of a treatment/intervention involves the comparison between the potential outcomes of the same unit (e.g., a patient subjected to the intervention versus not), with each treatment/intervention compared independently under the same conditions and at the same time. Note: the definition depends on the potential outcomes, but it does not depend on which outcome was actually observed. The causal effect is the comparison of the potential outcomes for the same unit under the same post-intervention conditions and time.
62 Estimation of causal effect requires understanding of the assignment mechanism; a consistent model of the data generation enables detection of causal effects. Causal estimands are comparisons of the potential outcomes that would have been observed under different exposures of units to treatments/interventions. [Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Imbens, Guido W.; Rubin, Donald B.]
63 Medicine begins with storytellingPatients tell stories to describe illness; doctors tell stories to understand it. Science tells its own story to explain diseases
64 AI-based tools being used by physicians. AIDS: Stanford HIVDB, EuResist, and more. Heart: first FDA approval for clinical cloud-based deep learning in healthcare (deep learning, 1000 images, supporting radiologists). Septic alert (personalized prediction of severe sepsis). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC / https://en.wikipedia.org/wiki/John_Snow#Cholera The data deluge. The medical pendulum (estrogen replacement for women after menopause). Overtesting. "Why most published research findings are false", PLOS Medicine. Leo Anthony Celi: a "datathon" model to support cross-disciplinary collaboration.
65 Open Challenges: causality; high-dimensional, very heterogeneous data; ever-learning systems; privacy preservation.
66 Hippocratic Oath. I swear by Apollo The Healer, by Asclepius, by Hygieia, by Panacea, and by all the Gods and Goddesses, making them my witnesses, that I will carry out, according to my ability and judgment, this oath and this indenture. To hold my teacher in this art equal to my own parents; to make him partner in my livelihood; when he is in need of money to share mine with him; to consider his family as my own brothers, and to teach them this art, if they want to learn it, without fee or indenture; to impart precept, oral instruction, and all other instruction to my own sons, the sons of my teacher, and to indentured pupils who have taken the physician's oath, but to nobody else. I will use treatment to help the sick according to my ability and judgment, but never with a view to injury and wrong-doing. Neither will I administer a poison to anybody when asked to do so, nor will I suggest such a course. Similarly I will not give to a woman a pessary to cause abortion. But I will keep pure and holy both my life and my art. I will not use the knife, not even, verily, on sufferers from stone, but I will give place to such as are craftsmen therein. Into whatsoever houses I enter, I will enter to help the sick, and I will abstain from all intentional wrong-doing and harm, especially from abusing the bodies of man or woman, bond or free. And whatsoever I shall see or hear in the course of my profession, as well as outside my profession in my intercourse with men, if it be what should not be published abroad, I will never divulge, holding such things to be holy secrets. Now if I carry out this oath, and break it not, may I gain for ever reputation among all men for my life and for my art; but if I transgress it and forswear myself, may the opposite befall me.