
1 Can We Make Producing Practice Guidelines More Efficient?
…and if we can, how will we know if the methods are trustworthy?
Paul G. Shekelle, MD, PhD
West Los Angeles Medical Center
RAND Health

2 Disclosures
- Research funding: Agency for Healthcare Research and Quality; Department of Veterans Affairs; American College of Physicians
- Consulting: ECRI – National Guidelines Clearinghouse
- Royalties: UpToDate

3 Institute of Medicine: Trustworthy Guidelines Criteria
- Be based on a systematic review
- Be developed by a multidisciplinary panel of experts
- Consider important patient subgroups
- Be based on an explicit and transparent process
- Clearly explain alternative care options and health outcomes; rate quality of evidence and strength of evidence (SOE)
- Be reconsidered and revised as appropriate

4 Institute of Medicine: Trustworthy Guidelines Criteria
- Be based on a systematic review
- Be developed by a multidisciplinary panel of experts
- Consider important patient subgroups
- Be based on an explicit and transparent process
- Clearly explain alternative care options and health outcomes; rate quality of evidence and SOE
- Be reconsidered and revised as appropriate
Likely targets for increasing efficiency

5 Using Machine Learning to Perform Update Searches
This method takes advantage of the decisions made in the literature search-and-screening process from the original systematic review

6 Using Machine Learning to Perform Update Searches
From the original systematic review, we need:
- The citations from the original search
- The studies included as evidence in the original review

7 Using Machine Learning to Perform Update Searches
From the original systematic review, we need:
- The citations from the original search: N = 5,000
- The studies included as evidence in the original review: N = 50
- Which leaves N = 4,950 not included as evidence

8 Using Machine Learning to Perform Update Searches
From the original systematic review, we need:
- The citations from the original search: N = 5,000
- N = 50 included as evidence: the "true positives"
- N = 4,950 not included as evidence: the "true negatives"
(A labeling sketch follows below.)
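To make the labeling step concrete, here is a minimal sketch of how the original review's decisions become training labels; the IDs and titles are hypothetical, purely for illustration:

```python
# Sketch: turn the original review's inclusion decisions into training labels.
# All IDs and titles here are hypothetical, purely for illustration.

# Citations retrieved by the original search (toy stand-ins for N = 5,000).
original_search = {
    "101": "Alendronate versus placebo for fracture prevention",
    "102": "Editorial commentary on bone health awareness",
    "103": "Risedronate randomized trial of vertebral fractures",
}

# IDs of the studies included as evidence (toy stand-ins for N = 50).
included_ids = {"101", "103"}

# 1 = included as evidence ("true positive"), 0 = not ("true negative").
labels = {cid: int(cid in included_ids) for cid in original_search}
# -> {"101": 1, "102": 0, "103": 1}
```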

9 Using Machine Learning to Perform Update Searches
- Compare the "true positives" to the "true negatives" using the "bag of words" approach
- Construct a prediction formula
- Apply the formula to the updated search to produce a rank order of citations by probability of being evidence
(A sketch of these steps follows below.)
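A minimal sketch of the ranking pipeline itself, assuming each citation is available as a title/abstract string; TF-IDF and logistic regression are used here as stand-ins, since the slides do not name a specific classifier:

```python
# Sketch of the "bag of words" -> prediction formula -> ranked list pipeline.
# Illustrative only; not the authors' actual implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled citations from the ORIGINAL review (toy stand-ins):
# 1 = included as evidence, 0 = not included.
original_texts = [
    "alendronate versus placebo for fracture prevention in osteoporosis",
    "editorial commentary on bone health awareness campaigns",
    "risedronate randomized trial of vertebral fracture outcomes",
    "cost analysis of DXA screening programs",
]
original_labels = [1, 0, 1, 0]

# "Bag of words": represent each citation by weighted word frequencies.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(original_texts)

# "Prediction formula": any probabilistic classifier can stand in here.
model = LogisticRegression(max_iter=1000).fit(X_train, original_labels)

# Apply the formula to the UPDATE search; rank citations by predicted
# probability of being evidence, so reviewers screen from the top down.
update_texts = [
    "zoledronic acid and hip fracture incidence: a randomized trial",
    "letter regarding vitamin D supplementation",
]
probs = model.predict_proba(vectorizer.transform(update_texts))[:, 1]
ranked = sorted(zip(probs, update_texts), reverse=True)
```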

10 Using Machine Learning to Perform Update Searches
How well does the method work? We tested our first version of machine learning retrospectively on two AHRQ comparative effectiveness review updates: drugs for low bone density (LBD) and atypical antipsychotic drugs (AAP).
Dalal, S. R., Shekelle, P. G., Hempel, S., Newberry, S. J., Motala, A., & Shetty, K. D. (2012). A pilot study using machine learning and domain knowledge to facilitate comparative effectiveness review updating. Medical Decision Making.

11–12 [Results figures from the retrospective evaluation]

13 Using Machine Learning to Perform Update Searches
What are the false negatives?
- For LBD, at a probability threshold of 0.02, there were 26 false negatives:
  - 25 were non-RCTs (meta-analyses, case-control studies, retrospective analyses of claims databases, and analyses of government registries)
  - 1 RCT (about raloxifene) was a false negative because it was tagged with "pharmacology" rather than something more specific
- For AAP, at a probability threshold of 0.01, only one false negative occurred:
  - It was missed because it was tagged as a "letter"
  - It would have been detected in reference mining of the other included studies

14 Using Machine Learning to Perform Update Searches
We applied this prospectively to (yet another) low bone density update search:
- We compared 3 different update search strategies
- We used a new machine learning method, one that does not rely on index terms and can therefore be used for databases other than MEDLINE
- We developed the machine-learned predictions using the original review as the training set, and chose the threshold such that sensitivity = 1.0 for the original search (see the sketch below)
- For pragmatic reasons, we used only 1 (experienced) reviewer for each search
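One simple way to realize "sensitivity = 1.0 for the original search" is to set the screening threshold at the lowest predicted probability among the original review's included studies; a sketch with illustrative numbers:

```python
# Sketch: calibrate the threshold so every article included as evidence in
# the ORIGINAL review scores at or above it (sensitivity = 1.0).
# The probability and label arrays are illustrative.

def sensitivity_one_threshold(probs, labels):
    """Largest threshold that still captures all original includes."""
    return min(p for p, y in zip(probs, labels) if y == 1)

probs = [0.91, 0.40, 0.03, 0.77]  # predicted probability of being evidence
labels = [1, 0, 0, 1]             # 1 = included in the original review
threshold = sensitivity_one_threshold(probs, labels)  # -> 0.77

# In the update search, only citations scoring >= threshold are screened.
update_probs = [0.88, 0.12, 0.79]
to_screen = [p for p in update_probs if p >= threshold]  # -> [0.88, 0.79]
```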

15 Using Machine Learning to Perform Update Searches
The three update search strategies:
- "Full Monty": N = 12,131 titles
- "Machine Learning": N = 2,112 titles
- "Surveillance Method" (limited journals): N = 2,843 titles

16 Using Machine Learning to Perform Update Searches
- "Full Monty": N = 12,131 titles
- "Machine Learning": N = 2,112 titles
- "Surveillance Method" (limited journals): N = 2,843 titles
34 articles were included in the update

17 Using Machine Learning to Perform Update Searches
How many true positives were identified? (34 articles were included in the update)
- "Full Monty" (N = 12,131): 32 identified
- "Machine Learning" (N = 2,112): 33 identified
- "Surveillance Method" (limited journals, N = 2,843): 14 identified

18 Using Machine Learning to Perform Update Searches
How many true positives were missed?
- "Full Monty" (N = 12,131): 2 titles, rejected at the title-screening stage
- "Machine Learning" (N = 2,112): 1 title, included as supporting evidence for overuse of DXA scans
- "Surveillance Method" (limited journals, N = 2,843): 20 titles, all published in journals not included in the search; none were pivotal studies changing conclusions

19 Using Machine Learning to Perform Update Searches
Another retrospective application: update for gout management, with 4 key questions:
- Treatment of acute gout
- Treatment of hyperuricemia
- Monitoring
- Effects of discontinuation
Original report, initial search: N = 6,502 titles (acute gout N = 25; hyperuricemia N = 19; monitoring N = 8; discontinuation N = 3)

20 Using Machine Learning to Perform Update Searches
Update gout search: N = 1,134 titles

21 Using Machine Learning to Perform Update Searches
Update gout search: N = 1,134 titles
11 articles were included in the update

22 Using Machine Learning to Perform Update Searches
How many true positives were identified?
Of the 11 articles included in the update (from N = 1,134 titles), 10 were identified, with an 83% reduction in the number of titles screened

23 Using Machine Learning to Perform Update Searches
How many true positives were missed?
The 1 title missed was a research letter assessing the association of HLA-B*58:01 and allopurinol use with severe skin reactions, an association already reported in the original report

24 Using Machine Learning to Perform Update Searches
Conclusions:
- Machine learning has substantial promise as a method to increase the efficiency of update searches; it can decrease screening by > 50%
- It is currently most useful for topics that generate larger numbers of hits
- Reference mining of included studies should still be performed
- Some false negatives will occur, but in our experience none have been pivotal
- I expect it will get better with each subsequent update cycle

25 Attempts to Increase Efficiency of Collecting Multidisciplinary Expert Input
- Methodologically, this is much more challenging than studies of efficiencies in literature searches, due to the inherent variability in any expert panel process
- Estimates of the chance-corrected agreement between technical expert panels assessing the same topic are in the kappa = 0.5–0.8 range (see the formula below)
- Hence any comparison of the standard method to a more efficient method will have a hard time controlling for the variability inherent in the standard method
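"Chance-corrected agreement" here is Cohen's kappa, which discounts the agreement two panels would reach by chance alone:

```latex
% Cohen's kappa: chance-corrected agreement between two raters or panels.
% p_o = observed proportion of agreement; p_e = agreement expected by chance.
\kappa = \frac{p_o - p_e}{1 - p_e}
```

A kappa of 0.5–0.8 therefore means the panels agree well beyond chance but far from perfectly, which is the variability referred to above.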

26 Attempts to Increase Efficiency of Collecting Multidisciplinary Expert Input
Alternatives to the traditional face-to-face (f-t-f) method of collecting expert judgment:
- Formal ratings done through the mail, compared to f-t-f:
  - Kappa = 0.5–0.7 for hysterectomy and coronary revascularization
  - Kappa = 0.6 for cataract surgery
- Online moderator-facilitated formal ratings:
  - Kappa = 0.4 for 1 (very hard) topic
  - Drop-off in participants between rounds (66%, 62%, 87%; 36% participation of original invitees; 54% of initial respondents participated in all three rounds)
References:
Washington, D. L., Bernstein, S. J., Kahan, J. P., Leape, L. L., Kamberg, C. J., & Shekelle, P. G. (2003). Reliability of clinical guideline development using mail-only versus in-person expert panels. Medical Care, 41(12).
Tobacman, J. K., Scott, I. U., Cyphert, S. T., & Zimmerman, M. B. (2001). Comparison of appropriateness ratings for cataract surgery between convened and mail-only multidisciplinary panels. Medical Decision Making, 21(6).
Khodyakov, D., Hempel, S., Rubenstein, L., Shekelle, P., Foy, R., Salem-Schatz, S., ... & Dalal, S. (2011). Conducting online expert panels: a feasibility and experimental replicability study. BMC Medical Research Methodology, 11(1), 1.

27 Attempts to Increase Efficiency of Collecting Multidisciplinary Expert Input
- In our work using formal expert panel processes to develop performance measures, we often observe situations where a first-round, private vote produces near-consensus that a particular process should be done, but the panel becomes substantially more conservative following the f-t-f discussion, as the experts consider all the "exceptions-to-the-rule"
- Hence, I believe it likely that non-f-t-f methods will perform differently for judgments requiring re-affirmation of the status quo or only minor changes to an existing recommendation, compared to the development of entirely new recommendations

28 Attempts to Increase Efficiency of Collecting Multidisciplinary Expert Input
Conclusions:
- Expert input probably does not always have to be gathered via a f-t-f meeting
- We have used teleconferences, online methods such as SurveyMonkey, and moderator-facilitated methods such as ExpertLens, with apparent success

29 Attempts to Increase Efficiency of Collecting Multidisciplinary Expert Input
Conclusions:
- I believe non-f-t-f methods are most trustworthy when they are used to:
  - Re-affirm existing guideline statements
  - Make minor modifications, such as adding a drug to, or removing a drug from, existing guideline statements
  - Add or subtract harms
- I believe f-t-f methods, or at least teleconference methods, will still be required for major changes in guideline recommendations

30 We Will Need New Criteria to Assess Whether Updated Guidelines Are Trustworthy
Imagine the following:
- A developer commits to a "continuously updated guideline" or "living guideline"
- Monthly or quarterly searches are done for new relevant evidence
- New evidence is sent to the technical expert panel for their review
- The technical experts re-affirm the validity of the existing guideline recommendations or modify the recommendations to reflect new evidence

31 We Will Need New Criteria to Assess Whether Updated Guidelines Are Trustworthy
Imagine the following:
- A developer commits to a "continuously updated guideline" or "living guideline"
- Monthly or quarterly searches are done for new relevant evidence
- New evidence is sent to the technical expert panel for their review
- The technical experts re-affirm the validity of the existing guideline recommendations or modify the recommendations to reflect new evidence
Is this equivalent to "being based on a systematic review"?
- What kind of search was performed? A full search? A surveillance search? A machine-learned search?
- Were titles screened in duplicate?

32 We Will Need New Criteria to Assess Whether Updated Guidelines Are Trustworthy
Imagine the following:
- A developer commits to a "continuously updated guideline" or "living guideline"
- Monthly or quarterly searches are done for new relevant evidence
- New evidence is sent to the technical expert panel for their review
- The technical experts re-affirm the validity of the existing guideline recommendations or modify the recommendations to reflect new evidence
What was sent to the technical experts?
- The original studies? A critical appraisal of the original studies?
- How was this new evidence presented compared to the existing evidence?

33 We Will Need New Criteria to Assess Whether Updated Guidelines Are Trustworthy
Imagine the following:
- A developer commits to a "continuously updated guideline" or "living guideline"
- Monthly or quarterly searches are done for new relevant evidence
- New evidence is sent to the technical expert panel for their review
- The technical experts re-affirm the validity of the existing guideline recommendations or modify the recommendations to reflect new evidence
- How was the input collected?
- How many experts participated in all phases of the process?

34 Can We Make Producing Practice Guidelines More Efficient?
Conclusions:
- Efficiencies are possible in the methods of updating guidelines, with regard to both searching for new evidence and collecting expert input
- We need new IOM-like criteria to assess whether the new methods are trustworthy
- It is possible to validate new methods for searching, and I believe only validated methods are trustworthy
- Non-f-t-f methods of collecting expert input are much harder to validate, but I believe they should, at a minimum, be described in terms of the method used and how many experts contributed across all rounds of deliberation (and how this compares to the process used to develop the original guideline)
