April 6, 2017 – NADDI Conference, Cornell University


1 Capturing Metadata Early In The Research Data Lifecycle
Barry T. Radler, PhD, University of Wisconsin-Madison Institute on Aging
My presentation originates from my efforts to convert my local survey research center staff (at UW) into DDI true believers. The goal was to report progress made on creating a DDI-compliant Word template for survey instrument development. What follows is not so much a report on results as musings and observations about our experience attempting to capture metadata early in the research data lifecycle.

2 Overview
Background
The Ideal: capture (variable-level) metadata earlier in the lifecycle
DDI in Theory
Survey Metadata Capture in Practice
UW Survey Center DDI-Word instrument template
Conclusions
Brief review of the presentation: the background and merits of capturing metadata early (really referring to variable-level metadata, the guts of a data capture instrument or tool); how DDI is meant to facilitate metadata capture; how survey metadata is captured in practice; the UWSC's experience creating a Word template for survey instrument creation; and conclusions.

3 Background

4 Background: MIDUS
MIDUS DDI Portal – http://midus.colectica.org
Longitudinal multi-disciplinary study of health/well-being; complex amount of data; wide secondary usage through ICPSR; DDI facilitates wide use.
I am a researcher and data manager for MIDUS, a longitudinal and multi-disciplinary study of aging. MIDUS takes an integrative approach to studying health and aging by combining traditional survey assessments with laboratory protocols, producing a variety of socio-demographic, health, cognitive, biomarker, and neuroscience data – we've even dabbled with genetics. Because we are funded by NIA, we share our data products freely through ICPSR, which has resulted in a large audience of secondary users. This situation in turn has placed a premium on a robust approach to metadata, and MIDUS has been producing DDI codebooks for about 12 years. The crowning achievement of those efforts is a DDI-driven portal where users can download customized MIDUS datasets and documentation. I encourage everyone to explore it.

5 The Ideal
The portal was created at the tail end of the MIDUS research data lifecycle, appending variable-level metadata from our datasets and cobbling together other metadata from a variety of sources.

6 Metadata-driven data capture: EDDI 2016 presentations
Archivist & Mapper: Simplifying and Modernising Questionnaire Entry - Will Poynter
Questionnaire Generator - Guillaume Duffes
Rich Metadata from the Start - Oliver Hopt
The DASISH Questionnaire Design and Documentation Tool – Functionalities and Examples from the Tool - Benjamin Beuster, Hilde Orten
Question Banks, Reusability, and DDI - Dan Smith
Steps towards a Single Point of Access for Survey Questions across Europe: The Euro Question Bank Project - Wolfgang Zenk-Möltgen, Azadeh Mahmoud Hashemi
Document Questionnaires and Datasets with DDI: A Hands-On Introduction with Colectica - Jeremy Iverson, Dan Smith
The recognition of the utility of early metadata capture seems to be evolving within the DDI community: at last December's European DDI conference, there were more presentations about using DDI to drive instrument development and data capture than about any other single topic. Those presentations are listed here. In my experience, this hasn't always been the case at DDI conferences; they've tended to be heavy on technical details like the schema structure and properties, and how to harvest metadata from later stages of the lifecycle.

7 Capturing Metadata Earlier in Lifecycle
"Every activity in the data life cycle should be documented as it occurs from conceptualization to publication." – DDI Long-term Infrastructure Manifesto (forthcoming)
DDI 3 "Lifecycle"
The goal is laudable, notable, and, I think, central to the future success of the standard. This is a quote from the DDI Long-term Infrastructure Manifesto. DDI metadata ideally should be captured at each stage. This is a graphic depiction of the research data lifecycle, generated by the DDI Alliance in the context of developing DDI 3, also known as DDI Lifecycle.

8 Leveraging Metadata Earlier in Lifecycle
Capture study and instrument design metadata—once—at time of occurrence or creation
More efficient and easier to capture information about the research workflow at the time of its occurrence rather than after the fact
Metadata capture not realized at time of occurrence or creation leads to information loss
Potentially employ metadata to drive survey administration
"Leveraging" means realizing the advantages of capturing metadata early and often, as it occurs or as it is created. This would lead to less time and money wasted repeating the process of appending metadata at different points along the lifecycle. If different actors are reproducing the same metadata throughout the research data lifecycle, that is inefficient. "After-the-fact" documentation requires substantial resources and typically leads to a considerable amount of information loss and sparsely documented data. Finally, metadata captured early in the research process could be used to drive subsequent survey activities.

9 Increased Efficiency of Metadata Production
Here's a graphic illustration that shows the peril and promise of metadata production. This was developed by the DDI Marketing committee (along with Jon Johnson) in the past year when we were pitching DDI to survey research organizations at AAPOR. The graphic highlights the problem of redundant metadata creation in survey research. It shows the different stakeholders and their specific responsibilities or roles throughout the research data lifecycle. Starting at the upper left, one can follow the sequence of a typical survey research product. Each yellow arrow indicates an information exchange between actors. But if systems, software, and actors do not exchange that information efficiently (by, say, using a standard), these transitions become opportunities for a poor hand-off that can result in information loss.

10 Data Documentation Initiative in Theory
Enter: DDI, the panacea.

11 The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. DDI is a free standard that can document and manage different stages in the research data lifecycle, such as conceptualization, collection, processing, distribution, discovery, and archiving. Documenting data with DDI facilitates understanding, interpretation, and use -- by people, software systems, and computer networks. DDI is intended to be a standard system to facilitate metadata capture. In addition to carefully documenting each of the measurements represented in a dataset, the DDI specification provides for full descriptions of the methodology and other study-level information. DDI can capture and document the entire research data lifecycle process, from cradle to grave.

12 Advantages of DDI:
Introduces a common communication protocol to research processes
Increases transparency across systems and software
Interoperates with other standards such as DataCite and Dublin Core
A free and open standard (XML)
Advantages of XML:
Is interoperable; not concerned with any particular OS
Widely used data exchange standard
No licenses or usage requirements
Easily transformed into presentation languages such as HTML, PDF, or plain text
How is it supposed to do this? DDI introduces a common communication protocol for research processes. This, in part, allows it to increase transparency across systems and different software. It also plays nice with other protocols and established standards, especially citation and bibliographic ones, which expands its reach and applicability. It can promise these things because it in turn leverages the advantages of XML: XML is well-suited as a standard exchange language and is a web-publishing standard for richly structured information. XML is easily transformed into presentation languages.
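Since DDI rides on XML, the "easily transformed into presentation languages" claim can be shown concretely. Below is a minimal Python sketch (standard library only) that renders a question fragment as HTML. The element names here (question, codeList, code) are illustrative stand-ins, not actual DDI schema tags:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical question fragment -- illustrative only,
# not real DDI 3.x element names.
SOURCE = """
<question id="q1">
  <text>In general, how would you rate your health?</text>
  <codeList>
    <code value="1">Excellent</code>
    <code value="2">Good</code>
    <code value="3">Poor</code>
  </codeList>
</question>
"""

def to_html(xml_str: str) -> str:
    """Render the question fragment as a simple HTML block."""
    q = ET.fromstring(xml_str)
    lines = [f'<div class="question" id="{q.get("id")}">']
    lines.append(f"  <p>{q.findtext('text')}</p>")
    lines.append("  <ul>")
    for code in q.iter("code"):
        lines.append(f'    <li value="{code.get("value")}">{code.text}</li>')
    lines.append("  </ul>")
    lines.append("</div>")
    return "\n".join(lines)

print(to_html(SOURCE))
```

The same source fragment could just as easily be walked to emit plain text or LaTeX for a printable instrument; that single-source, multiple-output property is what the XML pitch amounts to in practice.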

13 DDI: One Document, Many Uses
This idea that one can create a single document to be used for a variety of purposes is summed up in the phrase: one document, many uses. Tools can utilize existing metadata from the project planning stage to instrument design to dataset production to codebook creation, and can even reuse extant metadata again in a future project.

14 Metadata-driven research reports
"The Sponsorship on Quality recommended that quality reporting should be streamlined and rationalised across the ESS, by using the existing metadata systems and by creating a 'once for all purposes' reporting strategy."
This reusability characteristic is increasingly being recognized. This is a quote from the 2014 SIMS (Single Integrated Metadata Structure) report regarding the ESS. The ideal again: research projects should adopt a common metadata standard throughout the research lifecycle that can be repurposed.

15 Metadata Capture in Practice
In practice, there are a number of barriers to metadata capture more generally, and to adopting DDI specifically, especially early in the research lifecycle.

16 Challenges to adopting/using DDI
Complexity: DDI 3.2 has 1,100 tags; documentation and training
Low level of researcher buy-in; more appealing to large organizations, official statistics
Need for tools: lower entry barriers; utilitarian tools for reuse, not one-off
Organizational resistance to changes in workflow
Complexity: the latest version of DDI includes over 1,100 tags. This makes it comprehensive, but complex, and a perceived entry barrier to potential new users. Documentation of how to use the standard itself needs to address different audiences: study designer, data producer, software designer, software programmer.
Low level of researcher buy-in: the economies and efficiencies are more apparent to larger projects such as official statistics. The primary goal for researchers (who are UWSC's bread and butter) is data and analyses, not metadata capture.
Need for tools: tools can address the perceived complexity of DDI and smooth over those entry barriers to adopting DDI early. Unfortunately, we've seen several tools that are developed for use by only one institution: specific projects obtain funding to produce a DDI tool that isn't reusable in any other situation or by anyone else. Tools can also ease the introduction of changes in typical workflow patterns, integrating documentation processes as they occur; but making such changes organizationally can be monumental.

17 UWSC: An MS Word template
Which brings me to attempting to get UWSC to drink the DDI Kool-Aid. Knowing about all these challenges, I have been trying to convince the folks at the UWSC to adopt metadata-driven survey design and fielding processes. As DDI newbies, their perspectives have been useful because they ask questions about fundamentals that I've just accepted or assumed. Their ignorance has been enlightening.

18 UWSC experience
Goal: a documentation standard that produces one source document that can be reused through the lifecycle; create an authoring tool that clients are familiar with (Word)
Current CAI: CASES (Computer-Assisted Survey Execution System); DDI2 compliant; isolated from other lifecycle stages
UWSC gets it: they realize the potential of reducing duplication in the instrument creation and implementation stages. The ideal: produce one source document that drives instrument creation. UWSC clients tend to use MS Word to produce and deliver instruments to survey research organizations. Even if UWSC buys a DDI tool to drive production, it is unlikely the client will be able to use it (or want to). Instead of trying to lead the horse to water, bring the water to the horse: create a template with a Word front-end and a DDI back-end. UWSC uses CASES CATI software, which is DDI2 compliant, but we've had a spotty history extracting useful metadata from it; it uses very idiosyncratic commands (especially for routing and control). More to the point: CASES executes the processes in the middle of the lifecycle and doesn't play nice with other software.

19 Word Template
Here is a screenshot of UWSC's Word template with some example questions. This is a work in progress. It is a macro-enabled file that resembles a paper SAQ. It contains tracked changes and comments similar to those employed by programmers, project directors, and clients during development, but it also uses the hidden-text font attribute to mark programming or authoring instructions. You can't see it here, but the quick access toolbar at the top of the page contains buttons which set or unset the hidden-text property, and a button which toggles whether to display hidden text.
We haven't mapped this template to DDI yet, but its creation has been influenced by the DDI schema and its conceptualization of Questions as comprised of discrete elements: literal question text, code lists, response domains, control constructs, interviewer instructions, etc. Some assumptions have to be made about how things behave in an instrument (question b follows question a, for example, or radio buttons imply forced choice); they can be defaults, but we have to define what those defaults are in order for the programmers to practically implement the instrument design and application, to instantiate the data capture. Some of this code can be included in the Word template and hidden when necessary.
Most Word documents include formatting like radio buttons and such that the programmers don't necessarily want or need; they want programming-friendly text with much of the Word crap stripped out. Also, things work well with simple cases but break down with complex ones; any more complicated situations need to be documented and programmed manually -- "artisanally."
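The hidden-text trick has a concrete representation in the underlying file format: WordprocessingML marks hidden runs with a <w:vanish/> run property. The sketch below (Python standard library, operating on a hand-written document.xml fragment rather than a real .docx, with hypothetical instrument content) shows how authoring instructions marked as hidden text could be pulled out programmatically:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def hidden_runs(document_xml: str) -> list[str]:
    """Return the text of runs whose run properties carry <w:vanish/>,
    i.e. text marked with Word's hidden-text font attribute."""
    root = ET.fromstring(document_xml)
    out = []
    for run in root.iter(f"{{{W}}}r"):
        if run.find("w:rPr/w:vanish", NS) is not None:
            out.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return out

# A tiny document.xml fragment: one visible run, plus one hidden
# authoring instruction. The content is made up for illustration.
FRAGMENT = f"""
<w:document xmlns:w="{W}">
  <w:body><w:p>
    <w:r><w:t>Q1. How old are you?</w:t></w:r>
    <w:r><w:rPr><w:vanish/></w:rPr>
      <w:t>[PROG: range check 18-99]</w:t></w:r>
  </w:p></w:body>
</w:document>
"""

print(hidden_runs(FRAGMENT))  # only the hidden programming instruction
```

In a real pipeline the fragment would come from unzipping the .docx and reading word/document.xml; the point is just that "hidden text" is machine-readable markup, not something locked inside Word.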

20 This is a spreadsheet that describes each Question and each item’s metadata – this is the most likely place where we will map these properties to the DDI schema.
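A row-by-row mapping from such a spreadsheet to question-level XML might look like the following sketch. The column names and element names here are hypothetical simplifications for illustration, not the actual UWSC sheet or real DDI schema elements:

```python
import csv, io
import xml.etree.ElementTree as ET

# Hypothetical spreadsheet columns -- the real UWSC sheet and the
# real DDI element names will differ; this only shows the shape of
# a row-by-row mapping.
SHEET = """name,text,responses
A1,How old are you?,integer 18-99
A2,In general how is your health?,1=Excellent;2=Good;3=Poor
"""

def sheet_to_xml(csv_text: str) -> ET.Element:
    """Turn one spreadsheet row into one Question element."""
    scheme = ET.Element("QuestionScheme")
    for row in csv.DictReader(io.StringIO(csv_text)):
        q = ET.SubElement(scheme, "Question", name=row["name"])
        ET.SubElement(q, "QuestionText").text = row["text"]
        ET.SubElement(q, "ResponseDomain").text = row["responses"]
    return scheme

scheme = sheet_to_xml(SHEET)
print(ET.tostring(scheme, encoding="unicode"))
```

Once the spreadsheet columns are formally mapped to DDI elements, a script of this shape is all that stands between the Word-template workflow and machine-actionable metadata.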

21 PDF version
Here's the metadata converted into a printable self-administered PDF of the instrument. Notice the layout and graphical characteristics (alternating row shading) that are part of UWSC's internal style guide but are not marked up in the Word template.

22 Web version
Here's the same information but marked up for Web display; there is some typographic HTML incorporated here (underlining and italics), but otherwise this isn't very fleshed out.

23 CASES version
Here's the same information marked up in CASES code that would be used to create the CATI instrument. This markup follows the conventions of CASES.

24 UWSC experience
Obstacles: describe how an instrument both behaves (instrument logic and variable metadata) and looks (layout, display, graphics); especially useful for mixed-mode surveys. DDI is limited in documenting display issues for production, but can reference external content (URLs).
We've encountered obstacles. UWSC wants their tool to describe both how an instrument behaves and how it looks, which is very important for mixed-mode surveys. DDI might be more amenable to describing instrument logic programmatically; it is more limited in describing the display issues needed to drive instrument development and fielding. DDI cannot account for display details or nuances in a standardized way; this is the result of a conscious decision by DDI developers years ago that most display issues were beyond DDI's purview. DDI's saving grace is that it can point to a copy of the finalized instrument hosted online somewhere by including the instrument's URL in Notes fields; but it can't drive instrument creation down to the typographic level. (Note the smiley-face response scale!)

25 Metadata and survey mode
"One important finding, which was not part of the original remit of this investigation, is awareness of how much harder it is to include in the study documentation a questionnaire that has been developed for collecting data on an electronic device rather than on paper. HDSS, which moved to electronic data collection using specialist software like CSPro, need to be aware that for documentation purposes they need to develop paper versions of the questionnaire for explanatory purposes, or supply the code and its interpretation (e.g., as screen shots) as part of the documentation package."
Chifundo Kanjala, Jim Todd, David Beckles, Tito Castillo, Gareth Knight, Baltazar Mtenga, Mark Urassa, and Basia Zaba. (2016). Open-access for existing LMIC demographic surveillance data using DDI. IASSIST Quarterly, Summer.
Here is another acknowledgement of the difficulty of documenting different survey modes with DDI, taken from a 2016 article comparing how a couple of DDI tools performed in demographic surveillance surveys. These authors found it more difficult to document electronic instruments than printed ones.

26 UWSC experience
Obstacles: whose metadata is important? Different types/forms of metadata: producers, users.
Another fundamental distinction posed problems for UWSC, in that the metadata critical to developers and programmers (producers) isn't so critical to end users.

27 Different actors, different metadata needs
Two stakeholders with competing interests: the data collector (producer/designer) wants to document the project management processes involved from conceptualization to fielding of the final instrument; the client (user/analyst) wants to document the results produced by the final instrument and any fielding occurrences that can affect the interpretation of those results.
Different actors value or cherish different metadata at each stage of this lifecycle. Perspectives don't necessarily overlap regarding what metadata is important to capture along the process. Producers are more interested in an audit trail and documenting the development history of the instrument. Users are more interested in what the final instrument produces.

28 Different actors, different metadata needs
From the SIMS report: "Only a certain level of detail and only some of the quality concepts are of interest to the general users of European statistics who are mainly interested in the statistical outputs. On the other hand, all detailed quality concepts (up to the lowest level of detail) are of interest to the producers of European statistics who are also interested in the statistical production processes. Some of the concepts are of interest to both groups."
Citing the SIMS report again: it is apparent that other organizations recognize this too, that different stakeholders require different outputs from the metadata stream and the research data lifecycle.

29 Conclusions

30 Capturing metadata early - Conclusions
Capturing metadata early in the research data lifecycle: one DDI document → repurposed for multiple uses; reduce redundancy and information loss.
Technical issues: across different platforms and systems; instrument behavior and display across modes of administration.
Non-technical issues: distinct and non-overlapping metadata needs, within organizations and across different stakeholders.
Study-level metadata not as problematic as variable-level? AAPOR Transparency Initiative.
Capturing metadata early with DDI makes a lot of sense. I want to end by suggesting one area in which this template idea might make more sense.

31 DDI-Word template later in data lifecycle
Study-level metadata: objectives, population, sampling, methodology, funding or client identifiers, response rates, disposition codes, quality reports, weighting specs. Fewer items, changes, display issues; fewer technical and personnel obstacles.
AAPOR Transparency Initiative: designed to promote methodological disclosure; develop simple and efficient means for routinely disclosing research methods by identifying common disclosure elements.
Unlike the documentation of instrument development and fielding, study-level metadata may be one type of metadata that can be documented by means of a template with single fields for objectives, population, sampling, methods, response rates, weighting, funding, etc. Such reporting requires fewer items to document than instruments do, and less granularity than documenting the development and production of a complicated data capture instrument; display issues are much less of a concern. There are fewer technical and personnel issues, and much of the study-level metadata is intentionally described once, later in the lifecycle. The AAPOR Transparency Initiative is an approach to promote methodological disclosure of the survey methods (study-level metadata) of publicly released studies. Members of the DDI community have reached out to the TI personnel, and there may be some promise in creating a DDI-compliant template to document study-level metadata.

32 Special Thanks to UWSC Programmers: Eric White, Brendan Day

33 Thank you! [email protected]
This presentation is offered under license CC BY-SA 4.0