1 Research Objects for improved sharing and reproducibility Dagstuhl Perspective Workshop on the intersection between Computer Sciences and Psychology Oscar Corcho @ocorcho, Ontology Engineering Group Universidad Politécnica de Madrid (and the Research Object community group)
2 My motivation
3 Some memos from our futuristic scenario: Don’t publish, release (ack: Carole Goble), reloaded (ack: Paul Groth). Don’t just read a paper, but also view it, play with it, and whatever else. Convert passive papers into active scientific storytellers and alert systems.
4 A few quotes from this week. Data (and method) sharing: Dietrich: The method for investigation is not clearly described. Eric: Provide links between articles and datasets (interlinking of scholarly content). William: Methods are normally reduced to a tiny piece of text. Reproducibility: Working group on “the present”: Crisis of replicability is driving increased concern and interest. Eric: 70% of science articles are not reproducible.
5 Act 1 Data and method sharing
6 One of the many origins of “Don’t Publish, Release”: A day in Granada… (January, 2012). Let’s get some of the interesting discussions on the Force11 Dagstuhl meeting into practice.
7 One of the origins of “Don’t Publish, Release”: the Live RO. Scientist timeline: my supervisor calls me to report my work; my supervisor calls me again and we decide to publish our RO+paper; reviews received and final version published; a new PhD student continues my work.
8 How do you usually structure your experiment? In a set of folders? Dropbox? Google Drive? GitHub? Overleaf+figshare? Whatever??? These could be profiles for how you normally structure your research.
9 Scattered Assets: Slideshare, community db, GitHub, figshare, arXiv.org
10 Research Objects: a framework to bundle, port and link (scattered) resources and related experiments. Metadata objects that carry research context. Units of exchange. Multi-various products, platforms, resources. First-class citizens: id, manage, credit, track, profile, focus.
11 RO main principles: Identity, Aggregation, Metadata Description. Identity: refer to aggregations and their contents. Aggregation: describe group & constituents (external ids, local files). Metadata description: interpretation (the objects, how they are linked together) and attribution (who, when, where, why?). Manifest.
12 RO main principles: technologies. Identity (DOIs, URIs, Handles, ORCID): persistence and resolution, names, citation. Annotation (W3C OADM): first-class and stand-off annotation. Aggregation (OAI-ORE): aggregations, resource maps, proxies; packaging in physical and logical containers. Manifest.
Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources. It uses a Resource Map to describe the aggregated resources. Proxies allow for statements about the resources within the aggregation, capturing context and viewpoints. Several concrete serialisations: RDF/XML, Atom, RDFa.
The Open Annotation specification is a community-developed data model for annotation of web resources, developed by the W3C Open Annotation Community Group. It allows for “stand-off” annotations, treats annotation as a first-class citizen, was developed to fit with Web Architecture, and is a point of extensibility.
13 RO principles: Use unique identifiers as names for things. Use some mechanism of aggregation to group things together. Provide metadata about those things and how they relate to each other.
14 RO Model Ontology: defines the core concepts of research objects (identity, aggregation, annotation). Used in the manifest.
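The three principles can be illustrated with a small in-memory sketch. This is only an illustration: the class names, field names and example.org identifiers below are hypothetical and are not part of the RO model ontology or of any RO tool.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Metadata about one resource, or about the aggregation itself."""
    about: str      # identifier of the thing being described
    content: dict   # free-form description, e.g. attribution


@dataclass
class ResearchObject:
    """Hypothetical in-memory research object (not an official API)."""
    id: str                                          # principle 1: identity
    aggregates: list = field(default_factory=list)   # principle 2: aggregation
    annotations: list = field(default_factory=list)  # principle 3: metadata

    def aggregate(self, resource_id: str) -> None:
        """Add a resource identifier to the aggregation, at most once."""
        if resource_id not in self.aggregates:
            self.aggregates.append(resource_id)

    def annotate(self, about: str, **content) -> None:
        """Attach metadata to the RO itself or to an aggregated resource."""
        if about != self.id and about not in self.aggregates:
            raise ValueError(f"{about} is not part of this research object")
        self.annotations.append(Annotation(about, content))


# Example usage with made-up identifiers.
ro = ResearchObject(id="https://example.org/ro/experiment-1/")
ro.aggregate("https://example.org/data/input.csv")  # external id
ro.aggregate("scripts/analysis.py")                 # local file
ro.annotate(
    "scripts/analysis.py",
    created_by="A. Scientist",
    derived_from="https://example.org/data/input.csv",
)
```

The annotation here relates two aggregated resources to each other, which is the "how they relate" part of the third principle.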
15 RO Model Ontology
16 Manifest – remote and local (on my machine)
17 RO Bundle (https://researchobject.github.io/specifications/bundle/): export, archive, publish and transfer ROs. File format for storage and distribution of ROs as a ZIP archive. Includes an RO’s manifest, annotations and some or all of its aggregated resources. Basis for more specific file formats. Backwards compatible: it’s ZIP. Programmatic access: JSON and JSON-LD manifest, API. https://w3id.org/bundle/ doi: /zenodo.10440
18 RO Bundle (https://researchobject.github.io/specifications/bundle/): Capture a Research Object to a single file or byte-stream by including its manifest, annotations and some or all of its aggregated resources for the purposes of exporting, archiving, publishing and transferring research objects. Not everyone is able to set up RESTful semantic web servers; in particular we’ve run into this with desktop applications, where users just want to save files and then decide where they are stored. So we decided to write a serialization format for Research Objects, which we call the RO Bundle. We wanted this to be accessible for application developers, so we’ve adopted ZIP and JSON, and in a way this lets you create research objects and make annotations without ever seeing any RDF. https://w3id.org/bundle/ doi: /zenodo.10440
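A minimal sketch of what "ZIP and JSON" can look like in practice, using only the Python standard library. The mimetype string, the `.ro/manifest.json` location, the context URL and the manifest field names reflect one reading of the RO Bundle specification and should be treated as assumptions to check against the spec; `write_bundle` and the sample file are hypothetical.

```python
import io
import json
import zipfile

# Assumed values; verify against the RO Bundle specification before use.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
CONTEXT = "https://w3id.org/bundle/context"


def write_bundle(target, files, created_by):
    """Write `files` (path -> bytes) and a JSON manifest into a ZIP archive."""
    manifest = {
        "@context": [CONTEXT],
        "id": "/",
        "createdBy": {"name": created_by},
        "aggregates": [{"uri": "/" + path} for path in files],
        "annotations": [],
    }
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The mimetype entry is written first and stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, data in files.items():
            bundle.writestr(path, data)
    return manifest


# Example usage: build a bundle in memory and list its entries.
buffer = io.BytesIO()
write_bundle(buffer, {"data/input.csv": b"a,b\n1,2\n"}, created_by="A. Scientist")
with zipfile.ZipFile(buffer) as bundle:
    entry_names = bundle.namelist()
    stored_manifest = json.loads(bundle.read(".ro/manifest.json"))
print(entry_names)
```

Because the result is an ordinary ZIP file, any archive tool can open it, and the manifest can be read as plain JSON without an RDF toolkit.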
19 Containers
20 Research Objects: Scopes and Tooling. Farr Commons: ISA and FAIR-DOM SEEK, COMBINE, BagIt (soon). White-labelled sci-domain-independent software. Core ontologies and extensions. RO managers/APIs/bundling (Ruby, Java, Python). Latex2RO. LDP4RO.
21 Publishing may be as easy as… providing the URL of the Research Object to the publisher, with a release tag, to start the review process (if extra review is needed).
22 Act 2 Reproducibility
23 Terminology. Preservation: keep it in a perfect/unaltered condition, preserving the integrity and authenticity. Conservation: the action of prolonging the existence of significant objects; researching, recording and retaining all information related to the object; documenting. Restoration: return something to an earlier condition. Reconstruction: forming again, with improvements or removal of defects. “Two opposing factions had emerged within the environmental movement by the early 20th century: the conservationists and the preservationists. The conservationists (such as Gifford Pinchot) focused on the proper use of nature, whereas the preservationists sought the protection of nature from use. Put another way, conservation sought to regulate human use while preservation sought to eliminate human impact altogether.” Inspired by [Goble, 2012]
24 Terminology Preservation Inspired by [Goble, 2012]
25 Terminology Preservation Conservation Inspired by [Goble, 2012]
26 Terminology Preservation Restoration Conservation Inspired by [Goble, 2012]
27 Terminology Preservation Restoration Conservation Reconstruction Inspired by [Goble, 2012]
28 The Research Method in different disciplines: IN VIVO/VITRO and IN SILICO; INPUT DATA, SCIENTIFIC PROCEDURE, EQUIPMENT
29 The Research Method in different disciplines: laboratory protocol (recipe), experiment, lab book; workflow, digital log. What: detect common groups of tasks. vs. How: exact and inexact FGM techniques. vs. Why?
30 The Research Method in different disciplines: IN VIVO/VITRO and IN SILICO; INPUT DATA, SCIENTIFIC PROCEDURE, EQUIPMENT
31 Some problems in lab protocols: “Incubate the centrifuge tubes in a water bath.” “Incubate the samples for 5 min with gentle shaking.” “Rinse DNA briefly in 1-2 ml of wash.” “Incubate at -20C overnight.” Some protocols present insufficient granularity, and the instructions can be imprecise or ambiguous because they are written in natural language.
32 Currently… How to formalize the information from laboratory protocols as a knowledge base? Semi-structured information; ontologies + NLP tools; unstructured information.
33 SMART Protocols - document: Rhetorical and structural components (e.g. introduction, materials, and methods); information such as the application of the protocol, advantages and limitations, list of reagents, critical steps.
34 SMART Protocols - wf: Representation of the workflow aspects in protocols: the implicit order in the instructions, following the input/output structure.
35 SMART Protocols documentation. The SMART Protocols ontology is available here: Giraldo O, García-Castro A, Corcho O. SMART Protocols: SeMAntic RepresenTation for Experimental Protocols. LISC2014. The ontologies are available here, and a paper describing the ontology design was recently accepted at the Linked Science 2014 workshop. So far, we have covered how to formally report a lab protocol.
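One way to make the implicit order explicit is to declare, for each instruction, what it consumes and what it produces, and then derive the order from those declarations. The sketch below does this with the standard-library `graphlib` module (Python 3.9+). The step names, inputs and outputs are hypothetical and are not taken from the SMART Protocols ontology.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical DNA extraction steps, deliberately listed out of order.
# Each step declares the materials it consumes and produces.
steps = {
    "rinse DNA":       {"inputs": {"DNA pellet"}, "outputs": {"washed DNA"}},
    "lyse cells":      {"inputs": {"sample"},     "outputs": {"lysate"}},
    "resuspend DNA":   {"inputs": {"washed DNA"}, "outputs": {"DNA solution"}},
    "precipitate DNA": {"inputs": {"lysate"},     "outputs": {"DNA pellet"}},
}


def execution_order(steps):
    """Order steps so each one runs after the steps that produce its inputs."""
    # Map every produced material to the step that produces it.
    producer = {
        out: name for name, step in steps.items() for out in step["outputs"]
    }
    # For each step, collect the steps it depends on (its predecessors).
    # Inputs with no producer (e.g. the raw sample) add no dependency.
    graph = {
        name: {producer[i] for i in step["inputs"] if i in producer}
        for name, step in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

If two steps claimed to need each other's outputs, `static_order()` would raise `graphlib.CycleError`, which is a cheap check for an inconsistent protocol description.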
36 SMART Protocols in action. [Figure: RDF graph of an example protocol. Node labels: sp:experimental protocol, sp:DNA extraction protocol, sp:title of the protocol, sp:author entry, sp:advantages, sp:sample, sp:application of the protocol. Edge labels: rdf:type, owl:subClassOf, sp:hasTitle, sp:hasAuthor, ro:partOf.] Prefixes: sp = SMART Protocols, ro = Relation Ontology.
37 SMART Protocols in action
38 The Research Method in different disciplines: IN VIVO/VITRO and IN SILICO; INPUT DATA, SCIENTIFIC PROCEDURE, EQUIPMENT
39 Vocabularies and methodologies for representing and publishing workflows. Workflow provenance; workflow plan; methodology for workflow publishing. [Figure labels: WINGS on local laptop, WINGS on shared host, WINGS on web server; Wings workflow generation; workflow template; workflow instance; PROV export; OPM/PROV conversion; publication; core portal; RDF triple store; Linked Data publication; interactive browsing (Pubby frontend); programmatic access (external apps); users; other workflow environments; share; reuse.] Repository of linked workflows:
Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: abstractions, standards, and linked data. (WORKS '11). ACM, New York, NY, USA.
Daniel Garijo and Yolanda Gil. Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. In Proceedings of the 2nd International Workshop on Linked Science 2012, Boston, 2012.
40 Definition of workflow abstractions. Catalog of common, independent workflow abstractions (motifs). Data-oriented motifs: what kind of manipulations does the workflow have? Workflow-oriented motifs: how does the workflow perform its operations? Based on an analysis of 260 workflows from 10 domains, belonging to 5 different workflow systems. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, Volume 36, July 2014.
41 Finding and evaluating common abstractions (https://github.com/dgarijo/FragFlow). Graph mining techniques; workflow fragment filtering techniques; workflow fragment representation and linkage. Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris A. Gutman, Ivo D. Dinov, Paul Thompson, and Arthur W. Toga. FragFlow: Automated Fragment Detection in Scientific Workflows. In The 10th IEEE International Conference on e-Science, Guaruja, 2014.
42 How to preserve Workflows/Research Objects? Three main ways/levels: (1) descriptive reproducibility: documentation; (2) workflow execution reproducibility: can we run the workflow? (3) workflow results reproducibility: can we get the same results? Checklists! Corcho et al: Checklist for workflow conservation (40 different aspects: goals, results, metadata). Corcho et al: Checklist for a workflow conservation plan (based on the DCC’s data management plan).
43 Some examples: levels of reproducibility; workflow conservation plan.
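The three levels can be read as successive checks. The sketch below is one possible interpretation, not the checklist by Corcho et al.: the function names are hypothetical, "documentation exists" is reduced to a non-empty string, and "same results" is assumed to mean byte-identical output, which is stricter than many domains would require.

```python
import hashlib


def sha256(data: bytes) -> str:
    """Hex digest used to compare workflow outputs."""
    return hashlib.sha256(data).hexdigest()


def reproducibility_levels(documentation, run_workflow, expected_digest):
    """Return the levels satisfied by a workflow, checked in order."""
    levels = []
    # Level 1, descriptive: is there any documentation at all?
    if documentation.strip():
        levels.append("descriptive")
    # Level 2, execution: can we run the workflow without an error?
    try:
        output = run_workflow()
    except Exception:
        return levels
    levels.append("execution")
    # Level 3, results: does the output match the recorded reference?
    if sha256(output) == expected_digest:
        levels.append("results")
    return levels


# Example usage with a toy workflow standing in for a real one.
def toy_workflow():
    return b"mosaic-v1"


reference = sha256(b"mosaic-v1")
print(reproducibility_levels("README: run toy_workflow()", toy_workflow, reference))
```

A real check would replace the digest comparison with a domain-specific notion of equivalent results, for example a numeric tolerance.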
44 The Research Method in different disciplines: IN VIVO/VITRO and IN SILICO; INPUT DATA, SCIENTIFIC PROCEDURE, EQUIPMENT
45 Reproducibility of Computational Scientific Experiments. [Figure labels: FORMER EQUIPMENT, ANNOTATE, SEMANTIC ANNOTATIONS, REPRODUCE, EQUIVALENT EXECUTION ENVIRONMENT, CLOUD.] Workflow systems and workflows: Dispel4Py (Internal Extinction, Seismic Cross Correlation); Pegasus (Montage, SoyKB, Epigenomics); Makeflow (Blast).
46 Our Approach to Experiment Conservation: WICUS Framework overview. This is an overview of the system we propose. WICUS stands for Workflow Infrastructure Conservation Using Semantics…
47 Some results: Pegasus Montage Workflow. Astronomy workflow: construct large image mosaics of the sky. Montage software distribution: 59 binaries. Target IaaS cloud providers: Amazon EC2 & FutureGrid; Vagrant. RO available at
48 Lessons learned for Anna. Research Objects as a concept: identity, annotation, aggregation; adapted to the tools/infrastructure for each domain; with some tooling available already. It’s not just data preservation but also methods: lab protocols, computational workflows. Understand what reproducibility means for you.
50 Acknowledgements. The Semantic e-Science team at UPM: Carlos Badenes, Daniel Garijo, Olga Giraldo, Rafael González-Cabero, Idafen Santana. The Wf4Ever team: Carole Goble, José Manuel Gómez Pérez, Raúl Palma, Jun Zhao, Stian Soiland-Reyes, Khalid Belhajjame, José Enrique Ruíz, Marco Roos, Lourdes Verdes-Montenegro, Norman Morrison, Sean Bechhofer, Graham Klyne, Matt Gamble, and a large etcetera. The Research Object community group.