

1 Momentum towards an efficient data economy Peter Wittenburg Max Planck Computing and Data Facility

2 Overview: brief introduction of my background; re-use and combination of data (change of paradigm in science – 3 examples); problems when integrating data and new challenges; core components to achieve higher efficiency, and an example of one core component; Global Digital Object Cloud and Type-Triggered Automatic Processing as concepts; next steps

3 Active at the MPI for Psycholinguistics. Trained as an electronic engineer, I worked for many years as director of the department responsible for technology and methodology at the Max Planck Institute for Psycholinguistics. The core task of the institute was and is to understand how our human language faculty works, i.e., how our brain processes and learns language. From its beginning the institute was experimentally oriented, i.e., all kinds of observations and experiments were used to better understand the functioning of our brain. For the last four years I have worked as senior advisor for data systems at the Max Planck Computing and Data Facility. From 2000 on I was active in leading roles in European, national and international projects and initiatives. Head of Technology and Methodology Department; Understanding Human Language Capacity; data as basis from the beginning; for four years at the Compute & Data Facility of the MPG (MPCDF); Coordinator of RDA Europe

4 NoMaD infrastructure (FHI Berlin + MPCDF). Novel Materials Discovery project; computational materials science. Many labs create data about materials and compounds (experiments + simulations); the space of chemical compounds is endless. Until now: publication of papers (also infinite). Now: how can we categorise this space to quickly find useful compound materials, moving from the periodic system to a multi-dimensional map of compound materials, with categorisation via machine learning etc.? But: integration of data from many labs worldwide is required. And: how to make integration efficient? How to correctly refer to data? Who is going to use this aggregated data space? etc. Revolution: writing papers is not the only scientific goal anymore. In materials science two projects focus on global infrastructures to make it much easier to quickly determine which (compound) material would be most suitable for a certain new application. In the US the Materials Genome Initiative is, amongst others, aggregating in particular experimental data from many different labs, and in Europe it is the NOMAD project that is harvesting in particular simulation data from all over the world. The goal is to use this aggregated data to find hidden structures and perhaps extend the periodic system of chemistry by a new multidimensional classification system. It is interesting that science is now not using data only to create scientific papers, but is re-using data for new purposes unforeseen beforehand. Problematic is that the aggregated data is not easy to combine because, for example, the associated metadata descriptions are of poor quality.

5 DOBES Project on Endangered Languages. ~70 global teams; one central archive; ~80 TB in the online archive; web-based, open deposit; 4 dynamic external copies at remote archives. The DOBES project on documenting endangered languages is another project in which much data is being aggregated, in this case by more than 70 teams working on many languages worldwide. At the MPI for Psycholinguistics about 80 TB of data about languages have been aggregated, and this allows completely new types of analysis across languages, such as "how did languages in certain regions evolve across thousands of years" or "are certain languages easier for our brain to process than others". Here too we could observe a revolution, in so far as linguists realized that it makes sense to offer the data they recorded to researchers globally. New questions similar to the NOMAD case were raised: How can we make data integration and re-use much more efficient? How to correctly refer to data when data is being re-used in different contexts? Who is allowed to make use of all this aggregated data? How can one use data to validate theories about the evolution of languages (and cultures) over thousands of years? How to understand which languages are more "economic" than others? Also here: integration of much data from many labs worldwide. And: how to make integration more efficient? How to correctly refer to data? Who is going to use all the aggregated data? etc. Revolution in the humanities: the scientific paper is not the only goal anymore.

6 Human Brain – robust & fragile. The Human Brain Project is intended to provide new instruments for better understanding the brain and its fundamental mechanisms, and for applying this knowledge in the medicine and informatics of the future. Information and communication technologies will play a central role in the project. Suitable supercomputing platforms are to be developed in order to prepare neuroscientific data from all over the world for models and simulations of the brain. Neuroscience thus gains a common basis for connecting information at the level of genes, molecules and cells with human thinking and behaviour. Similarly, a novel medical informatics platform is to make clinical information from all over the world usable for computer models of diseases, in order to develop techniques for the objective diagnosis of brain diseases, to investigate their underlying mechanisms and to accelerate the development of new therapies. A further project goal is to use a better understanding of how the brain works to advance information and communication technologies; here the focus is on improved energy efficiency and reliability as well as better mastery of programming complex computer systems. Human Brain – robust & fragile: continuous increase of brain diseases; how can we detect their causal basis, how to detect them early, how to medicate them? Machine learning allows correlating patterns in data (brain images, genes, proteins, reactions, etc.) with phenomena. But: much data from various specialized labs and hospitals is required. And: how to make integration efficient, how to correctly refer to data, who is going to use all the aggregated data, etc. A third example can be taken from brain research.
We can observe that brain diseases are increasing and that it seems to be impossible to determine the causes of such diseases with traditional scientific methods. Therefore, much data has to be aggregated in projects such as the Human Brain Project from different expert labs and hospitals so that one is able to correlate patterns in different data sets (gene data, protein data, neuroimaging data, reaction data, etc.) with phenomena. Here machine learning techniques are being used to identify typical signatures of such diseases, enabling early predictions, early treatments etc. Big legal and ethical challenges need to be addressed, since it must be possible to transfer sensitive data to trustworthy centers to enable such calculations. Ferath Kherif, HBP. Revolution in the medical world: how to make essential data available outside of the hospitals? How to cope with legal & ethical problems?

7 Interoperability Dimensions. As indicated, data aggregation and integration is currently very inefficient and costly. Terms need to be negotiated with the different data providers about the conditions under which data may be accessed, transferred and re-used. Unclear rights situations already make this step a time-consuming endeavour. Once agreements have been reached, one is confronted with incompatibilities and lack of quality in the data itself. The organisation of data differs (data models, linking between data and metadata, etc.), structures are different and partly unknown, metadata is of poor quality, etc. From industry it is known that 60% of the effort in data projects is spent on overcoming these interoperability problems. Often data needs to be processed in distributed infrastructures where federation aspects need to be addressed by detailed agreements. Finally, effort needs to be spent on making data reproducible, on being able to uniquely refer back to data from publications, etc.

8 Can we simply continue? No, because ... our data landscape is fragmented – only little fits together (identification, organization and description of data, storage systems, etc.); 80% of all created data is no longer accessible after short time periods; 80% of the time of expensive data scientists is wasted on typical data management tasks; data volumes and complexity will increase extremely due to new developments (in science and industry); we are not fit for this new phase! (one of the reasons for RDA). The question can and must be asked whether we can continue working with data the way we are right now, and the answer is simply NO. Powerful institutes or companies are able to spend the huge effort needed to do data-driven science and overcome all the mentioned problems. However, we spend too much money on this, many opportunities hidden in data are simply not exploited, and many institutes and companies are excluded from data-intensive work. Our data landscape is fragmented at all layers, from data creation on. 80% of the data created in science is no longer accessible after short periods of time, and 80% of the time of expensive data scientists is wasted on typical data management tasks. With the upcoming wave of billions of smart devices, all generating continuous streams of data with high time resolution, we recognise that we are just at the beginning of the data tsunami and that we are not ready to cope with the challenges of the coming phase. 50 billion smart devices will create true data monsters.

9 Development of Devices and Data. Sources from Intel (millions of devices) and Oracle (exabytes) indicate which developments are expected for the coming years.

10 Fundamental Change Through IoT I. Domains: humans, cyberinfrastructure, physical objects. In particular, the Internet of Things changes the rules of the data game even more dramatically. Let's assume three domains: the physical objects of this world, human beings and some cyberinfrastructure. Adapted from Chris Greer, NIST.

11 Fundamental Change Through IoT II. Humans as actors and mediators; Internet, WWW, etc.; physical objects. Until now humans were largely the mediators between the cyberinfrastructure and the world of physical objects, i.e., humans largely controlled all interactions. Adapted from Chris Greer, NIST.

12 Fundamental Change Through IoT III. Humans often bypassed; physical objects acting directly on the cyberinfrastructure. With the coming of IoT we see that humans will increasingly often be bypassed, since physical objects are connected to the cyberinfrastructure and will increasingly often exchange data with other physical objects directly. Much of the data needs to be stored in repositories to enable processing for optimisation purposes etc. Adapted from Chris Greer, NIST.

13 Fundamental Observation. Scientific creation and analytics: leave flexibility, even more opportunities. Management, curation, access: reduce heterogeneity & costs, make solutions stronger, achieve sustainability. One observation we can make is that despite all differences between scientific disciplines in the way they generate and analyse data, the way they manage, curate and access data largely makes use of the same core components across disciplines and sectors. Currently, thousands of initiatives and projects create their own widely different variants of core components such as persistent identifier systems, distributed authentication and authorisation, repositories, registries, etc. There are many such core components where common solutions would be possible and would help overcome the fragmentation. The message we can extract from this is that we need a drastic reduction of the current solution space. PID, AAI, MD, WF, registries, repositories, meta-semantics, etc.

14 Global and Persistent IDs as Anchors. One example of the need for a common solution is the PID system. After 20 years of discussions the requirements have been sorted out, and we could realize a common worldwide solution which can be compared with the IP system for connecting computers. Each Digital Object (data, metadata, software, configurations, etc.) should have its own PID, which in principle would allow us to speak about "registered domains of digital objects" and "networking digital objects". It is increasingly obvious to many of us in the Research Data Alliance and beyond that we need another surge of "momentum" to overcome the huge fragmentation. Starting with PIDs could again be the kick-off point. The PID system is a catalyzer to define a new basis and to come to new services; we need a change, we need a momentum.

15 RDA DFT – Simple powerful data model. Data Foundation and Terminology: the core model is very simple. If all software developers implemented this model, we would get an enormous increase in efficiency; deviations can become very expensive. Based on many use cases from different disciplines and initiatives, the RDA Data Foundation and Terminology Working Group defined a basic core model which could be adopted as a unifying generic model. At its core it simply states that each Digital Object may have a bit sequence which should be stored in trustworthy repositories, has been assigned a PID and has been associated with useful metadata. It also states that metadata and collections are themselves Digital Objects. If all software developers adhered to this simple data model, we would already have achieved a lot towards interoperability.
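The core model described above can be sketched as a minimal data structure. This is an illustrative sketch, not a normative schema from the working group; the class names, field names and sample PID strings are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DigitalObject:
    """DFT core model (sketch): a Digital Object has a PID, may have a
    bit sequence stored in a repository, and is associated with
    metadata, which is itself a Digital Object referenced by PID."""
    pid: str                               # persistent identifier, e.g. a Handle
    bit_sequence: Optional[bytes] = None   # the actual content, if any
    metadata_pid: Optional[str] = None     # PID of the describing metadata DO

@dataclass
class Collection:
    """A collection is also a Digital Object: it has its own PID and
    aggregates member DOs by their PIDs, not by copying their bits."""
    pid: str
    member_pids: List[str] = field(default_factory=list)

# Example (all PIDs are made up): a dataset, its metadata, and a collection
data = DigitalObject(pid="21.T11998/0001", bit_sequence=b"...",
                     metadata_pid="21.T11998/0002")
meta = DigitalObject(pid="21.T11998/0002", bit_sequence=b"<dc>...</dc>")
coll = Collection(pid="21.T11998/0003", member_pids=[data.pid, meta.pid])
```

The point of the sketch is the uniformity: software that only ever navigates PID, bit sequence and metadata links works unchanged for data, metadata and collections alike.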

16 PID System and State Information. PID record: checksum, access paths, metadata, rights, data copies, relations, provenance. Increasingly, in RDA and beyond, data professionals see the PID as the anchor point of a persistent data landscape. If we assume that PIDs are persistent and can be resolved to useful information over long periods of time, then it makes sense to add crucial state and binding information about the DO to the PID record. Of course one should add access path information for the copies of the bit sequences to the PID record, but information such as the checksum to prove identity and integrity at all times, pointers to the metadata description, to landing pages, to the access rights records in distributed scenarios, etc. is also highly valuable. Some repositories also want to store pointers to earlier and follow-up versions within the record. Currently, an RDA Working Group is defining a core set of such information types. The rule must then be that repositories decide which types they use, and that they use types that have been defined and registered in type registries to achieve interoperability.
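Such a PID record with typed state information can be sketched as a list of (type, value) entries. The type names used here (URL, CHECKSUM, METADATA, PREDECESSOR) are hypothetical stand-ins for whatever the working group registers, and all identifiers are invented:

```python
import hashlib

# A PID record as typed entries; types and values are illustrative only.
pid_record = {
    "pid": "21.T11998/0001",
    "entries": [
        ("URL", "https://repo.example.org/objects/0001"),    # access path to a copy
        ("URL", "https://mirror.example.org/objects/0001"),  # replica
        ("CHECKSUM", hashlib.sha256(b"the bit sequence").hexdigest()),
        ("METADATA", "21.T11998/0002"),    # pointer to the metadata DO
        ("PREDECESSOR", "21.T11998/0000"), # earlier version
    ],
}

def verify_integrity(record, bits):
    """Prove identity and integrity of a retrieved copy by recomputing
    the checksum and comparing it against the one stored in the record."""
    stored = [value for typ, value in record["entries"] if typ == "CHECKSUM"]
    return hashlib.sha256(bits).hexdigest() in stored
```

Because the checksum lives in the PID record rather than in any single repository, any of the replicas reachable via the URL entries can be verified against it.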

17 Worldwide Handle System. Independent Swiss foundation DONA; board of international experts; redundant network of root nodes; contracts. We are making ourselves very much dependent on the availability and persistence of a PID system which needs to be global, robust, highly available and well-governed, and which must allow a PID to be resolved to useful state information. Many of us believe that we already have such a system: the Handle System has evolved over the last 20 years and has reached a state of maturity that allows us to rely on it. It is now hosted by the Swiss DONA Foundation registered in Geneva and governed by the international DONA Board, which includes an increasing number of experts from various countries. Its robustness is guaranteed by a redundant multi-node root system which includes an increasing number of powerful centres from around the globe. These nodes also act as registration authorities, making contracts with academic, industrial and other service providers. One of the stakeholders is the International DOI Foundation, which has always been Handle-based. So there is a system we can rely on and in which it makes sense to invest, in order to support this infrastructure at a global level so that everyone can use it. CNRI, CHSC, GWDG, IDF, CITC SA, RU. Services in Germany: EPIC, DataCite.
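Handles can be resolved over plain HTTP via the proxy's REST interface (a GET on https://hdl.handle.net/api/handles/&lt;handle&gt; returns the record as JSON). As a sketch, the function below picks the access URLs out of such a response; the sample record is made up, only its shape follows the Handle JSON format, and the handle "10.1000/demo" is illustrative.

```python
import json

def extract_urls(handle_json: str):
    """Return all URL-typed values from a Handle proxy JSON response.
    The response shape assumed here is the one served by hdl.handle.net:
    {"responseCode": ..., "handle": ..., "values": [{"type": ...,
    "data": {"format": ..., "value": ...}}, ...]}"""
    record = json.loads(handle_json)
    return [v["data"]["value"]
            for v in record.get("values", [])
            if v.get("type") == "URL" and v["data"].get("format") == "string"]

# Made-up sample response in the Handle JSON shape (no network access):
sample = json.dumps({
    "responseCode": 1,
    "handle": "10.1000/demo",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://repo.example.org/demo"}},
        {"index": 100, "type": "HS_ADMIN",
         "data": {"format": "admin", "value": {}}},
    ],
})
print(extract_urls(sample))  # → ['https://repo.example.org/demo']
```

Administrative entries such as HS_ADMIN are part of every record but are filtered out here, since a client resolving a DO only cares about the typed state information.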

18 Towards a Global Digital Object Cloud (taken from Larry Lannom, CNRI). Given the existence of a stable PID system, we can start thinking of maximizing its use to make data practices much more efficient. We envisage a system where users primarily deal only with the fingerprints of Digital Objects (PID record information and metadata) to form, for example, collections on which to carry out some management or analysis operation. Users do not care which services in the "cloud of registries and repositories" provide the PID, metadata, access rights etc. information, and they also do not care how the bit sequences are finally accessed to perform operations. The basic principles of the Global Digital Object Cloud are not new, since it is about virtualisation. However, it is now time to systematically create a domain where this can be done at a global level for all types of users, independent of proprietary solutions. Some work with limited funds is already being carried out; the US and EU are leading.

19 Towards Type-Triggered Automatic Processing. The massiveness of data streams and the wish to re-combine data require radical shifts: agents should react to incoming data that is suitable for the specific business case; digital objects "find themselves"; data events; structured data markets. The basis are Digital Objects (data, software, configurations, etc.) and types. In addition to the innovative scenario of the Global Digital Object Cloud, we can foresee that, due to the vastly increasing data volumes and complexity created by billions of smart devices, our way of acting on data will need to change dramatically. The user of data, be it in science or industry, will need to specify profiles of the data that is of interest for his/her analysis. Agents must then scan web locations where data repositories indicate which new types of data have been received and are available for access to experts in trust federations. When type-defined profiles match some new data being offered, typed processing services will be activated. The results may be stored in repositories and offered back as new data types ready for further processing. In this process researchers, for example, no longer operate procedurally but declaratively, since they only specify the profiles that may trigger actions. Such a scenario would be truly automatic. Either the scientist or some service provider will create the necessary scientific algorithms that process the data. Data federation, agents, Data Type Registry, processing services, result scripts.
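The declarative pattern described above can be sketched in a few lines: users register a profile (the data type they care about) with a processing service, and an agent matches announced Digital Objects against the profiles. The type names ("eeg/raw", "eeg/filtered") and the trivial "processing" step are invented for illustration.

```python
# Sketch of type-triggered processing: profiles map a data type to a
# processing service; new typed DOs trigger the matching service and
# the result is offered back as a new typed DO.
from typing import Callable, Dict, List, Tuple

Service = Callable[[bytes], Tuple[str, bytes]]
profiles: Dict[str, Service] = {}

def register_profile(data_type: str, service: Service) -> None:
    """Declare interest: when a DO of this type arrives, run this service."""
    profiles[data_type] = service

def on_new_digital_object(data_type: str, bits: bytes,
                          results: List[Tuple[str, bytes]]) -> None:
    """A repository announces a new typed DO; if a profile matches, the
    service fires and its output (a new typed DO) is collected."""
    service = profiles.get(data_type)
    if service is not None:
        results.append(service(bits))

# Example: a hypothetical type "eeg/raw" triggers a (stand-in) filtering step
register_profile("eeg/raw", lambda bits: ("eeg/filtered", bits.upper()))

produced: List[Tuple[str, bytes]] = []
on_new_digital_object("eeg/raw", b"signal", produced)
on_new_digital_object("image/png", b"...", produced)  # no profile, so ignored
```

Note that the researcher never calls the processing step directly: registering the profile is the whole interaction, which is what makes the scheme declarative rather than procedural.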

20 What are others doing? US: start of an intensive interaction between science and economy about the data implications of IoT/CPS. ITU: understanding the necessity of a global solution for the identification of DOs (comparison between bar codes, DOA, etc.). China: implementation of a food supply chain control solution based on Handles. China: discussion of a national solution for the identification of DOs, available for science and industry. Workshop on 6 June at the IoT Week in Geneva. Outside of RDA a number of activities have been started to look further into issues related to what has been shown in the previous slides. In the US there was recently a workshop addressing the additional requirements introduced by IoT. ITU is comparing methods for robust identification. In China PIDs have been used systematically to implement an industrial food supply chain control solution, and a national solution for identification is being discussed. A workshop at the IoT Week on 6 June in Geneva will address global solutions as well. This is just a small selection of events indicating convergence of discussions.

21 What next? We need a well-supported and persistent global PID registration and resolution infrastructure available for everyone, and support for developing layered services. We need exchange platforms across borders (disciplines, sectors, nations) to interact about implications, standards, technologies etc. We need the capacity to build a testbed for the Global Digital Object Cloud. We urgently need education of data managers and data scientists at all levels, including industry. We need to support data entrepreneurship to get young people interested in making smart use of huge data sets. There are a number of steps that need to be taken urgently to overcome the barriers to efficient data practices; this slide presents just a selection of them that was recently communicated to the German Ministry for Education and Research. We have a global PID resolution system fulfilling essential requirements, but we need to make sure that there are funds allowing the service providers to offer professional support and to add layered services making optimal use of the potential of PIDs. We need to continue and probably intensify interactions at various levels and across disciplines and sectors to come to decisions that will narrow down the solution space drastically. There should be funds to build a comprehensive testbed for the Global Digital Object Cloud, since the current implementation work is not sufficient to show its full potential. The same holds for the Type-Triggered Automatic Processing concept: we need to start implementing and testing as soon as possible. Of course we need much effort to train data managers and data scientists, and also to engage a new generation of entrepreneurs investing in data re-use.

22 Thanks for your attention. IoT Week workshop, Geneva. RDA Data Fabric IG: https://www.rd-alliance.org/group/data-fabric-ig.html RDA Plenary P9, Barcelona: https://www.rd-alliance.org/plenaries/rda-ninth-plenary-meeting-barcelona