1 Open Data & Data Repositories. ARD Prasad, Indian Statistical Institute, Bangalore
2 Precursors: Open Source Software; Open Access to Information / Open Content; Open Courseware (eLearning); Open Standards. Emerging: Open notebook science; Open research / open science / open science data; Open Data / Linked Open Data. I call it the Open Mantra.
3 Open Notebook Science: The practice of making the entire primary record of a research project publicly available online; Placing the researcher's personal or laboratory notebook online along with all raw and processed data; Opening even failed, less significant, partial and otherwise unpublished experiments, the so-called 'Dark Data'.
4 Why is data not published? Many publications use data, but the article itself may not contain the complete data that was used. Publishers: lack of space (not an issue on the Web); loss of revenue. Authors: the author might have overlooked the data; the author deliberately did not present the data, so that others cannot verify it.
5 For Example: Some suspect that Sigmund Freud's data was fictitious, and not just the names; The recent controversy over the claim that subatomic particles move faster than light.
6 Closed Data (not completely open): No data is published; Access allowed only on payment; Access allowed only to registered users; Encrypted data; Data requires a proprietary tool to access; Copyright/licence/patent forbidding reuse; Robots/spiders not allowed to access data (e.g. by using CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart); Time-limited access, no bulk downloads; Political, legal or commercial pressure restricting or banning access (Boris Pasternak, Salman Rushdie).
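One common way a site keeps robots/spiders away from its data, alongside CAPTCHA, is a robots.txt file. The sketch below is only an illustration of that mechanism, not something taken from the slides: the robots.txt content, the repository address and the crawler name are all made up, and the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for a hypothetical repository that keeps crawlers out of /data/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /data/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

AGENT = "example-harvester"  # hypothetical crawler name
URLS = (
    "https://repo.example.org/about",
    "https://repo.example.org/data/rainfall.csv",
)

for url in URLS:
    verdict = "allowed" if parser.can_fetch(AGENT, url) else "blocked"
    print(url, verdict)
```

A harvester that respects such rules cannot collect the data in bulk, which is why this kind of restriction counts as closed data here.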
7 Against Open Data: Privacy concerns (recent US controversy); Data was collected using costly equipment or hired manpower; Private organisations that collect and curate data should get back their investment.
8 Debatable: Data in the wrong hands; How to make crude bombs?; Some issues related to pornography; NSA vs. terrorism: governments, organisations and terrorists can misuse data.
9 In Support of Open Data: Data belongs to mankind; Data is mostly generated with public money; Facts cannot be copyrighted; The value of data is fully realised only if it is widely used and reused; Restrictions will result in an anti-creative-commons; Open data will create more harmony; It will accelerate scientific research.
10 If data is openly available ... Others may draw conclusions contrary to the author's; Others may deal with other facets of the data; Data transparency supplements the objectivity and self-correctiveness of science; If patients' case histories are openly available, they will contribute significantly to medical research.
11 Information/Digital Divide: Until now, publications have slowly been made available through the open access (OA) movement; OA has helped bridge the digital divide to a large extent, especially in the Humanities, to a lesser extent in the Social Sciences, and even less in the Physical and Natural Sciences; The Physical and Natural Sciences also require laboratory infrastructure.
12 Philosophy: Data should be freely available without restrictions such as copyright, patents etc.; The exception is classified data.
13 Bad side of science: Philosophy of Science; Sociology of Science; Science is vastly used for defence purposes (to kill people physically) and for profit making (to rob people economically); Of course, not without a few good side effects.
14 Precursors to Open Data: Astronomy data (presently, no border disputes in the universe); Data could be easily liberated, as the publishing industry, at least in print form, is not in a position to print vast quantities of data; But the Internet can provide mechanisms to go either way.
15 Manifestations of Data: The emphasis of Open Data is mostly on non-textual material, as textual information is mostly covered by institutional/digital repositories; Examples: maps, genomes, chemical compounds, mathematical and scientific formulae, medical data and case histories, bioscience and biodiversity.
16 Examples of Data: Accelerator data (Nuclear Physics); Weather data (Meteorology); Genome data; Statistical data (Govt. or non-Govt. data).
17 Amazon Web Services (AWS): Public Data Sets on AWS; Annotated Human Genome Data provided by Ensembl (the Ensembl project produces genome databases for human as well as almost 50 other species, and makes this information freely available); Various US Census databases from the US Census Bureau (demographic data, US Censuses, summary information about business and industry, economic household profile data); UniGene provided by the National Center for Biotechnology Information.
18 Data repositories: Astronomy: Sloan Digital Sky Survey DR6 Subset. Biology: Influenza Virus (including updated Swine Flu sequence); Ensembl Annotated Human Genome Data; GenBank. Chemistry: PubChem Library (a data set of information on the biological activities of small molecules); 3D Version of the PubChem Library; UGI Virtual Conformer Library (500,000 molecules for virtual screening).
19 Data repositories (cont.): Climate: Daily Global Weather Measurements. Economics: Federal Reserve Economic Data; Transportation Databases; Labour Statistics Databases; US Census Business and Industry Summary Data.
20 Registry of Data Repositories: Popular data registries: Databib and re3data.org; Databib connects to 978 data repositories and databases (agriculture, geosciences, social sciences, biological sciences); re3data.org currently lists 634 research data repositories from different disciplines, 586 of which are described in detail using the re3data.org schema; In future, Databib and re3data.org are likely to be merged into one service. Note: A registry entry provides the URL of the data repository and a brief description of it; one still has to visit the repository and download the data manually. Again, there is no protocol to expose the metadata of data providers.
21 Examples of Open Data in Science: data.uni-muenster.de (open data about scientific artifacts from the University of Muenster, Germany; launched in 2011); linkedscience.org/data (open scientific datasets encoded as Linked Data; launched in 2011).
22 Open Data of Governments: dados.gov.br (Brazil, 2011); dados.gov.pt (Portugal); data.belgium.be (Belgium); data.gc.ca (Canada, 2011); data.gouv.fr (France, 2011); data.gov (U.S.); data.gov.au (Australia, 2011); data.gov.in (India, 2012); data.gov.it (Italy, 2011); data.gov.uk (U.K.); data.govt.nz (New Zealand); data.gv.at (Austria); data.norge.no (Norway); data.overheid.nl (Netherlands); daten-deutschland.de (Germany); datos.gob.cl (Chile); datos.gob.es (Spain, 2011); geodata.gov.gr (Greece, 2010); open-data.europa.eu (EC Data Portal); opendata.go.ke (Kenya, 2011); opengovdata.ru (Russia, 2010); satupemerintah.net (Indonesia).
23 Organisations Promoting Open Data: freeourdata.org.uk (Open Data in UK); Open Data Institute; Open Knowledge Foundation; Scholarly Publishing and Academic Resources Coalition; Sunlight Foundation; LinkedScience.org; Talis; w3.org; Blue Obelisk; Freebase; Factual; Information Retrieval Facility; Socrata; IDRC; OMG standard; CiteSeer; Knoema; Ecodesk.
24 Role of Librarians: Facilitating data reusability by making data interoperable and discoverable, following standards and protocols.
25 Features of Open Data Repositories: Metadata: specify who is the owner, creator etc.; License the data to waive your rights and to facilitate bulk download. Open Data Technology Tools: Cloud platform: storage and retrieval; Mining: automated data extraction; Ontology: indexing and linking.
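To make the metadata point concrete, here is a minimal sketch of a descriptive record for one dataset, stating owner, creator and licence. The field names and sample values are assumptions chosen for illustration; they do not follow any particular metadata schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetMetadata:
    """Minimal descriptive record for one dataset (illustrative field names)."""
    title: str
    creator: str
    owner: str
    license: str
    keywords: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialise with sorted keys so the output is stable and easy to compare.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical sample record.
record = DatasetMetadata(
    title="Daily rainfall measurements (sample)",
    creator="A. Researcher",
    owner="Example Institute",
    license="PDDL",
    keywords=["weather", "rainfall"],
)

print(record.to_json())
```

Keeping the record machine-readable (JSON here) is what lets mining and indexing tools pick it up without manual intervention.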
26 Four Facets of Big Data: Data Curation (LIS professional); Technology for Big Data (computer professional); Data Analytics & Visualisation (statistician); Above all: domain expert.
27 Data Curation: Data curation makes data processable by humans and machines.
28 Typical LIS Professional Work: Acquisition; Organisation; Classification (Ontology); Cataloguing (Metadata); Storage; Retrieval; Dissemination; Weeding out; Long Term Preservation; Publishing. In the context of the Internet.
29 Can't we treat Big Data similarly? Acquisition of Big Data: structured & unstructured, Linked Open Data; Organisation: metadata & ontologies; Retrieval & Dissemination: publishing Big Data on computer clusters; Long Term Preservation; Data Driven Decision Support System: supporting data analytics, visualisation.
30 Digital Curation: Collecting verifiable digital assets; Providing digital asset search and retrieval; Certification of the trustworthiness and integrity of the collection content; Semantic and ontological continuity and comparability of the collection content; Use of open standards (formats) for long-term preservation and future proofing by migration of data.
31 Three V's of Big Data: Volume (terabytes to zettabytes); Variety (structured & unstructured); Velocity (batch processing to streaming data). Further V's often added: Veracity (biases, noise and abnormality); Validity (trustworthiness); Volatility (current or obsolete).
32 Volume: Data generated by humans; Data generated by machines: sensors, satellites, networks.
33 Technology: Data repositories are much larger than OA repositories; Cloud computing is a good solution (e.g. AWS); Semantic Web & Linked Data (linking data through various methods).
34 Licences: Creative Commons licences (apart from CCZero), GPL, BSD, etc. are NOT quite appropriate as open data licences.
35 Open Data Licences: Open Data Commons Public Domain Dedication and Licence (PDDL): dedicate to the public domain (all rights waived); Open Data Commons Attribution License: attribution for data(bases); Open Data Commons Open Database License (ODbL): Attribution-ShareAlike for data(bases); Creative Commons CCZero.
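The four licences differ only in whether they ask for attribution and share-alike, so they can be summarised in a small lookup table. The sketch below is a simplified reading of the slide rather than a statement of the full licence texts; the short keys (such as "ODC-By") and the wording of the obligations are choices made for this example.

```python
from typing import NamedTuple


class Terms(NamedTuple):
    attribution: bool  # must reusers credit the source?
    share_alike: bool  # must derived databases keep the same licence?


# Simplified summary of the licences listed on the slide.
LICENCES = {
    "PDDL": Terms(attribution=False, share_alike=False),
    "ODC-By": Terms(attribution=True, share_alike=False),
    "ODbL": Terms(attribution=True, share_alike=True),
    "CC0": Terms(attribution=False, share_alike=False),
}


def obligations(licence: str) -> list:
    """Return the reuse obligations a licence places on the data user."""
    terms = LICENCES[licence]
    duties = []
    if terms.attribution:
        duties.append("attribute the source")
    if terms.share_alike:
        duties.append("share derived databases under the same licence")
    return duties


for name in LICENCES:
    print(name, obligations(name) or "no obligations")
```

A repository could store the licence key in each dataset's metadata and show the matching obligations next to the download link.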
36 Open Source Software for Data Repositories
37 CKAN: Comprehensive Knowledge Archive Network (http://ckan.org): Open source data portal software for national and sub-national governments, companies, institutes and organisations to publish their data; Used by the British Library, the journal Nature, and many national and local governments in the UK, the Netherlands, Brazil and the US; Written in Python and JavaScript, with a PostgreSQL database and Solr for search; Features include data harvesting, faceted search, interfaces to data and metadata, and federation.
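CKAN's interface to data and metadata includes an HTTP Action API, whose `package_search` action searches datasets. The sketch below only builds the request URL and reads a response that has already been decoded from JSON; the portal address is a placeholder, and the sample response is made up to mirror the general shape CKAN returns.

```python
from urllib.parse import urlencode

# Hypothetical portal address; replace with a real CKAN site.
BASE_URL = "https://ckan.example.org"


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build the URL of a package_search call on CKAN's Action API (version 3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def dataset_titles(response: dict) -> list:
    """Pull dataset titles out of a decoded package_search response."""
    if not response.get("success"):
        return []
    return [d.get("title", "") for d in response["result"]["results"]]


# Made-up response for illustration only.
sample = {
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "rainfall-2015", "title": "Rainfall 2015"}],
    },
}

print(package_search_url(BASE_URL, "rainfall", rows=5))
print(dataset_titles(sample))
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the body with `json.loads` would give a dictionary of the kind passed to `dataset_titles`.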
38 Dryad: Based on DSpace (submission and review workflow with an embargo feature to postpone publication, DOI); Supports long-term preservation; Partnership with DataONE.
39 Dataverse: Collaborative work of the Institute for Quantitative Social Science (IQSS), Harvard Library, and Harvard University Information Technology (HUIT); Researchers can host data and make it discoverable to other researchers; Journal publishers can encourage authors to upload data along with their article, thereby ensuring a link between the article and the data on which it is based; Organisations can set up institutional data repositories using Dataverse; As of 2016, Dataverse has about 19 installations with more than 61,000 datasets; It has many APIs that facilitate uploading, searching and accessing data.
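As one illustration of the searching APIs, Dataverse offers a Search API at `/api/search` that accepts `q`, `type`, `start` and `per_page` parameters. The sketch below generates the page-by-page request URLs needed to walk through a result set; the installation address, the query and the result count are assumptions made for the example, and no request is actually sent.

```python
from urllib.parse import urlencode


def dataverse_search_urls(base_url: str, query: str, total: int, per_page: int = 10):
    """Yield one Search API URL per page needed to cover `total` results."""
    for start in range(0, total, per_page):
        params = urlencode(
            {"q": query, "type": "dataset", "start": start, "per_page": per_page}
        )
        yield f"{base_url}/api/search?{params}"


# Hypothetical installation and result count.
urls = list(dataverse_search_urls("https://dataverse.example.edu", "census", total=25))

for url in urls:
    print(url)
```

In practice the total would come from the first response, after which the remaining pages can be requested in the same way.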
40 DRTC Projects: Living Knowledge (EU-funded project, completed); ITPAR: India-Trento Program for Advanced Research (Semantic Web); AgInfra (EU-funded project): Data Infrastructure and Services to Empower Agricultural Research Communities.
41 DRTC Activities on Big Data: Conducted a 2-week international workshop with ICSU/CODATA in March 2015; Co-chairing a session on Agricultural Data at RDA (Research Data Alliance) conferences; Preparing guidelines for UNESCO on 'Publishing Data'; Will be conducting an international workshop on 'Data Repository Software' (Feb/March 2016); Developing a metadata schema for data called 'PROMIS' (Processing Metadata Initiative Schema); Test sites of CKAN, Dataverse and Dryad software.
42 Data Science: Master of Information & Data Science: UC Berkeley; Post Graduate Diploma in Business Analytics: Indian Statistical Institute (Kolkata) + Indian Institute of Management (Kolkata) + Indian Institute of Technology (Kharagpur); DRTC is incorporating Data Science in its MS (LIS).
43 Conclusion: Researchers generate data from experiments and surveys; Researchers may also use existing data for further analytics and interpretation; They may not publish all the data they generate or collect; Their data should be made available in ETD repositories for further examination or reuse.
44 Thank You