Tony Hey Corporate Vice President Microsoft Research

Author: Andrew Flynn

1 eScience, Semantic Computing and the Cloud: Towards a Smart Cyberinfrastructure
Tony Hey, Corporate Vice President, Microsoft Research

2 eScience

3 A Data Deluge in Science
Data collection: sensor networks, satellite surveys, high-throughput laboratory instruments, observation devices, supercomputers, LHC …
Data processing, analysis, visualization: legacy codes, workflows, data mining, indexing, searching, graphics …
Archiving: digital repositories, libraries, preservation, …
SensorMap. Functionality: map navigation. Data: sensor-generated temperature, video camera feed, traffic feeds, etc.
Scientific visualizations. NSF Cyberinfrastructure report, March 2007.

4 Emergence of a New Research Paradigm?
Thousand years ago: Experimental Science. Description of natural phenomena.
Last few hundred years: Theoretical Science. Newton’s Laws, Maxwell’s Equations…
Last few decades: Computational Science. Simulation of complex phenomena.
Today: eScience or Data-centric Science. Unify theory, experiment, and simulation, using data exploration and data mining on data captured by instruments, data generated by simulations, and data generated by sensor networks.
Scientists are overwhelmed with data. Computer Science and IT companies have technologies that will help.
(With thanks to Jim Gray)

5 Today
Web users...
Generate content on the Web: blogs, wikis, podcasts, videocasts, etc.
Form communities: social networks, virtual worlds.
Interact, collaborate, share: instant messaging, web forums, content sites.
Consume information and services: search, annotate, syndicate.
Scientists...
Annotate, share, discover data: custom, standalone tools.
Conferences, journals: the publication process is long, with subscriptions and discoverability issues.
Collaborate on projects, exchange ideas: F2F meetings, video-conferences.
Use workflow tools to compose services: domain-specific services/tools.

6 Open Collaboration
“In order to help catalyze and facilitate the growth of advanced CI, a critical component is the adoption of open access policy for data, publications and software.” NSF Advisory Committee on Cyberinfrastructure (ACCI)
Open access, open source, open data.
Microsoft Interoperability Principles: Open Connections to Microsoft Products; Support for Standards; Data Portability; Open Engagement.
© 2007 Microsoft Corporation. All rights reserved.

7 Today…
Computers are great tools for storing, computing, managing, and indexing huge amounts of data.
For example, Google and Microsoft both have copies of the Web for indexing purposes.

8 Tomorrow…
Computers will still be great tools for storing, computing, managing, and indexing huge amounts of data.
We would like computers to also help with the automatic acquisition, discovery, aggregation, organization, correlation, analysis, interpretation, and inference of the world’s information.

9 Semantic Computing

10 Need for Semantic Computing?
Semantic computing combines concepts and technologies that enable data modeling, capture relationships, allow communities to define ontologies, and exploit machine learning.
It will empower computers to reason about the data.
Figure labels: Data, Information, Knowledge; current technologies; possibilities for innovation.

11 Semantic Computing
Some efforts are driven by the traditional “knowledge engineering” community:
engaged in building well-controlled ontologies;
important for domain-specific vocabularies with data formats and relationships specific to a community;
the model does not easily scale to the Internet.
Some efforts are driven by the Web 2.0 community:
focus on the pervasiveness of Web protocols/standards;
emphasis on microformats (small, flexible, embeddable structures);
exploit evolving and ever-expanding vocabularies such as folksonomies and tag clouds.

12 Semantic Web as the platform?
Mark Butler (2003), Is the semantic web hype? Hewlett Packard Laboratories presentation at MMU.

13 Cloud Computing

14 Rationale for Cloud Computing
Outsourcing of IT infrastructure: minimize costs.
Large cloud/utility computing providers can have relatively small ownership and operational costs because of their huge scale of deployment and automation.
Small businesses gain access to large-scale resources whose acquisition, operation, and maintenance costs would otherwise have been prohibitive.

15 Example: Amazon Web Services
Simple Storage Service (S3): storage for the Internet. Simple Web Services interface to store and retrieve any amount of data from anywhere on the Web.
SimpleDB: structured data.
Simple Queue Service: scalable message queuing.
Elastic Compute Cloud (EC2): compute on demand, virtualization, integration with S3.
Standards-based REST and SOAP Web Service interfaces.
Gene Analysis Virtual Lab: experiment by Jong Youl Choi at Indiana (Beth Plale and Sun Kim).

16

17 Microsoft Cloud Services
Exchange, Live ID, Xbox Live, SQL Server Data Services, Office Live Workspaces, Windows Live, Live Mesh, .NET Online. Many more coming.

18 eScience and Cloud Computing in action

19 The SkyServer Project
Jim Gray (MSR) and Alex Szalay (JHU)
The Sloan Digital Sky Survey (SDSS), the “Cosmic Genome Project”: 5-color images of ¼ of the sky; pictures of 300 million celestial objects; distances to the closest 1 million galaxies.
Built the public archive for the SDSS.
An interesting challenge in digital publishing: have to publish first in order to analyze.

20 Public Use of the SkyServer
Poster child for 21st-century data publishing.
380 million web hits in 6 years; 930,000 distinct users vs 10,000 astronomers; 1600 refereed papers!
Delivered 50,000 hours of lectures to high schools; delivered 100B rows of data.
World’s most used astronomy facility for the last 2 years.

21 GalaxyZoo
Goal of 1 million visual galaxy classifications by the public.
Enormous publicity (CNN, Times, Washington Post, BBC).
100,000 people participating, blogs, poems …
Application is like Amazon’s ‘Mechanical Turk’ Web Service that allows users to search for photographs …

22 Hanny’s Voorwerp

23 World Wide Telescope
Seamless Rich Social Media Virtual Sky. Web application for science and education.
Participants: Alyssa Goodman, Harvard University; Alex Szalay, Johns Hopkins University; Curtis Wong and Jonathan Fay, Microsoft Research.
Goals: integration of data sets and one-click contextual access; easy access and use.
In a little more than two months, a million users have downloaded, installed and launched the application (2,206,497 unique sessions).
We invite you to experience it!

24 Berkeley Water Center: Understanding regional hydrology
Project Organization: Jim Hunt and Dennis Baldocchi, UC Berkeley; Deb Agarwal, Lawrence Berkeley Laboratory; Catharine van Ingen, MSR.
Systems:
bwc.berkeley.edu: Web site, external datacube access
eddy.lbl.gov: database archive, data ingest and cube development
Xena.lbl.gov: heavy database queries and cube development
SharePoint: secured data file download
tas.lbl.gov: DC for flux domain
wally.lbl.gov: MatLab and ArcGIS for key scientists
hagar.lbl.gov: backup gateway
Goals: enable rapid scientific data browsing for availability and applicability; enable environmental science via data synthesis from multiple sources.
Proof Points:
Environmental Data Server (SharePoint) serves 921 site-years of carbon-climate field data from 160+ field teams to 60+ paper-writing teams (800M values).
Multiple projects now leverage the same SQL Server database and data cube approach.
CUAHSI consortium: 100 universities collaborating on hydrology.

25 Carbo-Climate Synthesis (BWC, Dennis Baldocchi et al.)
What is the role of photosynthesis in global warming?
Measurements of CO2 in the atmosphere show 16-20% less than emissions estimates predict. The difference is due to either plant or ocean absorption.
Communal field science: each investigator acts independently. Cross-site studies and integration with modeling are increasingly important.
SharePoint site: 921 site-years of data from 240 sites around the world; 80+ site-years now being added; 60+ paper-writing teams. The American data subset is public and served more widely.
Summary data products greatly simplify initial data discovery.
The plot compares previous values reported in the literature with values in the new dataset. These sorts of comparisons are helping the scientists trust the data as well as get their heads around what “accuracy” we’re talking about given the uniformity of processing (different from what the scientist would do independently).
The Excel sheet is one of our summary data products: find your site-year at a glance, with color coding for gap-fill quality, % of data actually present….

26 Mashup of Ameriflux Sites

27

28 Computational Biology Web Tools
Better vaccine design through improved understanding of HIV evolution.
Project Organization: Bruce Walker & Zabrina Brumme, Mass General; Philip Goulder, Oxford; Richard Harrigan, University of British Columbia; David Heckerman, Jonathan Carlson and Carl Kadie, MSR.
Goals: use machine learning and visualization tools developed at Microsoft, which require HPC, to build maps of within-individual evolution of the HIV virus.
Proof Points:
Discovered epitope decoys that could have predicted the recent failure of the Merck vaccine.
Patent filed on a new method for learning graphical models from data.
Algorithms and medical results published in Science and Nature Medicine.
MSR Computational Biology Tools published (source on CodePlex).

29 Supporting researchers worldwide: Adding Semantics to Software Tools

30 Research Pipeline
Stages: Data Acquisition & Modeling; Collaboration; Analysis; Disseminate & Share; Archiving.
Data Acquisition and Modeling: data capture from source, cleaning, storage, etc. Tools: SQL Server, SSIS, Windows WF.
Support Collaboration: allow researchers to work together, share context, facilitate interactions. Tools: SharePoint Server, OneNote 2007 (shared).
Data Analysis, Modeling, and Visualization: mining techniques (OLAP, cubes) and visual analytics. Tools: SQL Analysis Services, BI, Excel, Optima, SILK (MSR-A).
Disseminate and Share Research Outputs: publish, present, blog, review and rate. Tools: Word, PowerPoint.
Archiving: published literature, reference data, curated data, etc. Tools: SQL Server.
Microsoft has technologies that can offer end-to-end support.

31 Chemistry Drawing for Office
Peter Murray-Rust, Univ. of Cambridge; Murray Sargent, Office; Geraldine Wade, Advanced Reading Technologies.
Goals:
Support students/researchers in simple chemistry structure authoring/editing.
Enable an ecosystem of tools around the lifecycle of chemistry-related scholarly works.
Support the Chemistry Markup Language.
Proof-of-concept plug-in.
Execution:
MSR developer to work on the proof of concept.
Post-doc in Cambridge to use the plug-in, give feedback, and move their chemistry tools to .NET and Office.
Advanced Reading Technologies to create the necessary glyphs.

32 A “Chemistry Zone” in a Word document and the CML representation (in pseudo-XML) stored inside the OOXML document

33 Semantic Annotations in Word
Phil Bourne and Lynn Fink, UCSD.
Goals:
Semantic mark-up using ontologies and controlled vocabularies.
Facilitate/automate referencing to PDB (and other resources) from the manuscript.
Conversion of the manuscript to NLM DTD for direct submission to the publisher.
Scenario:
Authors do not need to be aware of the use of semantic technologies.
A domain-specific ontology is downloaded and made available from within Microsoft Word 2007.
Authors can record their intention, the meaning of the terms they use, based on their community’s agreed vocabulary.
Attribution: Richard Cyganiak

34 Semantic annotations in Word

35 Research Output Repository Famulus
Diagram labels: UIs, Desktop Tools, Syndication, Interop, Search.
A platform for building services and tools for research output repositories:
papers, videos, presentations, lectures, references, data, code, etc.;
relationships between stored entities.
Goals:
Support the MSR publishing and dissemination platform for all researcher outputs.
Enable a tools and services ecosystem for “research output” repositories on MS technologies.
Execution:
Support EPrints and DSpace front ends.
Deployment within MSR early Q2; release to the community late Q2.
Built on SQL Server Entity Framework.

36 Research Output Repository Platform
A Semantic Computing platform: a hybrid between a relational database and a triple store.
Triple stores: evolution friendly; poor performance; no need to model everything in advance; semantic interpretation at the application level.
Relational schema: evolution not so easy; great opportunities for optimization; model everything in advance.
Research Output Repository Platform: maintain a balance.
Try to model the frequently used entities in our app domain.
Try to capture the frequently used relationships.
Allow for extensibility (relationships, properties).
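The balance described on this slide can be illustrated with a minimal sketch, assuming SQLite as a stand-in database. The table names, the `paper:<id>` key format, and the predicate names are hypothetical and chosen only for illustration; the slides do not show the platform's actual schema. One frequently used entity is modeled as a relational table, and open-ended relationships go into a generic triple table.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one modeled table and one triple table."""
    db = sqlite3.connect(":memory:")
    # Relational side: a frequently used entity, modeled in advance.
    db.execute(
        "CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER)"
    )
    # Triple side: relationships and properties that were not modeled in advance.
    db.execute(
        "CREATE TABLE triple (subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL)"
    )
    db.execute("CREATE INDEX triple_sp ON triple (subject, predicate)")
    return db


def add_paper(db, title, year):
    """Insert a modeled entity and return its key (hypothetical 'paper:<id>' format)."""
    cur = db.execute("INSERT INTO paper (title, year) VALUES (?, ?)", (title, year))
    return f"paper:{cur.lastrowid}"


def relate(db, subject, predicate, obj):
    """Record an arbitrary relationship without changing the schema."""
    db.execute("INSERT INTO triple VALUES (?, ?, ?)", (subject, predicate, obj))


def related(db, subject, predicate):
    """Return all objects linked from subject by predicate, sorted."""
    rows = db.execute(
        "SELECT object FROM triple WHERE subject = ? AND predicate = ? ORDER BY object",
        (subject, predicate),
    )
    return [row[0] for row in rows]


db = open_store()
paper_key = add_paper(db, "An example paper", 2008)
relate(db, paper_key, "presented_at", "lecture:1")
relate(db, paper_key, "has_representation", "file:slides.pdf")
print(related(db, paper_key, "presented_at"))
```

In this sketch a new kind of relationship needs no schema change, while queries over the modeled `paper` table can still use ordinary indexes and typed columns.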

37 Research Output Repository Platform
Example entity graph.
Entities: PDF file; Lecture on 2/19/2008; PowerPoint presentation; tony; Elizabeth, Sebastien, Matthew, Norman, Brian, Sarah, George, Roy.
Relationships: contains; is representation of; authored by; organized by; presented by.

38 .NET Map Network Analysis Visualization
Project Organization: Marc Smith, Senior Research Sociologist (MSR).
Goals: research in the visualization of interaction networks; support for directed graphs; relationship analysis.
Proof Points: standalone tools on Windows; available as an Excel 2007 plugin.

39 eScience and Semantic Computing meet the Cloud: The cyberinfrastructure for the next generation of researchers

40 The Future: Software plus Services for Science?
Expect scientific research environments to follow similar trends to the commercial sector: leverage computing and data storage in the cloud.
Scientists are already experimenting with Amazon S3 and EC2 services, with mixed results.
For many of the same reasons:
no resource sharing across different research labs;
high storage costs;
low resource utilization;
excess capacity;
high costs of reliably keeping machines up-to-date;
need less support for developers, system operators.

41 Trident – Scientific Workbench
Workflow for Ocean Observatories, part of an “oceanographer’s workbench”.
Jim Gray

42 Trident Scientific Workflow Workbench
Univ. of Washington and Monterey Bay Aquarium Research Institute.
Scientific workflow workbench to automate the data processing pipelines of the world’s first plate-scale undersea observatory.
Goals:
From raw data to usable data products, focusing on cleaning, analysis, re-gridding, interpolation.
Support real-time, on-demand visualizations.
Custom activities and workflow libraries for authoring.
Visual programming accessible via a browser.
Trial Cloud Services for science.
Proof Points:
A scientific workflow workbench for a number of science projects, reusable workflows, automatic provenance capture.
Demonstrate scientific use of Windows WF, HPCS, SQL Server and Cloud Service SSDS.
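The idea of a reusable workflow with automatic provenance capture can be sketched in a few lines of Python. This is an illustration only and does not use Trident or Windows WF: the `clean` and `interpolate` activities, the `run_workflow` helper, and the provenance record fields are all hypothetical names invented for the example.

```python
from datetime import datetime, timezone


def clean(values):
    """Drop missing readings, represented here as None."""
    return [v for v in values if v is not None]


def interpolate(values):
    """Insert the midpoint between each pair of neighbouring readings."""
    out = []
    for a, b in zip(values, values[1:]):
        out.extend([a, (a + b) / 2])
    out.extend(values[-1:])  # keep the final reading (no-op for empty input)
    return out


def run_workflow(data, activities):
    """Apply each activity in order and record what it did."""
    provenance = []
    for activity in activities:
        result = activity(data)
        provenance.append({
            "activity": activity.__name__,
            "inputs": len(data),
            "outputs": len(result),
            "finished": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance


raw = [10.0, None, 12.0, 16.0]
product, log = run_workflow(raw, [clean, interpolate])
for entry in log:
    print(entry["activity"], entry["inputs"], "->", entry["outputs"])
```

Because provenance is recorded by the runner rather than by each activity, any new activity added to the list is logged without extra code, which is the property the slide calls automatic provenance capture.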

43 Towards a smart cyberinfrastructure?
Collective intelligence:
If last.fm can recommend what song to broadcast to me based on what my friends are listening to, shouldn’t the cyberinfrastructure of the future recommend articles of potential interest based on what the experts in the field that I respect are reading?
Examples are emerging but the process is presently manual (Connotea, BioMedCentral Faculty of …).
Semantic Computing:
Automatic correlation of scientific data.
Smart composition of services and functionality.
Cloud computing to aggregate, process, analyze and visualize data.
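A minimal sketch of the collective-intelligence idea on this slide: rank articles by how many experts a researcher trusts are reading them. The function name, the data layout, and the expert and article identifiers are assumptions made for illustration; none of the services named on the slide is being modeled here.

```python
from collections import Counter


def recommend(trusted_experts, reading_lists, already_read, top_n=3):
    """Rank unread articles by how many trusted experts are reading them."""
    counts = Counter()
    for expert in trusted_experts:
        # set() so an article listed twice by one expert counts once
        for article in set(reading_lists.get(expert, ())):
            if article not in already_read:
                counts[article] += 1
    # Most-read first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [article for article, _ in ranked[:top_n]]


reading_lists = {
    "expert_a": ["article-1", "article-2"],
    "expert_b": ["article-2", "article-3"],
    "expert_c": ["article-4"],
}
suggestions = recommend(["expert_a", "expert_b"], reading_lists, {"article-1"})
print(suggestions)
```

Only the experts the researcher has chosen contribute to the ranking, so `expert_c`'s reading list is ignored in this call.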

44 A world where all data is linked…
Data/information is inter-connected through machine-interpretable information (e.g. paper X is about star Y).
Social networks are a special case of ‘data meshes’.
Important/key considerations:
Formats or “well-known” representations of data/information.
Pervasive access protocols are key (e.g. HTTP).
Data/information is uniquely identified (e.g. URIs).
Links/associations between data/information.
Attribution: Richard Cyganiak
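The “paper X is about star Y” example can be sketched as a tiny in-memory graph in which every item and every link type is identified by a URI. The `LinkedData` class and all `http://example.org/...` URIs are hypothetical placeholders for illustration, not an existing vocabulary or library.

```python
from collections import defaultdict


class LinkedData:
    """A tiny in-memory graph of (subject, predicate, object) links."""

    def __init__(self):
        self._out = defaultdict(list)

    def link(self, subject, predicate, obj):
        """Record that subject is connected to obj by predicate."""
        self._out[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Follow links of one type outward from a subject."""
        return [o for p, o in self._out.get(subject, []) if p == predicate]

    def subjects(self, predicate, obj):
        """Find everything that links to obj with the given predicate."""
        return sorted(s for s, links in self._out.items() if (predicate, obj) in links)


# Hypothetical URIs, used only to show unique identification.
ABOUT = "http://example.org/vocab/about"
AUTHOR = "http://example.org/vocab/author"
STAR_Y = "http://example.org/star/Y"

graph = LinkedData()
graph.link("http://example.org/paper/X", ABOUT, STAR_Y)
graph.link("http://example.org/paper/Z", ABOUT, STAR_Y)
graph.link("http://example.org/paper/X", AUTHOR, "http://example.org/person/A")

papers_about_y = graph.subjects(ABOUT, STAR_Y)
print(papers_about_y)
```

Because both the data items and the link types are plain URIs, two independently published datasets that use the same identifiers could be merged by simply combining their links.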

45 …and stored/processed/analyzed in the cloud
Vision of Future Research Environment with both Software + Services.
Diagram labels: visualization and analysis services; scholarly communications; domain-specific services; search; books; citations; blogs & social networking; reference management; instant messaging; identity; project management; mail; notification; document store; storage/data services; knowledge management; compute services; virtualization; knowledge discovery.
The Microsoft Technical Computing mission to reduce time to scientific insights is exemplified by the June 13, 2007 release of a set of four free software tools designed to advance AIDS vaccine research. The code for the tools is available now via CodePlex, an online portal created by Microsoft in 2006 to foster collaborative software development projects and host shared source code. Microsoft researchers hope that the tools will help the worldwide scientific community take new strides toward an AIDS vaccine.

46 Acknowledgements
The ideas presented here were developed with input from many colleagues in the community and at Microsoft Research.
Thanks are due to David De Roure, Jeremy Frey, Carole Goble, Peter Murray-Rust, Alan Rector, Nigel Shadbolt and Alex Szalay.
And special thanks to Roger Barga, Savas Parastatidis and Evelyne Viegas at Microsoft Research who have tried to educate me …
See … for some more details of Microsoft’s activities in Scientific and Technical Computing.

47