Big Data Analytics Learning Lab 1 UN Data Innovation Lab 4

Author: Daniel Hutchinson

1 Big Data Analytics Learning Lab 1: UN Data Innovation Lab 4, University of Nairobi, March 13-14, 2017

2 Agenda
- Introduction to Big Data: What it is and why it matters
- Big Data Analytics: Putting Big Data to work
- Creating a Big Data-Enabled Organization: Bringing Big Data Analytics home
- Case Study: 'Nowcasting' economic activity in Colombia

3 Section 01: Introduction to Big Data (What it is and Why it Matters)

4 What is Big Data?
"Big Data" exceeds the capacity of traditional analytics and information management paradigms across what are known as the 4 V's: Volume, Variety, Velocity, and Veracity.
- Volume (Scale of Data): Reflects the size of a data set. New information is generated daily and in some cases hourly, creating data sets that are measured in terabytes and petabytes.
- Variety (Different Forms of Data): Represents the diversity of the data. Data sets vary by type (e.g., social networking, media, text) and in how well they are structured.
- Velocity (Analysis of Streaming Data): The speed at which data is generated and used. New data is created every second, and in some cases it may need to be analyzed just as quickly.
- Veracity (Uncertainty of Data): With exponential increases of data from unfiltered and constantly flowing sources, data quality often suffers, and new methods must find ways to "sift" through junk to find meaning.
The term "Big Data" encompasses structured, semi-structured, and unstructured information created inside an organization or available for sale by commercial data aggregators and for free from governments: from demographic and psychographic information about consumers to product reviews and commentary; blogs; content on social media sites; and data streamed 24/7 from mobile devices, sensors, and tech-enabled devices.

5 The Promise of Big Data
Even more important than its definition is what Big Data promises to achieve: intelligence in the moment. Effectively used, Big Data can transform data into insights and intelligence, delivered when and where they are needed to make and implement better strategic and operational decisions. For the vast majority of organizations, having access to the right information at the right time and place (to interact with customers, build new products, improve customer service, and more) is not yet a reality. Limitations in skills, storage costs, tools, connectivity, quality, and availability have made the goal unobtainable, until now.
Traditional techniques & issues vs. Big Data differentiators:
- Veracity. Traditional: does not account for biases, noise, and abnormality in data; data is stored and mined whether or not it is meaningful to the problem being analyzed. Big Data: keeps data clean, with processes to keep 'dirty data' from accumulating in your systems.
- Velocity. Traditional: no real-time analysis. Big Data: in real time, dynamically analyzes data, consistently integrates new information, and auto-deletes unwanted data to ensure optimal storage.
- Variety. Traditional: compatibility issues; advanced analytics struggle with non-numerical data. Big Data: frameworks accommodate varying data types and data models, enabling insightful analysis with very few parameters.
- Volume. Traditional: analysis is limited to small data sets; analyzing large data sets means high costs and high memory use. Big Data: scalable for huge amounts of multi-sourced data; facilitates massively parallel processing; low-cost data storage.

6 Types of Big Data
Variety is the most distinctive aspect of Big Data. New technologies and new types of data have driven much of the evolution around Big Data.
- Social Media: Twitter, LinkedIn, Facebook, Tumblr, blogs, SlideShare, YouTube, Google+, Instagram, Flickr, Pinterest, Vimeo, WordPress, IM, RSS, reviews, Chatter, Jive, Yammer, etc.
- Sensor Data: Medical devices, smart electric meters, car sensors, road cameras, satellites, traffic recording devices, processors found within vehicles, video games, cable boxes, assembly lines, office buildings, cell towers, jet engines, air conditioning units, refrigerators, trucks, farm machinery, etc.
- Docs: XLS, PDF, CSV, Word, PPT, HTML, HTML5, plain text, XML, JSON, etc.
- Media: Images, videos, audio, Flash, live streams, podcasts, etc.
- Public Web: Government, weather, competitive, traffic, regulatory, compliance, health care services, economic, census, public finance, stock, OSINT, the World Bank, SEC/EDGAR, Wikipedia, IMDb, etc.
- Archive: Archives of scanned documents, statements, insurance forms, medical records, and customer correspondence; paper archives; and print stream files that contain original systems of record between organizations and their customers.
- Machine Log Data: Event logs, server data, application logs, business process logs, audit logs, call detail records (CDRs), mobile location, mobile app usage, clickstream data, etc.
- Business Apps: Project management, marketing automation, productivity, CRM, ERP, content management systems, HR, storage, talent management, procurement, expense management, Google Docs, intranets, portals, etc.

7 "Single sources of data are no longer sufficient to cope with the increasingly complicated problems in many policy arenas."(1)
Big data "is not notable because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked."(2)
(1) M. Milakovich, "Anticipatory Government: Integrating Big Data for Smaller Government," in Oxford Internet Institute "Internet, Politics, Policy 2012" Conference, Oxford, 2012.
(2) D. Boyd and K. Crawford, "Six Provocations for Big Data," in A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011.

8 Why is Big Data valuable?
We have identified 5 key areas where Big Data is uniquely valuable:
- Accessibility to Data: Enhanced visibility of relevant information and better transparency into massive amounts of data; improved reporting to stakeholders.
- Decision Making: Next-generation analytics can enable automated decision making (inventory management, financial risk assessment, sensor data management, machinery tuning).
- Marketing Trends: Segmentation of populations to customize offerings and marketing campaigns (consumer goods, retail, social, clinical data, etc.).
- Performance Improvement: Exploration for, and discovery of, new needs can drive organizations to fine-tune for optimal performance and efficiency (employee data).
- New Business Models/Services: Discovery of trends will lead organizations to form new business models and adapt by creating new service offerings for their customers; intermediary companies with Big Data expertise will provide analytics to third parties.

9 $1 Trillion One study estimated the potential value of big data in the U.S. health care, European public sector administration, global personal location data, U.S. retail, and global manufacturing to be over $1 trillion U.S. dollars per year.1 $41 Billion Another study estimated the value of big data in the areas of customer intelligence, supply chain intelligence, performance improvements, fraud detection, and quality and risk management to be $41 billion per year in the UK alone.2 (1) J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey & Company, 2011. (2) Centre for Economics and Business Research, “Data equity: unlocking the value of big data,” SAS, 2012.

10 Not to be confused with…
Big Data is often confused with other terms that also represent, in their own ways, fundamental shifts in the way we collect, store, and use data. In some cases, these categories may overlap.
- Big Data: Structured, semi-structured, or unstructured information distinguished by one or more of the four "V"s: Veracity, Velocity, Variety, Volume.
- Open Data: Public, freely available data.
- Crowdsourced Data: Data collected through contributions from a large number of individuals.
Overlaps:
- Crowdsourced Data + Big Data: Credit card transaction data
- Big Data + Open Data: Government-collected weather data
- Big Data + Crowdsourced Data + Open Data: Google Search trends (Google.com/Trends)
Graphic and definitions based on "Big Data in Action for Development," World Bank, worldbank.org.

11 Section 02: Big Data Analytics (Putting Big Data to Work)

12 It's not just about the data…
It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and 'Big Data Analytics'. Big Data refers to the data only; Big Data Analytics refers to methods of using Big Data to generate insight. Examples:
1. Machine Learning/Deep Learning: Leveraging a computer's ability to learn without being explicitly programmed to solve business problems.
2. IoT (Internet of Things) & Sensor Analytics: Understanding value drivers from the ever-growing network of connected physical objects and the communication between them.
3. Modeling Willingness-to-Pay: Mining product reviews to estimate willingness-to-pay for product features.
4. Natural Language Processing: Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics.
5. Analyzing at Scale / Creating a Data Lake: Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data.
6. Streaming Consumer Behavior: Mining social data in real time to understand when and where consumers are making choices.
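As a minimal illustration of method 1 above, "learning without being explicitly programmed": the sketch below fits a nearest-centroid classifier from labeled examples rather than hand-coded rules. The data points, labels, and engagement framing are invented for illustration; this is not any specific tool used in the lab.

```python
# Minimal sketch: a classifier "learns" its decision rule from labeled
# examples instead of being explicitly programmed with it.
# All data and labels below are invented for illustration.

def fit_centroids(points, labels):
    """Compute the mean (centroid) of the points in each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def predict(centroids, point):
    """Assign a point to the class with the nearest centroid."""
    px, py = point
    return min(centroids,
               key=lambda c: (centroids[c][0] - px) ** 2 +
                             (centroids[c][1] - py) ** 2)

# Toy training data: two behavioral groups with known labels.
train = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),   # "low" engagement
         (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]   # "high" engagement
labels = ["low", "low", "low", "high", "high", "high"]

model = fit_centroids(train, labels)
print(predict(model, (0.9, 1.1)))  # → low
print(predict(model, (5.1, 5.0)))  # → high
```

Real deployments would use a library such as scikit-learn, but the principle is the same: the rule comes from the data, not the programmer.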

13 …It's also about what, how, and why you use it
Big Data Analytics, the process of harnessing Big Data to yield actionable insights, is a combination of five key elements:
- Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.
- Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulation modeling, artificial intelligence, etc.
- Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.
- Technology: Storing, managing, and using Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and cloud computing.
- Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.
Speaker notes: This leads us to the main part of the talk, Big Data and Analytics. Before we dive in, it is worth clarifying what we mean by 'Big Data and Analytics'. There have been a number of definitions, and usage of the term has been changing over the past couple of years. To us, Big Data is about five key aspects. Decisions: how are decisions made today, and how can we make them more effective and efficient? This leads to the insights we need to generate from analytics to make better decisions. That in turn requires large volumes of data, in some cases real-time, and a broad class of different data types. Given the amount of data and the sophistication of the analytics, we also need additional technologies to support it, e.g., distributed processing and moving beyond traditional relational databases to unstructured and alternative databases. Last, but not least, it is also about the mindset change and the different skills required for all of the above.

14 Big Data Analytical Capabilities
Continuing increases in processing capacity have opened the door to a range of advanced algorithms and modeling techniques that can produce valuable insights from Big Data. These span traditional to emerging methods, and structured to unstructured data:
- A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
- Sentiment Analysis: Extract consumer reactions based on social media behavior.
- Complex Event Processing: Combine data sources to recognize events.
- Predictive Modeling: Use data to forecast or infer behavior.
- Regression: Discover relationships between variables.
- Time Series Analysis: Discover relationships over time.
- Classification: Organize data points into known categories.
- Simulation Modeling: Experiment with a system virtually.
- Spatial Analysis: Extract geographic or topological information.
- Cluster Analysis: Discover meaningful groupings of data points.
- Signal Analysis: Distinguish between noise and meaningful information.
- Visualization: Use visual representations of data to find and communicate information.
- Network Analysis: Discover meaningful nodes and relationships in networks.
- Optimization: Improve a process or function based on criteria.
- Deep QA: Find answers to human questions using artificial intelligence.
- Natural Language Processing: Extract meaning from human speech or writing.
* For more information on these analytic methods, see Appendix.
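Of the methods above, cluster analysis is among the simplest to sketch end to end. The toy implementation below runs a bare-bones k-means on one-dimensional data (the transaction amounts are invented) to show how meaningful groupings emerge without pre-defined labels.

```python
# Bare-bones k-means on one-dimensional data, illustrating the
# "cluster analysis" capability above. The amounts are invented.

def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning each value to its nearest center
    and moving each center to the mean of its assigned values."""
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(c - v))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

# Daily transaction amounts with two obvious spending tiers.
amounts = [12, 14, 11, 13, 95, 102, 99, 97]
print(kmeans_1d(amounts, centers=[0.0, 50.0]))  # → [12.5, 98.25]
```

Production work would use a library implementation (e.g., scikit-learn's KMeans) on many dimensions, but the assign-then-update loop is the whole idea.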

15 Forward-Looking vs. Rear-View Analytics
Big Data Analytics improves the speed and efficiency with which we understand the past, and opens up entirely new avenues for preparing for and adapting to the future. In order of increasing sophistication of data and analytics, and increasing business value:
- Descriptive Analytics (rear-view). What happened? Describe, summarize, and analyze historical data: observed behavior or events; non-traditional data sources such as social listening and web crawling; sentiment scoring.
- Diagnostic Analytics (rear-view). Why did it happen? Identify causes of trends and outcomes: statistical and regression analysis; dynamic visualization; graph analysis and natural language processing to identify hidden relationships and themes.
- Predictive Analytics (forward-looking). What could happen? Predict future outcomes based on the past: a forward-looking view of current and future value; real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need).
- Prescriptive Analytics (forward-looking). What should be done? Recommend 'right' or optimal actions or decisions: dual-objective models; behavioral economics; rapid evaluation of multiple 'what-if' scenarios; optimization of decisions and actions.
- Continuous Analytics (forward-looking). How do we adapt to change? Monitor, decide, and act autonomously or semi-autonomously: monitor results on a continuous basis; dynamically adjust strategies based on a changing environment and improved predictions; agent-based and dynamic simulation models; time-series analysis.
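To make the descriptive-vs-predictive distinction concrete: the sketch below summarizes a short observed series, then fits a least-squares trend line and extrapolates one period ahead. The sales figures are invented for illustration.

```python
# Minimal sketch of moving from descriptive to predictive analytics:
# describe an observed series, then fit an ordinary-least-squares
# trend line and project the next period. The series is invented.

def linear_fit(ys):
    """OLS slope and intercept of y against time index 0..n-1."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

sales = [100, 110, 120, 130]               # observed history (descriptive)
print(sum(sales) / len(sales))             # → 115.0 (a descriptive summary)

slope, intercept = linear_fit(sales)
forecast = slope * len(sales) + intercept  # next period (predictive)
print(round(forecast, 1))                  # → 140.0
```

Prescriptive and continuous analytics then build on such forecasts: choosing actions that optimize an objective, and refitting as new data streams in.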

16 Examples of Big Data Analytics in Action
Market leaders are leveraging Big Data Analytics to generate value by starting with a business need and focusing on implementing actionable insights quickly and decisively.
- Business Need: Greater tailoring of credit card offers to fit customer needs. Data and Analytics: Statistical model based on public credit and demographic data to target customized products to customers. Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics.
- Business Need: Data-enabled engine prognostics, monitoring, maintenance, and repair. Data and Analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance. Impact: Over 70% of annual revenue from the aircraft engine division is attributable to this service.
- Business Need: Search-to-purchase conversion by anticipating the intent of a shopper's search and delivering relevant results. Data and Analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web. Impact: Increases by 10-15% the likelihood that a customer will complete their purchase, translating to millions of dollars in revenue.
- Business Need: Transformation from subscription streaming service to original content producer. Data and Analytics: Analysis of data from 66 million subscribers' viewing habits and preferences. Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013.
- Business Need: Leverage the Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency, and reduce downtime. Data and Analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost. Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years.
Sources:
Speaker notes: Examples of leaders, what they do well, and the value they have been able to generate (what they do well is starting from a business problem, focusing on the insights, etc.).

17 Big Data Analytics in Development
Big Data Analytics is making an equally impressive impact on development interventions, allowing decision-makers to reach and serve previously neglected populations.
- Business Need: A more transparent, reliable, and low-cost method to track inflation in Argentina. Data and Analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies. Impact: Government statistical offices are shifting to accept Big Data; central banks are using Big Data to see day-to-day volatility.
- Business Need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium. Data and Analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.). Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior.
- Business Need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding. Data and Analytics: The city combines data from 30 city agencies, including weather, satellite, video, GPS, historic rainfall, and topographic survey data, in a central Operations Center. Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis.
- Business Need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique. Data and Analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers' creditworthiness, and incubate new mobile businesses with greater predictors of success. Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information.
Source:

18 Section 03: Creating a Big Data-Enabled Organization (Bringing Big Data home)

19 Step 1: Be Yourself
Beginning with a clear understanding of the specific questions you intend to use Big Data Analytics to address can help guide where and which data solutions are deployed.
- Strategic (value enhancement; delivering future value): Data-driven decision-making in real time; uses analytics to develop new programs/opportunities; relies heavily on data supplied by others; often struggles to move away from exclusively intuitive decision-making.
- Tactical (enabling strategy and improving performance): Uses analytics to reduce political divergence and drive consensus; real-time analytics to enable quick responses to events; uses data to develop personalized services; needs more objective and higher-quality data.
- Operational (value enablement; day-to-day operations): Struggles to move from a narrow focus on reactive operations to more proactive, comprehensive management of daily operations; high value from digitization of operational processes across program units; often already proficient in traditional business intelligence.

20 Step 2: Secure People & Skills
The competencies required of "data scientists" within an analytics organization or project converge from multiple skill domains:
- Statistical & Mathematical: Expertise in statistical techniques, tools, and languages used to run analyses that generate insights, and to effectively determine and communicate actionable insights.
- Subject Area or Domain Expertise: Deep understanding of the industry, subject area, or research domain to help determine which questions need answering, and at what frequency, specificity, or geography.
- Computer Science & Programming: Comfort in programming across various languages, and a thorough understanding of external and internal data sources and of data gathering, storing, and retrieving methods, which help combine disparate data sources to generate unique insights.
- Organization-specific Information Knowledge: Organization-specific knowledge about data assets, including enterprise "metadata", their location, and the appropriate business context for use in advanced analytics.
Speaker notes: You could potentially also add another domain, Visualization and Communication Expertise. This becomes critical in enabling those who aren't professional data analysts to interpret data. It involves comfort with visual art and design to: turn statistical and computational analysis into user-friendly graphs, charts, and animations; create insightful data visualizations (e.g., motion charts, word maps) that highlight trends that might otherwise go unnoticed; utilize visual media to deliver key messages (e.g., reports, and screens from mobile to laptop/desktop to HD large visualization walls, interactive programs, and perhaps soon augmented reality glasses); engage effectively with senior management, talk their language, and translate data-driven insights into decisions and actions; and develop powerful, convincing messages for key stakeholders to positively influence their course of action.

21 Step 3: Let objectives dictate structure, not vice versa
How analytics efforts or organizations are structured (whether reporting is vertically or horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared) can influence efficiency and impact. Three common models are Distributed, Federated, and Centralized Analytics:
- Objectives. Distributed: highly focused analytics support; subject area-specific innovations. Federated: adopt previously proven practices; repeatable models. Centralized: governance; aligning analytics to organization-wide strategy.
- Data Warehouses, Marts, etc. Distributed: deployed locally. Federated: some data and models shared across groups. Centralized: deployed and managed centrally.
- Analytics Tools. Distributed: managed locally. Federated: managed locally, but connected to a group framework. Centralized: controlled centrally, with units having access to shared resources.
- Analytics Staff/Competencies. Distributed: placed within individual units. Federated: skills tailored to a specific region or subject matter. Centralized: placed within a central analytics team, available as needed to support individual units.
(Diagram: local vs. central placement of ETL, data warehouse, data marts, BI applications, metadata repository, and the Analytics Competency Center under each model.)
Let structure and its consequences be conscious, proactive choices, rather than allowing the downstream effects of poor structure to constrain or even force analytical capabilities.

22 The 'Hub-Spoke' operating model often serves as a well-synchronized, connected system
Its four levels, from global business strategy down to local business operations:
1. Central Decision Hub: Centralized data management model, with strong interdependence between business units; responsible for aligning analytic priorities with business strategy; establishes best practices and supports innovation (may reside in a few 'hubs'); system decisions, process design, and other organizational decisions made centrally.
2. Competency Center ('Standards'): Owns model development and repeatable winning routines; standardized reporting/analytics; data acquisition/vendor negotiation; shared customers and products, requiring seamless access to shared data.
3. Centers of Excellence (Regional): Market/region/subject area-specific innovation; specific skill focus and strengths (e.g., unique analytic strength in geospatial analytics); operates as a centralized group, including business and IT; owns and is accountable for tools and processes for collaboration and visualization of data and analyses.
4. Local Adoption of Practices (local 'spokes'): Market/region/subject area-specific decisions using local or subject area-specific data in centrally developed models; focus on speed of response; support local business units; adopt processes that have already been standardized; high data integration across business units.
(Diagram: a sample hub-spoke interaction model, with the Central Decision Hub connected through the Competency Center and regional Centers of Excellence to the local 'spokes'.)

23 Step 4: Invest in Appropriate Infrastructure
Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support. Three sets of considerations shape infrastructure choices:
- Analytics Capabilities (analysis type, flexibility, structures). Objective: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed. Considerations and impact: This dictates performance needs along with data structures and processing architecture; the interface could restrict the ability to perform ad hoc analysis and to update; support for analysis-specific data structures can improve performance and reduce analysis effort.
- Data (variety, size, structure, sources). Objective: Define the data set that will be used for the analysis, including its sources, size, and structure. Considerations and impact: The size of data sets introduces the need for scalable infrastructure and performance; variability of source data models and data set structure requires data model flexibility; diverse sources will require scalability, model flexibility, and flexible interfaces.
- Application (frequency, speed, interfaces). Objective: Define the timeliness and frequency of the analysis results for reporting and downstream systems. Considerations and impact: The frequency of analysis will dictate the processing architecture (batch or real time); the timeliness of the analysis will affect the need for scalability and performance; inbound and outbound interfaces are defined by the use of data and the required flexibility.

24 Emerging Infrastructure Options
To harness Big Data, storage solutions must be able to support targeted analytics capabilities, data diversity, and performance needs.
- Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware.
- NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures.
- Cloud Computing: Can improve flexibility, scalability, and cost management, and enable a cohesive business strategy across an organization.
Traditional challenges being addressed: scalability issues; the need for flexible data models to better ingest unstructured and semi-structured data; Big Data information extraction and queries that require large volumes of processing cycles and must scale quickly; the need to combine and link multiple data sources.
* For more information on these infrastructure options, see Appendix.


26 Summary: Key Guiding Principles for developing a best-in-class analytics organization
Guiding principles (illustrative; may be customized):
- Establish the analytics organization as an objective advisor for insight generation.
- Ensure responsiveness to business needs by balancing 'consolidation' with 'distribution' of analytics functionality where it makes sense.
- Innovate, invest in, and build new analytics capabilities, and gradually push them out to the business as user sophistication matures (e.g., data visualization).
- Prioritize strategic business value delivery over tactical outputs.
- Ensure adequate attention to user experience.
- Focus on speed, accuracy, and reusability.
- Optimize and manage workflow to achieve maximum resource efficiency.
- Allow distributed analytics where it makes sense, but govern it tightly and ensure cataloguing.
- Ensure a consistent feedback loop for all outputs that are created.

27 Section 04: Case Study, 'Nowcasting' Economic Activity in Colombia

28 Situation
In Colombia, the leading economic indicators used to analyze economic activity have an average lag of 10 weeks. This presents challenges for the well-timed design of economic policy and the monitoring of economic shocks or trends. The Colombian Ministry of Finance looked for coincident indicators that could allow tracking of the short-term trends of economic activity.
Characteristics of the data needed:
- Real-time
- Highly disaggregated, by sector, geography, etc.
- Statistically correlated with key economic trends (consumption, GDP, etc.)
- A sample robust enough to be representative of the economy as a whole
Source:

29 Group Discussion
What Big Data sources could the Colombian Ministry of Finance potentially use to reliably approximate sectorial economic activity in real time?
Facilitator notes: During this discussion, encourage participants to stay as open-minded as possible, and try not to reveal what the Ministry actually chose to use.

30 Brainstorming Breakout
In groups of 3-4, take five to ten minutes to brainstorm how the Ministry could approach answering the following questions:
- What data should it consider using? Is this data the Ministry already has available, or will it require the Ministry to acquire an entirely new source of data? How does the cost of acquiring this data, whether by its own collection or through an external data partnership, compare to the expected benefits of using it? If this data is new to the Ministry, what entities may already have it in their possession?
- How might the Ministry ensure its staff have the skills necessary to acquire, manage, and use this data? Is this data so uniquely complex that it may require more advanced or entirely new skillsets?
- What should the Ministry consider in the way of data storage and security? How extensively might it need to overhaul its data storage infrastructure to accommodate this data?

31 Solution
Based on web searches performed by Google users, Google Trends (GT) provides daily information about the query volume for a given search term in a given geographic region. For Colombia, GT data are available at the departmental level and also for the largest municipalities.
The Colombian Administrative Department for National Statistics (DANE, for its acronym in Spanish) combined indexes built using GT data with its own official economic activity data (both at the aggregate level and at the sectorial level), all publicly available, to construct leading indicators that determine, in real time, the short-term trend of different economic sectors, as well as their turning points.
In some sense, the GT data take the place of traditional consumer-sentiment surveys. For example, the use of data for a certain keyword (such as the brand of a certain product) might be justified if a drop or surge in web searches for that keyword could be linked to a fall or increase in its demand and, therefore, lower or higher production for the specific sector producing that product.
Source:
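The core statistical idea, a search-volume index co-moving with an official series, can be sketched with a plain Pearson correlation. Both series below are invented stand-ins for a GT index and an official indicator, not actual DANE or Google Trends data.

```python
# Sketch of the nowcasting check: does a search-volume index co-move
# with an official economic series? Both series here are invented
# stand-ins; real inputs would be Google Trends exports and official
# statistics.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

search_index = [40, 45, 55, 60, 70, 75]          # e.g. query volume
unemployment = [8.0, 8.4, 9.1, 9.5, 10.2, 10.6]  # official series

r = pearson(search_index, unemployment)
print(round(r, 3))  # near 1.0 for these strongly co-moving series
```

A high correlation (positive or negative) is what justifies using the search series as a real-time proxy while the official series is still weeks from publication.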

32 Example: "Ahorro" vs. Unemployment Rate (ahorro: savings)

33 Example: "Ahorro" vs. Unemployment Rate
These trends were shown to correlate strongly with traditional measures of unemployment.

34 Example: "Zapatos" vs. Employment Rate (zapatos: shoes)

35 Example: "Zapatos" vs. Employment Rate
These trends were shown to correlate strongly with traditional measures of employment.

36 Find Out More
Melanie Thomas Armstrong, Leading Partner, International Public Sector, +1 (202)
Jean Young, Managing Director, International Public Sector Data Analytics, +1 (703)
Bill Stephens, Director, International Public Sector Data Analytics, +1 (703)
Mariola Pogacnik, Director, United Nations & International Public Sector, +1 (646)
Ashraf Faramawi, Manager, International Public Sector Data Analytics, +1 (202)
Jared Nyarumba, Manager, Data Analytics, Africa
This publication has been prepared for general guidance on matters of interest only, and does not constitute professional advice. You should not act upon the information contained in this publication without obtaining specific professional advice. No representation or warranty (express or implied) is given as to the accuracy or completeness of the information contained in this publication, and, to the extent permitted by law, PricewaterhouseCoopers LLP, its members, employees and agents do not accept or assume any liability, responsibility or duty of care for any consequences of you or anyone else acting, or refraining to act, in reliance on the information contained in this publication or for any decision based on it. © 2017 PricewaterhouseCoopers LLP. All rights reserved. In this document, “PwC” refers to PricewaterhouseCoopers LLP which is a member firm of PricewaterhouseCoopers International Limited, each member firm of which is a separate legal entity.

37 Appendix

38 Emerging Data Storage and Infrastructure Options

39 Building an Analytics Organization: Critical Components
Emerging Infrastructure – Data Storage Options
Distributed Processing: Hadoop and similar solutions provide scalable distributed storage and distributed computation on commodity hardware.
Introduction to Hadoop
Hadoop is based on work done by Google in the early 2000s (a combination of the Google File System (GFS) and MapReduce)
Useful for analyzing copious amounts of complex data across multiple data sources
Distributes data as it is initially stored in the system
Applications are written in high-level code
Computation happens where the data is stored, whenever possible
Data is replicated multiple times on the system for increased availability and reliability
Benefits: faster and lower-cost analysis, linear scalability, greater flexibility
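The programming model behind Hadoop can be illustrated without a cluster: a map step emits (key, value) pairs, a shuffle groups them by key, and a reduce step aggregates each group. Hadoop runs these phases across many nodes; this single-process word-count sketch shows only the logic.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final count."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big analytics", "data at scale"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In Hadoop, each map call would run on the node that already holds its input split ("computation happens where data is stored"), and the shuffle is the only phase that moves data over the network.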

40 Distributed Storage and Analytics: Hadoop vs. Traditional Data Stores
Compared to traditional data stores, Hadoop provides greater flexibility when it comes to storing data and scaling to meet demand.
Hadoop vs. Traditional Data Stores
Data Structure: Hadoop supports both structured and unstructured data; traditional stores support only structured data.
Data Size: Hadoop is effectively unlimited; traditional stores are limited depending on the selected RDBMS.
Data Formats: Hadoop supports various serialization and data formats (e.g. text, JSON, XML); traditional stores support a single tabular data format.
Scaling: Hadoop is distributed from the ground up – simply add more nodes to increase capacity; in traditional stores, scaling is possible but typically more complex and cannot be performed at the node level.
Distributes data as it is initially stored in the system: individual nodes can work on data local to those nodes, so no data transfer over the network is required for initial processing.
Applications are written in high-level code: developers do not worry about network programming, temporal dependencies, etc.
Nodes talk to each other as little as possible: developers should not write code that communicates between nodes – a “shared nothing” architecture.
Data is spread among machines in advance: computation happens where data is stored, whenever possible, and data is replicated multiple times on the system for increased availability and reliability.
Sources:

41 Building an Analytics Organization: Critical Components
Emerging Infrastructure – Data Storage Options
NoSQL: embedded and persisted storage that implements data models through document, graph, and dictionary structures.
NoSQL Storage Types (in order of increasing data complexity)
Key–Value Store – Pros: simplicity and scalability; Cons: lack of advanced features/queries
Columnar Store – Pros: scalability and flexibility; Cons: complexity
Document Store – Pros: easy to use; Cons: scalability
Graph Store – Pros: graph joins; Cons: flexibility
Solution Examples
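The difference between these data models can be sketched with plain Python structures (all names and records below are invented; real stores such as Redis, MongoDB, or Neo4j add persistence, indexing, and distribution on top of these shapes):

```python
# Key-value store: opaque values addressed only by key.
# The store cannot query inside the value.
kv = {}
kv["user:42"] = b'{"name": "Amina"}'

# Document store: values are structured documents whose
# fields can be queried directly.
documents = {
    "42": {"name": "Amina", "city": "Nairobi"},
    "43": {"name": "Brian", "city": "Mombasa"},
}
in_nairobi = [d["name"] for d in documents.values() if d["city"] == "Nairobi"]

# Graph store: nodes and edges, queried by traversal ("graph joins").
follows = {"Amina": ["Brian"], "Brian": ["Amina", "Chao"]}
friends_of_friends = {f for friend in follows["Amina"] for f in follows[friend]}

print(in_nairobi, sorted(friends_of_friends))
```

The trade-offs listed above follow from these shapes: the key-value store is the simplest to scale precisely because it supports no cross-value queries, while the graph store supports rich traversals at the cost of flexible partitioning.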

42 Building an Analytics Organization: Critical Components
Emerging Infrastructure – Data Storage Options
Cloud Computing: the model is compelling; cloud computing can improve flexibility, scalability, and cost management. The businesses best able to realize its potential will establish a cohesive business strategy, as cloud computing can transform your entire organization: people, processes, and systems.
Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs. The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.
Source: PwC, “Digital IQ Snapshot: Cloud”; PwC, “FS Viewpoint: Clouds is the forecast”

43 Text Mining and Natural Language Processing

44 Data Mining, Text Mining, and Natural Language Processing
What are they and how are they used?
Natural Language Processing: NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.
Text Mining: analysis of large quantities of natural language text, detecting lexical or linguistic usage patterns to extract probably useful information.
Data Mining: extraction of implicit, previously unknown, and potentially useful information from data.
Source: Text Mining, Ian Witten, 2004

45 Natural Language Processing and Text Mining
What are they and how are they used?
Natural Language Processing
Purpose and Overview: NLP applies statistical or rules-based computational techniques to evaluate and model texts at various levels of linguistic analysis in order to identify key concepts, enable intelligent processing, and draw inferences.
Objectives:
Deep analysis and structuring of individual texts through phrase identification, part-of-speech tagging, and word disambiguation
Identification of a text's message or meaning through the use of linguistic analysis: syntactic (sentence structure or breakdown), lexical (meaning of words within the context of use), semantic (logical meaning of phrases or text), discourse (connections among sentences and phrases that define the topic)
Generation of natural language sentences or texts as a response to an input/question, using a context text or knowledge base
Text Mining
Purpose and Overview: text mining represents a system of statistical analysis and classification algorithms employed to explore groups of natural language texts and identify useful patterns, relationships, and knowledge.
Objectives:
Use of data mining techniques and statistical methods to conduct a shallow analysis of groups of documents and make accessible the knowledge within structured/semi-structured texts
Development of a structured view of a document's contents in order to develop linkages among texts for classification, categorization, knowledge discovery, and search
Statistical analysis of word/sentence usage and attributes in order to identify key phrases, summarize texts, extract information from groups of texts, and discover new knowledge using the information within texts
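The "shallow analysis" side of text mining can be demonstrated in a few lines: tokenize a small corpus, drop function words, and rank what remains by frequency. This is a toy sketch (the stop-word list and corpus are invented); real pipelines would use a proper tokenizer and stemmer from a toolkit such as those listed on the following slide.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lexical analysis in miniature: split raw text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def key_terms(texts, n=3):
    """Shallow text mining: rank content terms by frequency across a corpus."""
    counts = Counter(
        token
        for text in texts
        for token in tokenize(text)
        if token not in STOP_WORDS
    )
    return [term for term, _ in counts.most_common(n)]

corpus = [
    "The ministry published the survey of economic activity.",
    "Economic activity in the region is rising.",
]
print(key_terms(corpus))
```

By contrast, the NLP objectives above (parsing, disambiguation, discourse analysis) operate on the structure of each individual sentence, not just on corpus-level counts.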

46 NLP Tools
Tools and APIs that provide capabilities to parse and structure natural language texts for machine analysis
OpenNLP – a machine-learning-based toolkit for the processing of natural language text. Link. Analysis: tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, coreference resolution
GATE – a Java suite of tools that can perform natural language processing tasks for multiple languages. Link. Analysis: information extraction, part-of-speech tagging, tokenization, sentence splitting
NLTK – a suite of libraries and programs for symbolic and statistical natural language processing in Python. Link. Analysis: part-of-speech tagging, word categorization, text classification
Stanford NLP – statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs. Link. Analysis: tokenization, named entity recognition, parsing, classification, segmentation, coreference resolution
LingPipe – a toolkit for processing text using computational linguistics. Link. Analysis: sentiment analysis, entity recognition, clustering, topic classification, sentence detection, disambiguation
MontyLingua – a suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java. Link. Analysis: text generation, stemming, phrase chunking
Rosetta Linguistic Platform – a suite of linguistic analysis components that integrate into applications for mining unstructured data. Link. Analysis: language identification; name, place, and key concept extraction; name matching; name translation

47 Text Mining/Analytics Tools
Toolkits that provide capabilities for identifying and analyzing features within individual or groups of texts
RapidMiner – an open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics. Link. Analysis: document classification, sentiment analysis, topic tracking, data mining, traditional analytics
SAS Text Miner – a suite of text processing and analysis tools. Link. Analysis: text parsing, filtering, feature extraction, topic clustering
VisualText – an integrated development environment for building information extraction systems, natural language processing systems, and text analyzers. Link. Analysis: information extraction, summarization, categorization, data mining, document filtering, natural language search
SAS Sentiment Analysis – a commercial tool dedicated to customer sentiment analysis. Link. Analysis: customer sentiment monitoring, sentiment discovery
Textifier – a tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT). Link. Analysis: topic modeling, information retrieval, document analysis, social media analysis
Infinite Insight – a system for automatically preparing and transforming unstructured text attributes into a structured representation. Link. Analysis: term frequency, term frequency–inverse document frequency, root word coding, synonym identification, customization of stop words, stemming rules, concept merging
Clustify – software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization. Link. Analysis: document clustering
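The term frequency–inverse document frequency (TF-IDF) weighting named above is simple enough to sketch directly: a term's weight in a document grows with how often it appears there and shrinks with how many documents contain it. The corpus below is invented; the formula shown is the basic variant, and production tools typically add smoothing.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Score each term in each document by term frequency times
    inverse document frequency."""
    doc_tokens = [doc.lower().split() for doc in corpus]
    n_docs = len(doc_tokens)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

corpus = ["shoes price trends", "shoes sales", "employment trends"]
scores = tf_idf(corpus)
# "shoes" appears in two of the three documents, so it is down-weighted
# relative to "price", which is unique to the first document.
print(scores[0]["shoes"] < scores[0]["price"])
```

This down-weighting of common terms is what makes TF-IDF a better basis for clustering and search than raw word counts.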

48 Text Mining/Analytics Tools (cont.)
Toolkits that provide capabilities for identifying and analyzing features within individual or groups of texts
Attensity Analyze – customer analytics applications that help analyze high volumes of customer conversations across multiple channels. Link. Analysis: unstructured communication analysis, sentiment analysis, consumer profiling
ReVerb – a program that automatically identifies and extracts binary relationships from English sentences. Link. Analysis: information extraction, topic identification, topic linking
Open Text Summarizer – an open source tool for summarizing texts. Link. Analysis: document summarization
Open Calais – a web-based API used to analyze content and extract topics or information. Link. Analysis: attribute/feature extraction, fact identification
Knowledge Search – a family of techniques and tools for searching and organizing large data collections. Link. Analysis: semantic analysis
KH Coder – free software for quantitative content analysis and text mining. Link. Analysis: text parsing, document search, network analysis

49 Resources: Tutorials, Tools, Applications, and Research Groups
Tutorials and Overviews: Text Mining Overview, Text Mining Activities, Text Mining Tutorial, Text Mining Process, NLP Introduction, NLP Overview, NLP Concepts
Research Groups and Papers
Tools and Data Sets: NLP Toolkit List, NLP Tools, Text Mining Tools, Tools by Function

50 DeepQA, Image Analytics, and Audio Analytics

51 DeepQA: Overview and Introduction
What is DeepQA?
DeepQA forms the core of Watson, the open-domain question analysis and answering system
The DeepQA stack comprises a set of search, NLP, learning, and scoring algorithms
DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture (UIMA)
What is the target problem set?
Understanding the meaning and context of human language
Searching and retrieving information from a large library of unstructured information
Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set
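At its most abstract, the DeepQA pattern is: retrieve candidate evidence for a question, score each candidate, and rank. The sketch below is a toy illustration of that generate-then-score shape only; the passages, the term-overlap scorer, and all names are invented, and the real system combines many parallel scorers with learned weights.

```python
def question_terms(question):
    """Normalize a question into a set of lowercase terms."""
    return set(question.lower().split())

def generate_candidates(question, knowledge):
    """Candidate generation: retrieve passages sharing terms with the question."""
    q_terms = question_terms(question)
    return [(p, q_terms & set(p.lower().split())) for p in knowledge]

def score(overlap, question):
    """Toy evidence scorer: fraction of question terms found in the passage."""
    return len(overlap) / len(question_terms(question))

knowledge = [
    "Nairobi is the capital of Kenya",
    "Bogota is the capital of Colombia",
]
question = "what is the capital of Kenya"
ranked = sorted(
    ((score(ov, question), p) for p, ov in generate_candidates(question, knowledge)),
    reverse=True,
)
print(ranked[0][1])
```

The hard parts of DeepQA are exactly what this sketch omits: deep linguistic analysis of the question, hundreds of evidence scorers, and machine-learned confidence estimation over their outputs.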

52 DeepQA Infrastructure Technology: Data Management and Search
Unstructured Information Architecture: UIMA. Link
SQL Server: MySQL, Apache Derby
Java Natural Language Toolkit: OpenNLP, Stanford NLP
Map/Reduce: Apache Hadoop
Commonsense Knowledge Base: OpenCyc, Open Mind Common Sense
Triple Store: Apache Jena, OpenAnzo
Text Search: Lucene, OpenFTS

53 DeepQA Infrastructure Technology: Platform and Administration
Web Server: Apache. Link
Virtualization Host: VMware, Xen
Distributed File System: Apache Hadoop, OpenAFS
File Management/Archival: rsync
OS: Fedora
Cloud Management: Extreme Cloud Administration, OpenNebula

54 Business Applications
DeepQA provides capabilities that can facilitate knowledge discovery, improve customer interaction, and uncover hidden facts
Knowledge Discovery – search internal and external unstructured/structured information assets to uncover previously unknown knowledge. Objectives: identify information about a subject through deep analysis of internal and external information sources; answer questions about a business problem or trend that may be difficult to analyze within traditional data sources
E-Discovery – search documents and communications to uncover relevant information associated with a specific topic. Objectives: identify business topics and trends within communications and documents; search for non-compliant activities within internal and external data sources
Contract Evaluation – search through single or multiple contracts to answer specific questions about the nature of the contract. Objectives: identify key facts or issues that comprise a contract or set of contracts; identify contracts or legal documents that contain similar entities or features
Relationship Management – provide the ability to interact with consumers, giving precise responses to technical and open-domain questions. Objectives: provide a platform for automatically answering consumer questions about products or services; reduce reliance on call centers and improve interaction with consumers
Consumer Discovery – search consumer communications, social media, and sales information to identify opportunities and demographics. Objectives: identify background information about consumers; identify consumer qualities that create risks or represent opportunities
Technical Troubleshooting – find answers to technical and process problems. Objectives: utilize unstructured data and communications to identify solutions or root causes of system and process problems

55 Areas for Further Research: Infrastructure/Tools and Search Technologies/Concepts
Hadoop Map/Reduce – used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.)
OpenNLP – a Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA
OpenCyc – an open commonsense reasoning platform. Need to better understand the tool's role as well as how it fits with the other technologies
UIMA – an architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities
Lucene – a text search platform. Further research is needed to understand the library and how to incorporate it into UIMA
Text Search Scoring – algorithms used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at the various phases of DeepQA
Triple Store Search – triple stores maintain data in a subject–predicate–object structure and are used for turning around quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms
Commonsense Reasoning – research is required to understand this branch of AI, its technologies, and its role within DeepQA
Document/Information Retrieval – generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed; falls within a broader research topic for enterprise search

56 Areas for Further Research: Machine Learning and Natural Language Processing
Machine Learning
MetaLearners – research the concept and how metalearners are used to evaluate learning models and assign a confidence score based on the learning models used to rank search results
Question Classification – identify techniques and models that can be employed to analyze and classify questions
Search Ranking Models – research which models are available for ranking search results based on the various search and recall techniques employed for a question
NLP
Logical Form Analysis – research how it is used to discover logical relationships within text and produce an understanding of the information within the text
Semantic Structure Analysis – identify tools and algorithms employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search
Relationship Analysis – research techniques and tools for uncovering temporal, geospatial, and spatial relationships within a knowledge set
Feature Extraction – evaluate tools and algorithms used to extract features of entities from text and identify methods for structuring the data for search
Phrase Analysis – identify algorithms and tools that can be applied to extract key phrases from text based on a search context

57 URLs: Overviews and Applications
Background Documents: The AI Behind Watson, How to build a Watson Jr., Building your own Watson, Algorithms behind Watson, Overview of the technology behind Watson, DeepQA Project Page
Applications and Articles: Watson and your business, Understanding the DeepQA Process, The future of DeepQA, DeepQA for e-discovery

58 Image Analytics Overview
How can we extract insight from images and video?
The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis
Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information
Advantages
Provides a method to structure, organize, and search information that is stored within images
Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge within enterprise content
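The "extract, quantify, and structure" step can be made concrete with the simplest possible example: represent a grayscale image as a grid of intensities, segment it with a threshold, and extract a numeric feature from the result. The pixel values are invented, and real pipelines would use a library such as OpenCV or PIL from the tools slide rather than nested lists.

```python
# A tiny grayscale image as a grid of intensities (0 = black, 255 = white).
image = [
    [12, 240, 238, 10],
    [15, 250, 245, 12],
    [11, 13, 14, 10],
]

def threshold(img, cutoff=128):
    """Basic image processing: segment bright pixels from the background."""
    return [[1 if px >= cutoff else 0 for px in row] for row in img]

def region_area(mask):
    """A simple extracted feature: how many pixels passed the threshold."""
    return sum(sum(row) for row in mask)

mask = threshold(image)
print(region_area(mask))
```

Features like this area count are the structured outputs that turn image content into rows a traditional analytics process can consume.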

59 Image Analytics Tools
There are few standalone packages capable of performing robust image analysis; however, solutions can be developed using existing frameworks and analytics toolkits
OpenCV – an open source library of computer vision functions, accessible via C, Java, and Python (image processing, computer vision, machine learning)
PAXit Image Analysis – an integrated image analysis platform that provides basic feature identification functions (image processing)
ImageJ – a Java-based image processing platform that can be accessed via an API and extended with custom plugins (image processing)
PIL – a Python image processing library (image processing)
PyBrain – a modular machine learning library for Python (machine learning)

60 URLs: Tutorials, Tools, Applications, and Research Groups
Tutorials: Tutorial on Image Processing and Analysis, Online Book of Algorithms for Computer Vision, Online Machine Vision Book
Research Groups and Papers: Computer Vision Group, CMU Machine Vision Group, Stanford Machine Vision Group
Tools and Data Sets: Image Analysis and Mining Framework, Image Mining Software

61 Audio Analytics Overview
How can we extract insight from audio and voice media?
The process of capturing audio and analyzing its features so as to extract the content and context of an event
Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques
Advantages
Provides a method for identifying events or common patterns within sound bites
Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context

62 Audio Analytics: Capabilities and Insights
What data can we capture from sound bites that can be used to enhance other data or analysis?
Information Points
Event – audio events are identified as changes in sound patterns and/or intensity over time
Rate – defines how quickly a sound or pattern of sound is occurring and can be used to evaluate the nature of an exchange, the state of the sound source, and the context of the topic
Power and Intensity – measures the loudness of the sound or event and provides a way of evaluating the mood or emotion of the sound source
Sound and Pitch – a measure of the sound quality; can serve as a tool for isolating separate audio events or sources as well as measuring changes to the sound source
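Two of these information points have standard, easily computed signal features: root-mean-square amplitude for power/intensity, and zero-crossing rate as a rough proxy for pitch/rate. The two sample sequences below are invented; real audio would be decoded from a file into thousands of samples per second and analyzed in short windows.

```python
import math

def rms_energy(samples):
    """Power/intensity: root-mean-square amplitude of the signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    """A rough pitch/rate proxy: how often the waveform changes sign."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)

# Two invented sound bites: a quiet low-frequency hum and a loud,
# rapidly oscillating burst.
hum = [0.1, 0.2, 0.1, -0.1, -0.2, -0.1, 0.1, 0.2]
burst = [0.9, -0.8, 0.9, -0.9, 0.8, -0.9, 0.9, -0.8]

print(rms_energy(burst) > rms_energy(hum))
print(zero_crossing_rate(burst) > zero_crossing_rate(hum))
```

Tracking how these values change over time is one way "events" are detected: a sudden jump in RMS energy or crossing rate marks a change in the sound pattern.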

63 Audio Analytics Applications
Voice Recognition – analyze conversations to capture speech as text-based dialog. Objectives: capture and structure the content of conversations; use structured speech as an input to text mining and natural language processing capabilities; combine phone-based conversations with other interaction data sets
Sound Matching – analyze sound clips to identify specific events taking place. Objectives: monitor customer interactions or business operations to capture events in real time; use captured events for comparison, categorization, and analysis with other data points
Sentiment Analysis – monitor phone calls with customers to uncover sentiment towards the experience and/or products/services. Objectives: capture the content of the conversation and conduct sentiment analysis based on word choice; analyze the pitch, loudness, and rate of consumer speech to identify the emotional state during the conversation and its cause
Employee/Customer Screening – monitor customer and job candidate conversations to extract information from word usage and speech patterns that can inform or improve a screening process. Objectives: analyze pre-screen phone conversations to assess a job candidate's personality, interest in the job, and fit to job requirements; analyze customer conversations to assess level of risk and honesty when applying for a product or filing claims/complaints

64 Audio Analytics Tools
There are few tools on the market that provide a broad range of audio analysis capabilities; however, basic audio analysis and natural language toolkits can be combined for robust analytics
CLAM – a C++ library that provides varying levels of audio processing and information retrieval capabilities
CallMiner – a tool capable of translating calls into a more structured text data set and combining them with other communication forms
Nuance – logs calls and structures audio for text-based search and retrieval
Yaafe – an audio feature extraction toolkit with wrappers for several languages
PRAAT – a multi-platform audio analysis toolkit

65 URLs: Tutorials, Tools, Applications, and Research Groups
Tutorials: Overview of audio features for sentiment analysis, Lecture on Audio Features and Information, Overview of audio analysis
Research Groups and Papers: National Center for Voice and Speech
Tools and Data Sets: Audio analysis package, Audio Mining Software

66 Social Network Analysis

67 Applications
Analyze organizational structures to identify opportunities that can improve communication, productivity, and collaboration
Collaboration Analysis – evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures. Objectives: identify team structures that are not effective; identify informal organizational structures; identify individuals/roles or groups that are influential to collaborative work environments
Content/Knowledge Management – evaluate how knowledge or content is diffused and accessed within an organization. Objectives: improve content and knowledge distribution; identify content bottlenecks, open communication flows, and establish channels; explore the impact of new communication methods
Community Mining – identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks. Objectives: improved structures for key organizational functions; improved information flows; identify potential bottlenecks for organizational functions; identify cultural patterns to build other communities
Organization Development – explore formal and informal organization structures and how individuals work with one another to improve the design of the organization. Objectives: improve the hierarchy and structure of the organization to better align with informal practices; identify team members who are effective leaders and would impact the organization if promoted

68 Applications
Analyze network structures, communication channels, and information flows to identify operational enhancements
Disaster Recovery Planning – assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans. Objectives: identify communication improvements for disaster recovery teams; identify weak links among functional groups to improve collaboration during recovery plan execution
Data/Information Dissemination – assess how data points or information sets originate and are distributed across the enterprise to their intended targets. Objectives: identify overlapping information sets and bottlenecks for information dissemination; assess how organization structures or information architecture impact the flow of information to its targets
Fraud Detection/Prevention – assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity. Objectives: identify network agents that collaborate with known fraudulent agents; identify activities that align with known fraudulent behavior
Process Discovery/Improvement – analyze the organization structure and communication patterns to uncover process improvements or identify new processes. Objectives: identify process improvements through discovery of hidden process steps, communication flows, and actors; discover undocumented or informal processes hidden within frequent collaboration and communication paths
Supply Chain Analysis – evaluate the structure of a supply network and the interactions among the entities that comprise it to identify gaps, bottlenecks, and sourcing strategies. Objectives: identify communication gaps that could impact dependent processes or operations; identify strategic relationships to optimize the supply network; identify supply nodes that create inefficiencies

69 Applications
Analyze social media networks and consumer feedback to improve product offerings and market interactions
Novelty/Sentiment Diffusion Analysis – observe how a specific topic, news article, or sentiment diffuses through a consumer network. Objectives: assess how target consumers/markets will react to a piece of news or a campaign; evaluate how long news, data, or sentiment will be retained within a system and how far it will spread
Market Influencer Identification – monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities. Objectives: identify individuals or groups that influence markets and adoption; identify untapped markets; identify market segments as targets for ad campaigns to improve product/service adoption
Consumer Segmentation – analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics. Objectives: improve product or service offerings based on attributes that connect the consumer market; develop strategies to target new or existing consumers based on identified segmentation characteristics
Product or Brand Diffusion Analysis – analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse. Objectives: identify segments or individuals likely to be early adopters; identify incentives or campaigns that will improve product/service adoption
Recommendation Systems – analyze consumer network connections and common features among consumers to develop recommendations. Objectives: identify new feature sets for products and services; assess new markets for selling similar or new products; target consumers with specific products or services

70 Tools
Social network analysis plug-ins and APIs for development/scripting languages and data analysis tools
SNAP – a general-purpose network analysis and graph mining library for C++. Link
Statnet – a package for R that provides capabilities for social network statistical analysis. Link
libSNA, graphTool, networkX – Python libraries for network analysis and manipulation
JUNG – a Java package for network analysis and modeling. Link
NodeXL – an Excel plug-in that provides an easy-to-use, interactive interface to explore and visualize networks. Link
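The basic object these libraries operate on is a graph, and one of the simplest influence measures they compute is degree centrality: the fraction of other nodes a person is directly connected to. A minimal pure-Python sketch with an invented four-person network (libraries like networkX provide this and far richer measures out of the box):

```python
# An undirected communication network as an adjacency list.
network = {
    "Amina": ["Brian", "Chao", "Dede"],
    "Brian": ["Amina", "Chao"],
    "Chao": ["Amina", "Brian"],
    "Dede": ["Amina"],
}

def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph) - 1
    return {node: len(neighbors) / n for node, neighbors in graph.items()}

centrality = degree_centrality(network)
most_central = max(centrality, key=centrality.get)
print(most_central)
```

In the applications above, a node like the most central one here is a candidate influencer, informal leader, or potential bottleneck; measures such as betweenness and closeness centrality refine the picture by accounting for indirect paths.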

71 Tools
Proprietary and open source social network analysis interactive application suites
Gephi – an interactive open source platform for network analysis and visualization. Link
UCINET – a commercial social network analysis tool with a separate visualization component. Link
Graphviz – an open source graph visualization package. Link
NetMiner – a proprietary package that provides the ability to develop and implement custom algorithms. Link
KXEN SNA – a network analysis package that provides predictive analytics and customer MDM integration. Link
ProM – an open source package for mining business process networks. Link
Cytoscape – an open source tool for network modeling and analysis; can connect to external data sources. Link
Network Workbench – a large-scale network analysis, modeling, and visualization toolkit for biomedical, social science, and physics research. Link

72 Resources: Tutorials, Tools, Applications, and Research Groups
Tutorials: Introduction for Beginners, Introductory Lecture, Paper on Business Applications, Network Analysis Process, Online Introductory Book, Introduction to Network Analysis Application and Theory (open source book)
Research Groups and Papers: SNA Group at Stanford (tools, lectures, and papers), Complex Networks and Systems Research Collaboration, SNA Group at Indiana University (lectures, papers, and tools), Reality Mining at MIT, Papers from the International Conference on Advances in Social Networks Analysis and Mining
Tools and Data Sets: Wiki List of Social Network Analysis Software, Review of 100+ Social Network Analysis Tools, List of Tools from The SAGE Handbook of Social Network Analysis, More Tool Reviews, Twitter Data Sets, Web/Blog Data Sets, Facebook Data Sets

73 Additional Case Studies

74 Example 1 — Advanced natural language processing and deep question-answering technology are being applied to clinical decision-making

Memorial Sloan-Kettering Cancer Center
- Memorial Sloan-Kettering Cancer Center is applying DeepQA technology (advanced analytics powered by IBM's Watson) to develop a decision-support application for cancer treatment
- Doctors will be able to generate and evaluate hypotheses about evidence and treatment, and the Cancer Center will be better able to identify and personalize cancer therapies for individual patients

WellPoint and Cedars-Sinai
- WellPoint and the Cedars-Sinai Samuel Oschin Comprehensive Cancer Institute will work together to improve patient care and support physicians in making the most informed, personalized treatment decisions possible
- New clinical research and medical information is estimated to double every five years, and nowhere is this knowledge advancing more quickly than in the complex area of cancer care
- WellPoint's health care solutions will use DeepQA technology to draw on vast libraries of information, including evidence-based scientific and health care data and clinical insights from institutions like Cedars-Sinai

Source: Memorial Sloan-Kettering Cancer Center press release, March 2012; WellPoint press release, December 2011

75 Example 2 — Large volumes of real-time sensor data are empowering individuals to take more control of their health

Quantified Health – P4 Medicine (Predictive, Preventive, Personalized, Participatory)
- Non-invasive wearable sensors are creating a new 'Quantified Health' movement, one of the fastest-growing sectors in the tech industry, let alone in the field of Big Data Analytics
- The number of connected industrial and medical devices is projected to reach 16 billion by 2015
- The mHealth market is estimated to reach a value of $23 billion by 2017

Source: Bruce Bigelow, "Big Data, Big Biology, and the 'Tipping Point' in Quantified Health: Takeaways from Xconomy's On-the-Record Dinner", Xconomy, April 26, 2012

76 Example 3 — Advanced machine learning and visualization techniques are being used to model drug interactions

Modeling Adverse Drug Reactions
- When biological and phenotypic features were integrated alongside chemical structures to predict adverse drug reactions, prediction accuracy increased

Source: Liu M, Wu Y, Chen Y, et al. Large-scale prediction of adverse drug reactions by integrating chemical, biological, and phenotypic properties of drugs. J Am Med Inform Assoc 2012;19:e28–35

77 Other Examples — Companies in other sectors are also pursuing various applications of 'Big Data' and 'Smart Analytics'

Hartford Steam Boiler
- Hartford Steam Boiler is using sensors and real-time sensor data to monitor assets, reduce losses, and better manage risks
- It has been able to manage concentration risks and reduce losses, achieving one of the lowest combined ratios for a commercial insurer

Allianz
- Allianz is 'mashing' satellite data, third-party street-level location data, map data, images, property-specific data, and other internal data to better understand risk concentrations and manage concentration risk in commercial property insurance

Procter & Gamble
- Procter & Gamble is investing in analytics talent for quicker decision-making, with the CIO planning to increase fourfold the number of staff with expertise in business analytics
- Executives are using big data to uncover what is currently going on in the business, to understand why, to predict future performance, and to understand what actions P&G should take

Source: "Procter & Gamble – Business Sphere and Decision Cockpits", Ravi Kalakota, Practical Analytics (WordPress), Feb. 2012; mskcc.org/cancer-care; eWeek.com; Healthcare IT News, "IBM Watson to Aid Sloan-Kettering With Cancer Research", March 2012

78 Big Data Analytics Technology & Vendor Mappings

79 Big Data Analytics – Technology & Vendor Mappings

Layer 1. Infrastructure – Cloud
- Private: EMC Private Cloud (EMC), HP Private Cloud (HP), Teradata Private Cloud (Teradata), Dell Private Cloud (Dell)
- Public: Azure SQL (Microsoft), Amazon Web Services (Amazon), Google Cloud Platform (Google)
- Hybrid: EMC Hybrid Cloud (EMC), HP Helion (HP), IBM Hybrid Cloud (IBM)

80 Big Data Analytics – Technology & Vendor Mappings

Layer 3. Data Ingestion & Integration – Data Acquisition
- Batch/Micro-batch: Apache Kafka (Apache Software Foundation), Fluentd (open source), Sqoop, RabbitMQ, AWS Kinesis (Amazon Web Services), Apache Spark
- Real-time/Streaming: Apache Storm, Apache Spark Streaming, Samza, NiFi
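The batch/micro-batch versus real-time/streaming distinction on this slide can be sketched in plain Python. The helpers below are toy stand-ins for illustration, not the API of any listed tool: micro-batching tools (e.g. Spark Streaming) group incoming records into small batches before processing, while per-record streaming tools (e.g. Storm) handle each record as it arrives.

```python
# Toy contrast between micro-batch and per-record (streaming) ingestion.
# Real tools add buffering, fault tolerance, and back-pressure; this
# only illustrates the difference in processing granularity.

def micro_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def process_streaming(stream, handler):
    """Handle each record immediately as it arrives."""
    return [handler(record) for record in stream]

if __name__ == "__main__":
    records = range(7)
    print(list(micro_batches(records, 3)))          # [[0, 1, 2], [3, 4, 5], [6]]
    print(process_streaming(records, lambda r: r * 2))
```

The trade-off the sketch hints at: micro-batching amortizes per-record overhead at the cost of latency (a record waits until its batch fills), while streaming minimizes latency at the cost of per-record processing overhead.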

81 Big Data Analytics – Technology & Vendor Mappings

Layer 3. Data Ingestion & Integration – Data Quality and Data Integration
- Data Quality: Data Profiling/Cleansing, Data Matching/De-duplication, Standardization/Normalization (*Need assistance in locating)
- Data Integration – ETL/ELT: Hadoop (Apache Hadoop), Talend, Hive (Apache Software Foundation), Drill
- Staging: Persistent Staging, File Exchange, File Storage

82 Big Data Analytics – Technology & Vendor Mappings

Layer 3.5. Execution/Data Processing
- Custom Compilers (*Need assistance in locating)
- Batch: MapReduce (Apache Hadoop), Spark (Apache Software Foundation), AWS EMR (Amazon Web Services), Tez
- In-Memory Processing
- Resource Management – Computing Framework/Cluster Management: YARN, Mesos, ZooKeeper, Oozie
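The MapReduce model named on this slide can be sketched in a few lines of Python: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count example below mirrors the canonical Hadoop tutorial example, but it is a single-process illustration, not Hadoop's actual API.

```python
from collections import defaultdict

# Toy word count in the MapReduce style: map -> shuffle -> reduce.
# In Hadoop, the map and reduce phases run in parallel across a cluster
# and the shuffle happens over the network; here everything is local.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (here, by summing counts)."""
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    lines = ["big data big analytics", "big data"]
    print(reduce_phase(shuffle(map_phase(lines))))
    # {'big': 3, 'data': 2, 'analytics': 1}
```

Because the reduce step only ever sees one key's values at a time, each group can be processed on a different machine, which is what lets the same three-phase pattern scale from this toy to cluster-sized data.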

83 Big Data Analytics – Technology & Vendor Mappings

Layer 3.5. Execution/Data Processing – Resource Management
- Workflow Management: Hue (open source), Ambari (Apache Software Foundation), Lipstick (Netflix), Ganglia (The Ganglia Project)

Layer 4. Data Repositories
- Relational Database – Traditional: SQL Server (Microsoft), Oracle 10g (Oracle)
- Relational Database – Parallel: Teradata
- Data Appliances: HP Vertica (HP), IBM BigInsights (IBM), EMC Greenplum (EMC)
- NewSQL: ClustrixDB (Clustrix), MemSQL (MemSQL)
- Distributed File System: Hadoop DFS/HDFS (Apache Hadoop), AWS (Amazon Web Services), Tachyon (Tachyon Project)
- Packaged Solutions
- ODS (Operational Data Store)

84 Big Data Analytics – Technology & Vendor Mappings

Layer 4. Data Repositories – In-Memory
- Relational/NewSQL: MySQL (open source), PostgreSQL, AWS RDS (Amazon Web Services)
- Columnar DB: Cassandra (Apache Software Foundation), HBase (Apache Hadoop), AWS Redshift
- NoSQL: Hazelcast, Aerospike
- Metadata Storage (*Need assistance in locating)

85 Big Data Analytics – Technology & Vendor Mappings

Layer 4. Data Repositories – NoSQL
- Key-Value: Redis (open source), Riak (Basho), AWS DynamoDB (Amazon Web Services)
- Column Store: Cassandra (Apache Software Foundation), HBase (Apache Hadoop), AWS Redshift
- Graph Database: Neo4j (Neo Technology), OrientDB (Orient Technologies), ArangoDB
- Document Database: MongoDB (MongoDB, Inc.), Elastic, Couchbase
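The key-value versus document-store distinction above can be illustrated with plain Python dictionaries. This is a toy model of the data-model difference only, not the API of Redis, MongoDB, or any other listed product: a key-value store retrieves an opaque value by its exact key, while a document store can also query on fields inside the stored records.

```python
# Toy contrast between a key-value store and a document store.
# Real systems add persistence, indexing, and distribution; this only
# shows the difference in how data can be looked up.

kv_store = {
    "user:1": '{"name": "Amina", "city": "Nairobi"}',  # value is an opaque string
    "user:2": '{"name": "Brian", "city": "Mombasa"}',
}

doc_store = [
    {"_id": 1, "name": "Amina", "city": "Nairobi"},  # fields are queryable
    {"_id": 2, "name": "Brian", "city": "Mombasa"},
]

def kv_get(key):
    """Key-value access: retrieve a value by its exact key only."""
    return kv_store[key]

def doc_find(field, value):
    """Document access: filter on any field inside the documents."""
    return [doc for doc in doc_store if doc.get(field) == value]

if __name__ == "__main__":
    print(kv_get("user:1"))            # one opaque value back
    print(doc_find("city", "Nairobi")) # all documents matching a field
```

Graph and column stores extend the same idea in other directions: graph databases make the relationships between records first-class and traversable, while column stores organize values by column family for fast scans over one attribute.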

86 Big Data Analytics – Technology & Vendor Mappings

Layer 6. Presentation/Data Visualization
- Reporting & Dashboards: MicroStrategy, Datameer (*Need assistance in locating)
- Visualization Tools/Interactive Visual Analytics: Qlik Sense (Qlik), Tableau
- Real-time Alerts
- Website Front-end: D3 (open source), AngularJS (Google), Flask, Highcharts, Django (Django Software Foundation)
- API