1 Take Me to Your Data Why Data Stewardship is needed NOW!Boston Code Camp - Nov 19, 2016 SQL Saturday Providence – Dec 10 Beth Wolfset, Data Architect Contents herein are confidential and intended for the original recipients only.
2 About BlueMetal (an event sponsor)Modern technology, craftsman quality. We’re an interactive design and technology architecture firm matching the most experienced consultants in the industry to the most challenging business and technical problems facing our clients. Founded August 2010 and as of October 2015 we are an Insight company. 6 | YEARS IN OPERATION 5 | LOCATIONS 6 | SERVICE AREAS 4 | INDUSTRY SPECIALIZATIONS Locations: Boston, New York, Chicago, DC, Tempte Service Areas: Intelligent Customer Applications Modern Business Applications Real-Time Business Hybrid Cloud Modern Workplace Branch Infrastructure
3 Data Is An Asset “Whether you want it or not, the amount and variety of data are expanding exponentially. Embrace that trend and transition your organizations to understand information as a competency that needs the right people, processes and platforms” John Lewis, president & CEO, consumer group, NA, at Nielsen “organizations integrating high-value, diverse, new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20%.” Regina Casonato, et al, Gartner Research The Big Mystery: What’s Big Data Really Worth? - Vipal Monga, 2014 A Lack of Standards for Valuing Information Confounds Accountants, Economists What's Your Big Data Worth? - Ellis Booker, 2012 Big data experts say accounting rules need to catch up to the fact that information has value that should be reflected on a company's books Information Management in the 21st Century - Regina Casonato, Anne Lapkin, Mark A. Beyer, Yvonne Genovese, Ted Friedman, 2011 From "Information as a Byproduct" to "Information as an Asset" https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwj1jrmE76DQAhUJ5GMKHRctAqoQFgggMAA&url=http%3A%2F%2Fdownload.microsoft.com%2Fdownload%2F7%2FB%2F8%2F7B8AC B65-B1B3-0B523DDFCDC7%2FBig%2520Data%2520Gartner%2520information_management_in_the_21st%2520Century.pdf&usg=AFQjCNGeRAXrEH0knMDl_FaOJjSSqSCjyw&sig2=SBYNxImVvkOSkZevXItifw&cad=rja Gartner Says Worldwide Enterprise IT Spending to Reach $2.7 Trillion in 2012 – Peter Sondergaard, 2011 “Information is the oil of the 21st century, and analytics is the combustion engine.” Peter Sondergaard, Gartner Research
4 Topics Data Anarchy Use Cases Data Governance People – Data StewardsProcess Artifacts Take me to your Data!
5 The Data Landscape Transactional Source systems ETL Data WarehouseSQL Server Reporting Services (SSRS) SQL Server SQL Server Integration Services (SSIS) SQL Server Analysis Services (SSAS) Power BI Transactional Source systems ETL Staging Data Warehouse BI & Analytics OLTP ERP CRM LOB Historical Reporting Dashboards IoT Data Slide is a modification of Microsoft Slide that is found in several presentations : https://mva.microsoft.com/en-US/training-courses/big-data-analytics-with-hdinsight- hadoop-on-azure-10551?l=WZjfYu97_ Predictive Analytics from dreamstime.com Logs Predictive Analytics Devices Web Sensors Social Azure Machine Learning (AML) HD Insight Azure Data Lake Microsoft Azure
6 Popular Types of DatabasesDatabase Type Example Relational (SQL) SQL Server SQL Database Oracle Sybase MySQL MS Access Document Lotus Notes CouchBase CouchDB MongoDB OrientDB Raven Terrastore Azure DocumentDB Graph & Resource Description Framework (RDF) Neo4J Flock HyperGraph Infinite Graph Jena Sesame AllegoGraph Search Engine ElasticSearch Splunk Solr MarkLogic Sphinx Key-Value Berkely Level Memcached Riak Redis Azure Tables Column-Family Cassandra HBase Hypertable Amazon Simple DB Object-Oriented Cache ObjectStore Objectivity/DB Db4o Versant Hierarchical Data Stewards work with data regardless of where and how it will be persisted. Most information from the book: No SQL Distilled Other data from site: And from: Understanding NoSQL on Microsoft Azure by David Chappell Products offered by Microsoft
7 How many orders were placed yesterdayCurrent Issues AHEAD CHAOS Tribal Knowledge Bad or missing data Inconsistent definitions and usage across the business Duplicated efforts Inappropriate and unmanaged access Non-compliance Data Hoarding How many orders were placed yesterday Coding can be agile, only IT personnel would access directly. Data needs governance and stewardship When you have a question about what data is available or where to get it, where do you go? A person? With storage becoming cheaper and new ways to analyze data, Companies are becoming Data Hoarders https://www.datanami.com/2016/06/13/growing-menace-data-hoarding/
8 Use Case: Common Data ObjectsAcross systems, similar objects are manifested with different data structures. Events Taxonomies & Reference Data Business Objects Logs Auditing Utilization Demographic Data When the same objects is represented different ways, it is difficult to get a cross functional look In a transactional system, these are areas for master data management In a warehouse, these will be transformed before storing Predictive analytics efforts may have to perform manipulation
9 Use Case: Master Data Managementcore data that is essential to operation of the business consistent and uniform set of identifiers and extended attributes that describes the core entities Master Data Management a methodology that identifies the most critical information within an organization—and creates a single view of truth to power business processes discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets may be technology enabled Master Plan List tCustomer Customer Config Sales Customer Data Three Ways to Master Data Mutually Exclusive Vertically Fragmented Match and Merge Name: Severus Snape SSN: Address: 9 Galen St Phone: Name: S. Snape Degree: Engineering Name: Prof. Snape Emp Id: 456 Name: Prof. Severus Snape SSN: Emp Id: 456 Address: 9 Galen St Phone: Degree: Engineering Which is the right customer address? Mastering is always about having disparate sources of data and brining it together. Three types of mastering: Mutually exclusive – the data is different sources but there is no overlap. Mastering brings the sources into a single ‘list’ with a common structure. Example: class list across departments, vendor contract data Vertically fragmented – the data is in different sources and there are attributes of the data that are different in these sources. Mastering creates a single view of the data that appears as if all attributes are in one master record. It is important to identify the source of the data by attribute in this type of mastering. Only one application can create a new record. However, different applications may be able to update – but only the attributes they are the source for. Example: most likely the plan list. Match and merge – the data is in different sources and there is overlap of both rows of data and attributes within the data. Therefore, to master the data and provide a single view a complicated set of understanding the sources and attributes, ranking the owner, and recognizing the same information that arrives from those sources differently. These are the match and merge rules. Example: Provider data master. Why Do MDM Portal Provides single logical view of data that is consistent and trustworthy information Users see same data consistently across all applications and all plan Efficiency Reduces overhead Integrate data Compliance Facilitates industry pressures and government mandates Allows data integrity for data object Reuse - Expedites computing in multiple systems, architectures, platforms and applications Scalability - Support the projected growth Engineering Classes Math Classes Philosophy Classes Master Class List
10 Complete, Clean, Consistent and Current DataUse Case: Enterprise Information Management Integration Services Master Data Services Data Quality Services Primarily designed to implement ETL processes Provides a robust, flexible, fast, scalable and extensible architecture Master data management Manages reliable, centralized data Broadens its reach with a new Excel Add-in that can leverage Data Quality Services Knowledge-driven data cleansing Corrects and de-duplicates data Integrates with Integration Services Overview: SQL_Server_2012_Enterprise_Information_Management_Whitepaper.pdf by Graeme Malcolm Master Data Services (MDS) enables your organization to manage a trusted version of data SQL Server Master Data Management solution first available in 2008 R2, improved in 2016 Create a (logical) data model for master data (under Shared Features) Create validation and business rules (can be defined in Excel) with workflow Allows for load of data from various data sources Allows manual modification of trusted version Versioning Security Notifications Sharepoint integration Requires: SQL Server Database: separate instance from sources - Includes staging tables and master data tables Web front end Good Reading Peter Myers End-to-End SQL Server Master Data Services - sqlSaturday393_E2E-MDS.pdf Jeremy Kashel https://www.sqlbits.com/Sessions/Event7/end_to_end_master_data_management_with_sql_server_master_data_services Data Quality Service (DQS) provides a knowledge-based approach to managing data quality. Allows customers to cleanse, match, standardize, and enrich data to make sure it is accurate, consistent, and complete Complete, Clean, Consistent and Current Data
11 Use Case: Microsoft Power BIData sources Power BI service Content packs Natural language query Sharing & collaboration SaaS solutions E.g. Marketo, Salesforce, GitHub, Google analytics Live dashboards On-premises data E.g. Analysis Services Visualizations Organizational content packs Corporate data sources, or external data services Reports Every morning show me order status Azure services E.g. Azure SQL, Stream Analytics Datasets 01001 10101 Excel files Workbook data or data models Power BI is taking Excel power pivot and power query and ‘enterprising’ it The tabular model is another example of enterprising power pivot – the tabular model is a power pivot placed in the analytic services. SSAS provides scheduling capabilities to automate loading data through Stored Procedures. Validation can be performed which is more difficult in Excel. Power BI Desktop files Related data from files, databases, Azure, and other sources Data refresh
12 What is Data GovernanceA process that defines the handling of data and information practices. It defines rules for the creation, access and modification of the data. It describes how to identify and resolve issues arising from non-compliance. Process People Artifacts Which list of customers is correct? DATA Data governance (DG) is a formalized endeavor to manage the availability, usability, integrity, and security of the data employed in an enterprise. A sound data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures. - Martha Dember and April Reeve Data Governance brings together cross-functional teams to make interdependent rules or to resolve issues or to provide services to data stakeholders. These cross-functional teams – Data Stewards and/or Data Governors – generally come from the Business side of operations. They set policy that IT and Data groups will follow as they establish their architectures, implement their own best practices, and address requirements. Data Governance can be considered the overall process of making this work. - Data Governance is strategic, Data Stewardship is tactical (daily and routine care) Decision governance is about instilling a discipline for 1) measuring data-based decisions at the individual and organization levels, 2) managing and growing the organization’s decision-making assets (data, analytic models, decision-delivering capabilities) and 3) growing the decision-making effectiveness of the organization and culture.- https://infocus.emc.com/william_schmarzo/data-governance-on-the-road-to-decision-governance/ Data Governance Guardian of the data Permissions to access for functionality and data Data integration rules Validation of data handling CHAOS
13 Process: Getting Started“It’s easier to ask forgiveness than it is to get permission” -- Grace Hopper Permission First Forgiveness Later Data Stewards CIO Management People? Regular meetings Time? Competing Concerns Executive Pain Points Corporate Risks Action Items Current Data Issues Build Successes Licensed Products Technology? In-House Public Domain Management Approves Priorities? Team Preference How much is this going to cost me? 14 percent of your invoices are delayed due to outdated customer addresses Who has access to the financial data of your customers and ask what would happen if that data were breached Getting Permission First Verify CIO understands Conduct separate presentations for executives across business Focus on their current issues and risks and how governance will solve Explain how their own team members would be involved Show components that already exist Promote a phased approach Allow management to set the priorities
14 Process: Data GovernanceGraphic from: https://www.collibra.com/blog/crowdsourcing-data-governance/ The data governance process or framework needs to enable an organization to appropriately data Governance Council - Experts Team of data stewards. These are regular business data users as well as IT SMEs Meet regularly Determined agenda Crowdsourcing Everyone has rights as a data citizen Requires self-service technology with appropriate clean and control tools
15 Process: Data Governance OrganizationGovernors IT (DA) ~ Business Business Finance Contracts/ Legal Customer Service Sales Marketing IT BI DA/DBA/ ETL Experts
16 What is Data GovernanceA process that defines the handling of data and information processes. It defines rules for the creation, access and modification of the data. It describes how to identify and resolve issues arising from non-compliance. Process People Artifacts I want the same answer no matter who I ask? DATA Data Governance Guardian of the data Permissions to access for functionality and data Data integration rules Validation of data handling CHAOS
17 People: Delivering Tangible Benefits“…only business users close to the content can evaluate information in its business context” -- Gartner Challenges Requirements Information Workers Spend too much time searching for data Excessive efforts to prepare data for use Reduces time to actually analyze data Seamlessly find and access relevant data Easily enrich data to make it useable Deliver annotated findings Data Stewards The lack of trust in information continues as a significant inhibitor to businesses Improve quality, usefulness and discoverability of data Promote the correct usage of trusted data Foster community of productive data users Slide adapted from: pugetsound.sqlpass.org/.../ %20Matthew%20Roche%20Power%20BI.pptx Affects midsize to very large organizations IT Professionals IT spends too much time and resources servicing data requests from the business while trying to secure and govern data access and use Balance self-service data discovery for the business with IT need for visibility and control Reduce human and infrastructure resources required for data discovery and enrichment
18 People: The Data Steward“ I’m a business subject matter expert, sitting in IT or LOB as a liaison between the two. Depending on the size and type of the business, I may do part of someone else’s job (e.g. Anna or Vicki). Accountabilities Skills Making data useful to the business Consistent use of data across the business Promoting and achieving high data quality standards Resolving data integrity issues across stakeholders 5+ years of industry experience Proficient with Office (Excel, Word, PowerPoint). Can learn to use Power Pivot Understands data relationships, data process flows. May know SQL. ” Perspectives Work Activities Process and detail oriented with great organizational skills Prides himself on his creative resourcefulness, passion for quality and great interpersonal skills A ‘de facto’ steward because of deep industry expertise and understanding of his organization’s data sources Analyzes data for quality (particularly as part of BI work), reconciles data issues Identifies and acquires new data sources Actively analyzes data for ‘semantic’ quality Drives resolution of data integrity issues across business and technical stakeholders. Leads and / or participates in MDM / EIM / DQ initiatives Creates and maintains business metadata, references data values and meanings, and / or master data values and meanings Slide Story: Stewart is a Microsoft persona, focus on establishing context with the audience This is a Microsoft Slide in several presentations, usually done by Matthew Roche. Matthew is a Senior Program Manager a has a number of videos on MVA and Channel 9 in the Data Architecture realm. Slide adapted from: pugetsound.sqlpass.org/.../ %20Matthew%20Roche%20Power%20BI.pptx Stewart Data Steward Provisions & distributes high quality data Source: pugetsound.sqlpass.org/.../ %20Matthew%20Roche%20Power%20BI.pptx
19 People: Data Steward and Schema TypesSchema-on-write Implies a structured database (not necessarily relational) Data structure determined prior to data storage Schema-on-read Implies a data set Data may be stored in methods that do not require the structure to be understood a priori Structure of data is defined at query time Data Steward Understands what data is available and how to get it Data requires documentation Data is data. Turning it into useful information requires the same skills regardless if the data is structured ~ schema-on-write OR unstructured ~schema-on-read
20 What is Data GovernanceA process that defines the handling of data and information processes. It defines rules for the creation, access and modification of the data. It describes how to identify and resolve issues arising from non-compliance. Process People Artifacts DATA Data Governance Guardian of the data Permissions to access for functionality and data Data integration rules Validation of data handling CHAOS
21 Data Governance ArtifactsBusiness Glossary / Enterprise Data Dictionary Analysis Products Data Management Security Data Cleanup / Purge / Archiving Information Infrastructure Education Resource Recommendation DB Release Management Protocol Success Measures How do I know this is working? Business Glossary/Enterprise Data Dictionary Naming Standards Definitions Data Management Master/Common Data Change requests Relational and Non-Relational Data Data Analysis Products Metric Definition, Harmonization & Adoption Reports Repository Predictive Analytics Data Security Guidelines Data access Privacy classification: Public, Company Confidential, PII, PHI Data Cleanup / Purge / Archiving Information Infrastructure Education Data Owners Availability of Data Notification data users Resource Recommendations: personnel and skills DB Release Management Protocol Testing of common/MDS Promotion of changes and versioning Success Measures Compliance reports Validation Metrics Monitoring Presentations to management team
22 Validating the Output ArtifactsReports Predictive Models Prioritization Reports RunDate Data accurate as of Who uses the report Analysis Hypotheses and business value When was product created What criteria were considered: dates, data sources, data population What assumptions were made Statistical principals: sampling size, train vs test size, statistical model, significance Vetting approach and methodology What were the findings How, when and to whom were the findings presented When/Event or why should the analysis be repeated (findings become suspect) Data Analysis
23 Data Modeling Tools Tool Creator Supported Database Platforms Supported OSs Supported data models (conceptual, logical, physical) Supported notations Forward Engineering Reverse Engineering Model/database comparison and synchronization Repository ERwin Data Modeler ERwin Inc. (formerly part of CA Technologies) Access, IBM DB2, Informix, Ingres, MySQL, Oracle, Progress, MS SQL Server, Sybase, Teradata Windows Conceptual, logical, physical IDEF1X, IE (Crows feet), and more Yes Update database and/or update model Workgroup edition provides collaboration ER/Studio Embarcadero (acquired by IDERA) Access, IBM DB2, Informix, Hitachi HiRDB, Firebird, Interbase, MySQL, MS SQL Server, Netezza, Oracle, PostgreSQL, Sybase, Teradata, Visual Foxpro and others via ODBC/ANSI SQL Conceptual, logical, physical, ETL IDEF1X, IE (Crows feet) ER/Studio Repository and Team Server (formerly Portal/CONNECT) for collaboration Enterprise Architect Sparx Systems IBM DB2, Firebird, InterBase, Informix, Ingres, Access, MS SQL Server, MySQL, SQLite, Oracle, PostgreSQL, Sybase Windows, Linux, Mac Conceptual, Logical & Physical + MDA Transform of Logical to Physical IDEF1X, UML DDL, Information Engineering & ERD Multi-user collaboration using File, DBMS or Cloud Repository (or transfer via XMI, CVS/TFS or Difference Merge). SQL Server Management Studio Microsoft MS SQL Server Physical Oracle SQL Developer Data Modeler Oracle Oracle, MS SQL Server, IBM DB2 Cross-platform Logical, physical PowerDesigner Sybase MS SQL Server, Oracle, PostgreSQL, MySQL, IBM DB2, Informix Information reprinted from: https://en.wikipedia.org/wiki/Comparison_of_data_modeling_tools This is a subset of Data Modeling tools on the market. Erwin and ER Studio are listed first as they are the most commonly used tools and encompass full features. Oracle SQL Developer is based on the Oracle Designer product and was scaled for Oracle JDeveloper. It is now free and part of the SQL Developer product. PowerDesigner was quite popular at one time. Enterprise Architect is embedded with the EA suite, however does not include full features of other data modeling tools. SQL Server Management Studio allows diagrams to be developed, or a model reverse engineered from a reporting db. It does not encompass many other features. Visio does allow for the development of ER diagrams in the Professional edition when the ‘data’ option is included. It does not include the features of other data modeling tools. Entity Framework has tooling to allow the creation of a Entity Data Model. This is minimal functionality for a data model tool and in theory supports the physical model only.
24 Metadata Management SolutionsGartner: Data Tools Left: Magic Quadrant for Data Quality Tools November 18, 2015 https://www.gartner.com/doc/reprints?id=1-2T6O9CP&ct=151202&st=sb Right: Magic Quadrant for Metadata Management Solutions August 15, 2016 https://www.informatica.com/metadata-management-magic-quadrant.html#fbid=hy-CsQ_CB0v https://www.gartner.com/doc/reprints?id=1-3I7OZ5P&ct=160922&st=sg&utm_medium= &_hsenc=p2ANqtz-8x9X3VSpBudrkFup-3BpQbdsaHzOR7TBVsGOdzXPDCsQvwB_UpHoMZE7PusZIzI3l93E2tDPEH3uVLWcr6kMCMKj6WUA&_hsmi= &utm_content= &utm_source=hs_automation&hsCtaTracking=1ad9bb9f-70ca b a681%257Ce08ecb82-0ee bad772d7c21d Data Quality Tools Metadata Management Solutions
25 Complete, Clean, Consistent and Current DataUse Case: Enterprise Information Management Integration Services Master Data Services Data Quality Services Primarily designed to implement ETL processes Provides a robust, flexible, fast, scalable and extensible architecture Master data management Manages reliable, centralized data Broadens its reach with a new Excel Add-in that can leverage Data Quality Services Knowledge-driven data cleansing Corrects and de-duplicates data Integrates with Integration Services Overview: SQL_Server_2012_Enterprise_Information_Management_Whitepaper.pdf by Graeme Malcolm Master Data Services (MDS) enables your organization to manage a trusted version of data SQL Server Master Data Management solution first available in 2008 R2, improved in 2016 Create a (logical) data model for master data (under Shared Features) Create validation and business rules (can be defined in Excel) with workflow Allows for load of data from various data sources Allows manual modification of trusted version Versioning Security Notifications Sharepoint integration Requires: SQL Server Database: separate instance from sources - Includes staging tables and master data tables Web front end Good Reading Peter Myers End-to-End SQL Server Master Data Services - sqlSaturday393_E2E-MDS.pdf Jeremy Kashel https://www.sqlbits.com/Sessions/Event7/end_to_end_master_data_management_with_sql_server_master_data_services Data Quality Service (DQS) provides a knowledge-based approach to managing data quality. Allows customers to cleanse, match, standardize, and enrich data to make sure it is accurate, consistent, and complete Complete, Clean, Consistent and Current Data
26 Azure Data Catalog Provides discovery and registration of a variety of data sources Enterprise Data Dictionary using crowdsourcing Business Glossary allowing for hierarchical taxonomies Azure Data Catalog is an evolution of the Office 365 Data Catalog Allows direct access to data Requires: Microsoft Account (work or school account) Azure Subscription Azure Active Directory (with Forms Authentication enabled in the Global Authentication Policy) Diagram from: https://azure.microsoft.com/en-us/documentation/articles/data-catalog-what-is-data-catalog/
27 Thank you. Beth Wolfset [email protected]We appreciate your interest, and look forward to working with you in the future! Beth Wolfset Nice report. Now can you add …. Data Governance Guardian of the data What is collected, how and why Definitions Permissions to access for functionality and data Data integration rules Transformation Integrity Validation of data handling
28 Data, Data, Everywhere Transactional – man-made river – structured data designed for a single purpose. Operational Data Store – river mouth – transactional data from multiple sources brought together in real-time to support additional transactional uses including reporting. Reporting - reservoir – structured data designed for a purpose (reporting). Real-time streaming – unstructured operational data from multiple devices Data Warehouse – ocean – structured data from multiple sources, that is usually aggregated and refreshed with some periodicity Data Mart – a bottle of water or man-made pond, cleansed and packaged for easy consumption Data Lake – contents from various sources and formats where users can come to examine, dive in, or take samples. Data is usually stored in its raw format. Business Glossary/Enterprise Data Dictionary Naming Standards Definitions Data Management Master/Common Data Change requests Relational and Non-Relational Data Data Analysis Products Metric Definition, Harmonization & Adoption Reports Repository Predictive Analytics Data Security Guidelines Data access Privacy classification: Public, Company Confidential, PII, PHI Data Cleanup / Purge / Archiving Information Infrastructure Education Data Owners Availability of Data Notification data users Resource Recommendations: personnel and skills DB Release Management Protocol Testing of common/MDS Promotion of changes and versioning Success Measures Compliance reports Validation Metrics Monitoring Presentations to management team