Chapter 11: Big Data and Analytics


1 Chapter 11: Big Data and Analytics. Modern Database Management, 12th Edition, Global Edition. Jeff Hoffer, Ramesh Venkataraman, Heikki Topi. Instructor: Prof. 楊立偉, Department of Business Administration, National Taiwan University.

2 Big Data and Analytics: Introduction
Big data: data that exist in very large volumes and in many different varieties (data types), and that need to be processed at very high velocity (speed).
Analytics: the systematic analysis and interpretation of data, typically using mathematical, statistical, and computational tools, to improve our understanding of a real-world domain.
This chapter is primarily about these two new advances in data technologies and approaches. Traditional data management technologies were created to ensure accurate and efficient transaction processing. As we saw in Chapter 9, later database structures, called data warehouses, were created to support decision making and an overall understanding of the business. Big data and analytics take us further down this road.

3 Characteristics of Big Data: The Five Vs
- Volume: a much larger quantity of data than is typical for relational databases
- Variety: many different data types and formats
- Velocity: data arrives at a very fast rate (e.g. mobile sensors, web clickstreams)
- Veracity: traditional data quality methods don't apply; how do we judge the data's accuracy and relevance?
- Value: big data is valuable to the bottom line, and for fostering good organizational actions and decisions

4 Characteristics of Big Data: Schema on Read, Rather than Schema on Write
Schema on write: a pre-existing data model; how traditional (relational) databases are designed.
Schema on read: the data model is determined later, depending on how you want to use the data. Capture and store the data now, and worry about how you want to use it later.
Data lake: a large integrated repository for internal and external data that does NOT follow a predefined schema. Capture everything, dive in anywhere, flexible access.
Big data processing is done in batch mode and may take hours to sift through the huge volume of data for an analytical task. Big data processing via Hadoop is good for long analytical tasks but not so good for exploring individual cases and their relationships; for this, relational databases and data warehouses are better.
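The schema-on-read idea can be sketched in a few lines of Python: raw records of mixed shape are captured first, and a structure is imposed only at query time. The record formats and field names here are invented for illustration, not taken from the textbook.

```python
import json

# Schema on write would reject records that don't match a predefined
# table structure. Schema on read: capture everything as raw text first.
raw_lake = [
    '{"type": "order", "customer": "C1", "amount": 250}',
    '{"type": "clickstream", "url": "/home", "ms": 31}',
    '{"type": "order", "customer": "C2", "amount": 400}',
]

# The "schema" is applied only at read time: here we decide we care
# about orders, and about just two of their fields.
orders = [
    (rec["customer"], rec["amount"])
    for rec in map(json.loads, raw_lake)
    if rec.get("type") == "order"
]
```

A different analytical task could reread the same raw records with a completely different structure, which is the flexibility the data lake promises.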

5 Figure 11-2: Schema on write vs. schema on read
Traditional database design vs. the big data approach. Collect and store everything in the data lake, and worry about what you are going to do with it later; that is the big data approach. There are many ways to structure data, and you can do it after the fact, as long as the data follows certain data standards (such as XML or JSON).

6 Figure 11-1: Examples of JSON and XML
JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) both provide a semi-structured means of representing data hierarchically. JSON is also a major component of the JavaScript language. We talked a lot about XML in Chapter 8.
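As a small illustration (not the textbook's Figure 11-1), here is one hypothetical customer record expressed in both notations and parsed with Python's standard library:

```python
import json
import xml.etree.ElementTree as ET

# The same semi-structured, hierarchical record in both notations.
json_text = '{"customer": {"id": "C1", "orders": [101, 102]}}'
xml_text = '<customer id="C1"><order>101</order><order>102</order></customer>'

# JSON parses directly into nested dictionaries and lists.
cust = json.loads(json_text)["customer"]

# XML parses into an element tree that is traversed by tag name.
root = ET.fromstring(xml_text)
xml_orders = [int(o.text) for o in root.findall("order")]
```

Both parses recover the same hierarchy (one customer, two order numbers), which is the sense in which either standard can impose structure on data-lake contents after the fact.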

7 NoSQL Databases
NoSQL = Not Only SQL (most NoSQL systems also support SQL): a category of recently introduced data storage and retrieval technologies not based on the relational model.
- Scaling out rather than scaling up
- Designed for cloud environments
- Largely open source
- Supports schema on read
- BASE (basically available, soft state, eventually consistent) instead of ACID properties

8 NoSQL Classifications
(Everything here could also be modeled with the E-R model; think of these as DBMSs enhanced for specific purposes.)
Key-value stores: support (fast) storage of keys and values. A simple pair of a key and an associated collection of values; the key is usually a string. The database has no knowledge of the structure or meaning of the values.
Document stores: support storage of documents. Like a key-value store, but a "document" goes further than a "value": the document is structured, so specific elements can be manipulated separately.
Wide-column stores: support distributed storage of records and columns. Rows and columns; distribution of data is based on both key values (records) and columns, using "column groups/families".
Graph-oriented databases: support (fast) storage of network-structured data. They maintain information about the relationships between data items: nodes with properties, and connections between nodes (relationships) that can also have properties.
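The key-value vs. document distinction above can be sketched with plain Python dictionaries. This is only an illustration of the two storage models, not the API of any real NoSQL product, and the keys and fields are made up:

```python
# Key-value store: the DBMS treats the value as an opaque blob;
# you can only fetch or replace the whole thing by key.
kv_store = {}
kv_store["user:42"] = '{"name": "Ann", "city": "Taipei"}'  # just a string

# Document store: the value is a structured document, so individual
# elements can be read or updated without rewriting the whole value.
doc_store = {"user:42": {"name": "Ann", "city": "Taipei"}}
doc_store["user:42"]["city"] = "Tainan"  # update one element in place
```

To make the same change in the key-value store, the application would have to fetch the blob, parse it, modify it, and write the whole value back; that round trip is exactly what document stores avoid.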

9 Figure 11-3: Four-part figure illustrating NoSQL databases
As you can see, these are very different from relational databases. But you still have the notion of a unique entity, identified by its key, and the notion that this entity has attributes. Also remember from Chapter 5 that key-value storage is nothing new; remember hashed file organizations?

10 NoSQL Comparison and Examples
- Redis: key-value store DBMS
- MongoDB: document store DBMS
- Apache Cassandra: wide-column store DBMS
- Neo4j: graph DBMS
NoSQL brings an exciting new dimension to the database profession. Note that "NoSQL" does not mean "No SQL"; it means "Not Only SQL". SQL will be a part of database systems for some time to come.
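For the graph model specifically, a toy sketch may help: nodes carry properties, and so do the relationships between them. The node and relationship names below are invented for illustration and this is not Neo4j's actual API:

```python
# Nodes with properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "acme":  {"label": "Company"},
}

# Relationships also carry properties: (from, relationship, to, properties).
edges = [
    ("alice", "WORKS_AT", "acme", {"since": 2019}),
]

def neighbors(node, rel):
    """All nodes reachable from `node` via a `rel` relationship."""
    return [dst for src, r, dst, _props in edges if src == node and r == rel]
```

A graph DBMS stores and indexes this structure natively, so traversals like `neighbors` stay fast even across millions of nodes, instead of requiring repeated relational joins.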

11 Hadoop
Hadoop is an open source implementation framework of MapReduce. MapReduce is an algorithm for massively parallel processing of various types of computing tasks. The Hadoop Distributed File System (HDFS) is a file system designed for managing a large number of potentially very large files in a highly distributed environment.
Hadoop is the most talked-about big data management technology today, and it is a good way to take a big problem and allow many computers to work on it simultaneously. A key feature of big data is highly distributed processing: the internet allows you to share tasks across hundreds or thousands of computers. Hadoop is made up of MapReduce and HDFS. Its essence is processing very large amounts (terabytes or petabytes) of data by distributing the data and the processing tasks among a large number of low-cost commodity servers, that is, plain old PCs and servers.

12 Figure 11-6: Schematic representation of MapReduce (Mapper, Reducer)
In this example, the data is in the form of key-value pairs, so the keys in the figure are k1, k2, etc. Assume that the problem is to get an overall count of how many times each key appears in the data. For example, suppose the keys are customer IDs and the values are individual orders for each customer, so we are trying to figure out how many orders each customer made. Also assume that the original input data is far larger than what a single relational DBMS can handle.
The original input data is received at the master and then divided into subsets, each of which goes to an individual slave node; each slave node gets some number of sales orders. This is the map process. There may be hundreds of these nodes. Each node works on its own subset of the data; its task is to find out how many orders there are for each customer. For example, the node at the bottom found 3 k1 orders and 2 k5 orders. Note that multiple nodes may find examples of the same key; for instance, k2 is found in two of the nodes.
The shuffle step comes at the end of mapping; it merges all the map outputs and packages them for the reduce phase. At this point, all results have been collected and grouped by key value. The reduce stage then integrates the results of each of the map processes, creating the final result.
It is up to the developer to define the mapper and the reducer so that together they get the work done. MapReduce programming thus involves writing code to do the mapping and code to do the reducing.
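The map, shuffle, and reduce phases described above can be simulated sequentially in a few lines of Python. In real Hadoop the mapper and reducer run in parallel across many nodes; the customer IDs and orders here are invented to mirror the figure:

```python
from collections import defaultdict

# Input: (key, value) pairs, e.g. (customer ID, order) as in the figure.
input_pairs = [("k1", "o1"), ("k2", "o2"), ("k1", "o3"),
               ("k5", "o4"), ("k2", "o5")]

def mapper(key, value):
    # Emit a count of 1 for each order seen for this customer.
    return (key, 1)

# Map phase: in Hadoop, each slave node maps its own subset in parallel.
mapped = [mapper(k, v) for k, v in input_pairs]

# Shuffle phase: merge the map outputs, grouping all counts by key.
groups = defaultdict(list)
for key, count in mapped:
    groups[key].append(count)

def reducer(key, counts):
    # Integrate the partial counts into one total per customer.
    return (key, sum(counts))

# Reduce phase: final result, the number of orders per customer.
result = dict(reducer(k, c) for k, c in groups.items())
```

Only `mapper` and `reducer` are written by the developer; the framework handles the splitting, shuffling, and distribution, which is what makes the model scale to hundreds of nodes.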

13 Integrated Analytics and Data Science Platforms
Some vendors are bringing together traditional data warehousing and big data capabilities. Examples:
- HP HAVEn: Hewlett-Packard technologies combined with open source Hadoop and an analytics engine
- Teradata Aster: integrates SQL, graph analysis, MapReduce, and R
- IBM Big Data Platform: combines IBM technologies with Hadoop, JSON Query Language (JAQL), DB2, and Netezza

14 Analytics
Historical precedents to analytics: management information systems (MIS), then decision support systems (DSS), then executive information systems (EIS). The DSS idea evolved into business intelligence (BI).
Business intelligence: a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information.
Analytics encompasses more than BI; it is a broader term that includes BI, transforming data into a useful form, the infrastructure for analysis, data cleanup processes, and user interfaces.
The IT field is full of jargon and buzzwords whose names change over time. In one book or article, you may read that BI is an umbrella term that includes analytics as a component; others describe them in exactly the opposite relationship; and some say that BI and analytics are exactly the same thing.

15 Types of Analytics
Descriptive analytics: describes the past status of the domain of interest using a variety of tools, through techniques such as reporting, data visualization, dashboards, and scorecards.
Predictive analytics: applies statistical and computational methods and models to data about past and current events to predict what might happen in the future.
Prescriptive analytics: uses the results of predictive analytics, along with optimization and simulation tools, to recommend actions that will lead to a desired outcome.
With descriptive analytics, we ask what has already happened (last week, last year, etc.). With predictive analytics, we ask what is going to happen in the future and how it will affect us. With prescriptive analytics, we ask what is the best decision to make.

16 Figure 11-11: Generations of Business Intelligence and Analytics (adapted from Chen et al., 2012)
BI&A 1.0: focus on structured quantitative data, largely from relational databases.
BI&A 2.0: include data from the web (web interaction logs, customer reviews, social media).
BI&A 3.0: include data from mobile devices (location, sensors, etc.) as well as the Internet of Things.
BI&A has evolved along with other elements of information technology. It should be noted that there is far more unstructured and semi-structured data (characteristic of web and mobile technology) than there is structured data (typically found in relational databases). And although all data, structured and unstructured alike, is increasing in volume over time, the rate of growth is largest in the unstructured space.

17 Use of Descriptive Analytics
Descriptive analytics was the original emphasis of BI: reporting of aggregate quantitative query results, in tabular or data visualization displays.
- Dashboard: a few key indicators
- Scorecard: like a dashboard, but with a broader range
- OLAP: online analytical processing

18 Online Analytical Processing (OLAP) Tools
Online analytical processing (OLAP): the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.
Relational OLAP (ROLAP): OLAP tools that view the database as a traditional relational database, in either a star schema or another normalized or denormalized set of tables.
Multidimensional OLAP (MOLAP): OLAP tools that load data into an intermediate structure, usually a three- or higher-dimensional array.
OLAP is often (but not always) associated with multidimensional database structures, otherwise known as cubes; the next slide shows a figure of a cube. Typical cube operations include slicing, drill-down (and roll-up), and pivoting.

19 Figure 11-12: Slicing a data cube
This is a three-dimensional cube; each cell pertains to a particular measure for a particular product in a particular month. When you slice a cube, you reduce the number of dimensions by focusing on one "slice" of a particular dimension. In this case, we focused on shoes, so the two-dimensional result shows, for each month, the different measures (units, revenue, and cost). Slicing, dicing, pivoting, and drill-down are useful cube operations.
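Slicing can be sketched with a dictionary keyed by dimension values. The products, months, and figures below are made up; the point is only that fixing one dimension value turns a 3-D cube into a 2-D table:

```python
# A tiny three-dimensional cube: (product, month, measure) -> value.
cube = {
    ("shoes", "Jan", "units"): 50, ("shoes", "Jan", "revenue"): 5000,
    ("shoes", "Feb", "units"): 60, ("shoes", "Feb", "revenue"): 6200,
    ("hats",  "Jan", "units"): 20, ("hats",  "Jan", "revenue"): 800,
}

def slice_cube(cube, product):
    """Fix one value of the product dimension, yielding a 2-D result."""
    return {(month, measure): v
            for (p, month, measure), v in cube.items() if p == product}

# Focus on shoes, as in the figure: month x measure remains.
shoes_slice = slice_cube(cube, "shoes")
```

Dicing would restrict several dimensions at once to a sub-cube; the mechanics are the same filter applied to more than one key position.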

20 Figure 11-13: Example of drill-down from a summary report
Starting with summary data, users can obtain details for particular cells (in this example, the drill-down detail is shown with color added). The term drill-down refers to going from higher levels of aggregation (summary information) to more detail; you could also call it "disaggregation". Executing a drill-down (as in this example) may require that the OLAP tool "reach back" to the data warehouse (or even to the operational data) to obtain the detail data necessary for the drill-down. Automating this type of operation in an OLAP tool requires an integrated set of metadata.
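A minimal sketch of drill-down, with invented regional sales data: the summary report is an aggregation of detail rows, and drilling down means reaching back to the rows behind one summary cell.

```python
from collections import defaultdict

# Detail-level facts, as they might sit in the data warehouse.
detail = [
    {"region": "North", "store": "N1", "sales": 100},
    {"region": "North", "store": "N2", "sales": 150},
    {"region": "South", "store": "S1", "sales": 80},
]

# Summary report: roll the detail rows up to region level.
summary = defaultdict(int)
for row in detail:
    summary[row["region"]] += row["sales"]

def drill_down(region):
    """Reach back to the detail data behind one summary cell."""
    return [row for row in detail if row["region"] == region]
```

In a real OLAP tool, `drill_down` would be a query against the warehouse (or operational data) rather than a list filter, which is why the tool needs metadata linking each summary cell to its source.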

21 Figure 11-14: Sample pivot table with four dimensions: Country (pages), Resort Name (rows), and Travel Method and No. of Days (columns)
This figure shows a schematic of what the user interface for a cube might look like; often, the user interface is built in spreadsheet software such as Excel. Cubes can display as many dimensions as you want, and the user interface also supports pivoting. For example, it would be simple to switch the view so that resort name appears in the columns and travel method in the rows, or to bring countries into the columns and number of days into the pages. Although the screen has only two dimensions, you can include more dimensions by combining several in a row or column, and by using paging.
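Pivoting can be sketched as re-aggregating the same fact rows with a different choice of row and column dimensions. The resort bookings below are invented, and the measure name is hypothetical:

```python
from collections import defaultdict

# Fact rows: each dict has several dimensions and one measure (bookings).
rows = [
    {"country": "UK", "resort": "Alpine",  "travel": "Air",  "days": 7, "bookings": 3},
    {"country": "UK", "resort": "Alpine",  "travel": "Rail", "days": 7, "bookings": 1},
    {"country": "UK", "resort": "Coastal", "travel": "Air",  "days": 7, "bookings": 2},
]

def pivot(rows, row_dim, col_dims):
    """Aggregate bookings with one dimension on rows and the rest
    combined into a single column key."""
    table = defaultdict(int)
    for r in rows:
        col = tuple(r[d] for d in col_dims)  # combined column header
        table[(r[row_dim], col)] += r["bookings"]
    return dict(table)

# Resort on rows; travel method and days combined on the columns.
view = pivot(rows, "resort", ["travel", "days"])
```

Pivoting in the interface amounts to calling the same aggregation with the dimension roles swapped, e.g. `pivot(rows, "travel", ["resort", "days"])`; no new data is fetched, only the layout changes.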

22 Data Visualization
The representation of data in graphical and multimedia formats for human analysis: "a picture is worth a thousand words". Without showing precise values, graphs and charts can depict relationships in the data. Data visualization is often used in dashboards, as shown on the next slide.

23 Business Performance Management (BPM); Figure 11-16: Sample dashboard
BPM systems allow managers to measure, monitor, and manage key activities and processes to achieve organizational goals. Dashboards are often used to provide an information system in support of BPM. Charts like these are examples of data visualization, the representation of data in graphical and multimedia formats for human analysis.

24 Predictive Analytics
Statistical and computational methods that use data about past and current events to form models of what might happen in the future. Examples: classification trees, linear and logistic regression analysis, machine learning, neural networks, time series analysis, and Bayesian modeling.
Here are some examples of questions that might be answered through predictive analytics:
• What type of offer does a specific prospective customer need in order to become a new customer?
• What solicitation approaches are most likely to lead to new donations from the patrons of a non-profit organization?
• What approach will increase the probability of a telecommunications company persuading a household to switch to its services?
• What will prevent an existing customer of a mobile phone company from moving to another provider?
• How likely is a customer to lease their next automobile from the same company from which they leased their previous car?
• How profitable is a specific credit card customer likely to be during the next five years?
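As a toy example of predictive analytics, here is an ordinary least-squares line fit in pure Python, in the spirit of the last question on the list: fit a model to past customers and use it to score a new one. The tenure and profit figures are fabricated for illustration.

```python
# Past data: customer tenure in months vs. observed profit (made up).
tenure = [6, 12, 24, 36, 48]
profit = [50, 95, 190, 280, 385]

n = len(tenure)
mean_x = sum(tenure) / n
mean_y = sum(profit) / n

# Ordinary least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(tenure, profit))
         / sum((x - mean_x) ** 2 for x in tenure))
intercept = mean_y - slope * mean_x

def predict(months):
    """Predicted profit for a customer with the given tenure."""
    return intercept + slope * months
```

Real predictive analytics would use richer models (the logistic regression, trees, and neural networks named above) and validate them on held-out data, but the workflow is the same: fit on the past, score the future.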

25 Data Mining Tools
Knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics. Goals:
- Explanatory: explain observed events or conditions
- Confirmatory: confirm hypotheses
- Exploratory: analyze data for new or unexpected relationships
Text mining: discovering meaningful information algorithmically, based on computational analysis of unstructured textual information.
Traditional data mining works with structured data, such as the numerical and categorical data you would typically find in relational databases. But with the advent of the web and social media, there is a huge amount of unstructured data in the form of text in blogs, emails, and tweets. This data is just as important as, and actually much higher in volume than, structured data, so text mining is an increasingly important part of data mining.

26 These techniques are derived from computer science, artificial intelligence, and statistics. The choice of an appropriate technique depends on the nature of the data to be analyzed, as well as on the size of the data set.

27 Different data mining techniques are appropriate for different applications. For example, classification problems are typically handled using regression or decision trees; this is useful for target marketing and population profiling. Product affinity and upselling are handled via the affinity technique (also called association). The textbook describes a product called KNIME that used k-means, a clustering algorithm, to analyze churn.
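A minimal sketch of the k-means idea mentioned above, in one dimension with k = 2: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its points. The usage numbers are made up, and a real tool like KNIME does far more (initialization strategies, convergence tests, many dimensions).

```python
# Cluster customers by monthly usage (fabricated numbers), k = 2.
usage = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [usage[0], usage[-1]]  # naive initialization: first and last

for _ in range(10):  # a few assignment/update rounds suffice here
    clusters = [[], []]
    for x in usage:
        # Assignment step: each point joins its nearest centroid.
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update step: move each centroid to the mean of its points.
    centroids = [sum(c) / len(c) for c in clusters]
```

The two resulting groups (low-usage and high-usage customers) are the kind of segments a churn analysis would then examine separately.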

28 Online Analytical Processing (OLAP) Tools
Exercise: use MS Query to connect to a DBMS, and analyze the data with an MS Excel (Power) Pivot table. Connect via ODBC, and query by SQL or through the Wizard (to choose tables and fields).

29 Connect via ODBC, and query by SQL or through the Wizard (to join tables).

30 Import the data into a sheet in MS Excel.

31 Insert a pivot table, then choose fields and measures to filter and drill down into the data multi-dimensionally.