1 MCDB: A Monte Carlo Approach to Managing Uncertain DataAuthors : Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher M. Jermaine, Peter J. Haas By : Alaa Khier 4/12/2016
2 What are we going to talk about today :A brief reminder what is DATABASE SYSTEM . SQL . The MCDB : monte Carlo database system . An example. EXPERIMENTS
3 what is DATABASE SYSTEM ? DefinitionA database Is An Organized Collection Of Data. It is the collection of schemas, tables, queries, reports , views, and other objects. A database management system (DBMS) is a computer software application that interacts with the user,
4 SQL stands for Structured Query Language.lets you access and manipulate databases. What Can SQL do? SQL can execute queries against a database SQL can retrieve data from a database SQL can insert records in a database SQL can update records in a database SQL can delete records from a database SQL can create new databases SQL can create new tables in a database SQL can create views in a database SQL can set permissions on tables, procedures, and views …
5 Condition(B1,...,Bm) == Condition over attributesSELECT A1,...,Ak FROM R1,...,Rn WHERE Condition(B1,...,Bm) Attributes List of relation names Condition(B1,...,Bm) == Condition over attributes The result of an SQL query is a list of tuples .
6 Example SELECT DISTINCT name FROM Student, coursessid course 111 DB PL 222 Student sid name year 111 Alan 2 222 Zohar 1 333 Rami The query : SELECT DISTINCT name FROM Student, courses WHERE Student.sid = Courses.sid The query result name Alan Zohar
7 CREATE DATABASE dbname; CREATE TABLE table_name ( column_name1 data_type(size), column_name2 data_type(size), column_name3 data_type(size), .... );
8 SELECT AVG(column_name) FROM table_nameAggregation : Aggregate Functions SQL aggregate functions return a single value, calculated from values in a column. Useful aggregate functions: AVG() Returns the average value COUNT() - Returns the number of rows FIRST() - Returns the first value LAST() - Returns the last value MAX() - Returns the largest value MIN() Returns the smallest value SUM() - Returns the sum example : SELECT AVG(column_name) FROM table_name
9 A new approach to handling enterprise data uncertaintyMCDB
10 CREATE TABLE SBP _DATA (PID ,GENDER, SBP) AS FOR EACH p in PATIENTS WITH SBP AS Normal ( SELECT s.MEAN, s.STD FROM SPB_PARAM s ) SELECT p.PID, p.GENDER ,b.VALUE FROM SBP b
11 The MCDB Algorithm. VG Functions. NEW OPERATIONS IN MCDB . Monte Carlo Benefits. Monte Carlo Challenges. Schema Syntax Example .
12 MCDB : MCDB is based on possible world semantics. A relation is deterministic if its realization is the same in all possible worlds. otherwise it is random. Along with a set of VG (variable generation) functions for generating relation instances .
13 MCDB : The Monte Carlo Database System (MCDB)MCDB designed for flexible exploration of query result distributions . * under arbitrary SQL queries * A broad range of complex SQL queries.
14 MCDB : user defined stochastic models. All the data parametrization
15 MCDB : MCDB uses “variable generation”(VG) functions to pseudo randomly generate instances of each uncertain data value in a database , yielding a sample from the possible DB distribution.
16 CREATE TABLE SBP _DATA (PID ,GENDER, SBP) AS FOR EACH p in PATIENTS WITH SBP AS Normal ( SELECT s.MEAN , s.STD FROM SPB _PARAM s ) SELECT p.PID, p.GENDER ,b.VALUE FROM SBP b.
17 VG functions definition :A VG function is implemented as a C++ class with four public methods: Initialize(), TakeParams() OutputVals() Finalize().
18 How it work : MCDB calls the Initialize(SEED) method.MCDB executes the queries that specify the input parameter tables to the VG function. The result ,made available to the VG function in the form of a sequence of arrays called parameter vectors. The parameter vectors are fed into the VG function via a sequence of calls to TakeParams()
19 5) MCDB then executes the first Monte Carlo iteration by repeatedly calling OutputVals() to produce the rows of the VG function’s output table, 6) MCDB knows that the last output row has been generated when OutputVals() returns a NULL result . 7) MCDB invokes the VG function’s Finalize() method, which deletes any internal VG-function data structures
20 VG Implementation : Exampleimplement DiscreteChoice function for strings: Accepts x1; x2; : : : ; xn and “weights” w1;w2; : : : ;wn then normalizes the weights into a vector of probabilities P = (p1; p2; : : : ; pn) with pi = wi/ wj , returns a random string X distributed according to P.
21 The TakeParams() function simply adds the incoming parameter vector v to the list L.
22
23 The Finalize() method de-allocates the storage for L and destroys myRandGen.
24 VG functions can be arbitrary, it is very difficult to analytically compute the effect on the query result of the uncertainty that they embody. The solution MCDB avoids this problem by, using the VG functions to generate a large number of independent and identically distributed. realizations of the random database . running the query on each of them. summarizes the effect of the underlying uncertainty in the form of probability distribution over the possible query results .
25 MCDB : Repeating this process multiple times ,generates a set of independent and identically distributed samples from this distribution.
26 MCDB : Tuple bundles All the data parametrization Stochastic modelVG F Tuple bundles n database instances
27 TUPLE BUNDLES IN DETAIL :A tuple bundle T with schema S : Simply an array of N tuples Tuple bundles are manipulated using the new operators and all SQL operators . Realized attribute values for a random table R can be bundled With requirement on set of TUPLE BUNDLES ti for each i the set ri = Ùj tj[i] . MCDB tries to bundle tuples so as to maximize the number of “constant” attributes.
28 MCDB allows the implementer of a VG function to specify attributes as constant as a hint to the system . MCDB creates one tuple bundle for every distinct combination of constant-attribute values encountered. Non-constant attributes values compressed form by Storing only the seed .
29 MCDB : Given an SQL aggregation query of interest,MCDB executes the query on each sampled DB instance Generating samples from the query result distribution.
30 MCDB : All the data parametrization Stochastic model All the data VG Fn query results SQL n database instances
31 MCDB :Naive implementationGenerate an instance of each random relation 2. Once an entire instance of the database has been materialized, compile, optimize, and execute Q in the classical manner. Append every tuple in Q’s answer set with a number identifying the current Monte Carlo iteration. run this three steps N iteration .
32 MCDB :Query Processing in MCDBMCDB runs each query one time. MCDB delays random attribute materialization as long as possible. 3. In MCDB, values for random attributes are reproducible.
33 Query Answer :
34 MCDB : the query resultsMCDB returns its query results as a set of (ti; fi) pairs, ti distinct tuples produced in the course of N Monte Carlo iterations . fi is the fraction of the N possible worlds in which tuple ti appears. All the MCDB query results are returned in a graph .
35 Monte Carlo Benefits. exact computation imposes strong restrictions both on the class of queries that can be handled and on the characteristics of the query answer that can be evaluated . The Monte Carlo approach is that the same general purpose methods apply to any uncertainty model.
36 EXAMPLE CREATE TABLE SBP _DATA (PID ,GENDER, SBP) AS FOR EACH p in PATIENTS WITH SBP AS Normal ( SELECT s.MEAN, s.STD FROM SPB_PARAM s ) SELECT p.PID , p.GENDER , b.VALUE FROM SBP b.
37 EXPERIMENTS : Our experimental study is similarly focused, and has two goals: To demonstrate examples of non trivial and “what-if” analyses that are made possible by MCDB. If this analysis is actually practical in realistic application environment.
38 CREATE VIEW from_japan AS SELECTCREATE VIEW from_japan AS SELECT * FROM nation , supplier , lineitem , partsupp WHERE n_name=’JAPAN’ AND s_suppkey=ps_suppkey AND ps_partkey=l_partkey AND ps_suppkey=l_suppkey AND n_nationkey = s_nationkey CREATE VIEW increase_per_cust AS SELECT o_custkey AS custkey, SUM(yr(o_orderdate) ) /SUM( yr(o_orderdate)) AS incr FROM ORDERS WHERE yr(o_orderdate)=1994 OR yr(_orderdate)=1995 GROUP BY o_custkey CREATE TABLE order_increase AS FOR EACH o in ORDERS WITH temptable AS Poisson ( SELECT incr FROM increase_per_cust WHERE o_custkey=custkey AND yr(o_orderdate)=1995 ) SELECT t.value AS new_cnt, o_orderkey FROM temptable_t SELECT SUM(newRev-oldRev) FROM ( SELECT l extendedprice*(1.0-l_discount)*new_cnt AS newRev, (l_extendedprice*(1.0-l_discount)) AS oldRev FROM increase_per_cust, from_japan WHERE l_orderkey=o_orderkey)
39
40 Q3 : CREATE TABLE prc_hist( ph_month, ph_year, ph_prc, ph_partkey) AS FOR EACH ps in partsupp WITH time series AS Random Walk ( VALUES (ps_supplycost,12,"Dec",1995,(-0.02,0.04)) SELECT month, year, value, ps_partkey FROM time_series_ts SELECT MIN(ph_prc) AS min_prc, ph_month, ph_year, ph_partkey FROM prc_hist GROUP BY ph_month, ph_year, ph_partkey SELECT SUM(min_prc*l_quantity) FROM prc_hist, lineitem, orders WHERE ph_month=month(o_orderdate) AND orderkey=o_orderkey AND yr(o_orderdate)=1995 AND ph_partkey=l_partkey
41 Query Answer :
42 The results obtained by running the four queries
43 The MCDB New Operators
44 The Seed Operator. the Seed operator appends to each tuple created by R’s FOR EACH statement an integer unique to the (tuple, VG function) The Instantiate Operator this operator uses a VG function to generate a set of attribute values corresponding to a Monte Carlo iteration. The Split Operator. takes as input a tuple bundle, with a set of attributes Atts. splits the tuple bundle into multiple tuple bundles, for ech change to constant att . The Inference Operator. The output is distinct tuples and unbundled tuples .
45 Question ??????