A framework for easy development of Big Data applications


1 A framework for easy development of Big Data applications
Rubén Casado @ruben_casado

2 Agenda Big Data processing Lambdoop framework Lambdoop ecosystem Case studies Conclusions

3 About me :-)

4 Academics: PhD in Software Engineering, MSc in Computer Science, BSc in Computer Science. Work experience.

5 About Treelogic

6 Treelogic is an R&D-intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life.

7 TREELOGIC – Distributor and Sales

8 R&D: International projects, regional projects, R&D management system, internal projects. Research lines: Computer Vision, Big Data, Terahertz technology, Data Science, Social Media Analysis, Semantics. Solutions: Security & Safety, Justice, Health, Transport, Financial services, ICT tailored solutions.

9 7 ongoing FP7 projects (ICT, SEC, OCEAN), coordinating 5 of them. 3 ongoing Eurostars projects, coordinating all of them.

10 Research & INNOVATION: More than 300 partners in the last 3 years. 7 years’ experience in R&D projects. More than 40 projects with a budget over 120 MEUR. Overall participation in 11 European projects. Project coordinator in 7 European projects.

11

12 Agenda Big Data processing Lambdoop framework Lambdoop ecosystem Case studies Conclusions

13 What is Big Data? A massive volume of both structured and unstructured data, so large that it is difficult to process with traditional database and software techniques.

14 How is Big Data? Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization - Gartner IT Glossary -

15 3 problems Volume Variety Velocity

16 3 solutions Batch processing Real-time processing NoSQL


18 Batch processing Scalable Large amounts of static data Distributed Parallel Fault-tolerant High latency → Volume

19 Real-time processing Low latency Continuous unbounded streams of data Distributed Parallel Fault-tolerant → Velocity

20 Hybrid computation model Low latency Massive data + streaming data Scalable Combines batch and real-time results → Volume + Velocity

21 Hybrid computation model: All data → Batch processing → Batch results; New data → Real-time processing → Stream; Batch results + Stream → Combination → Final results
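The combination step of the hybrid model can be sketched in a few lines of Java. This is a minimal, illustrative sketch only: the class and method names here are assumptions for a toy count metric, not part of the Lambdoop API.

```java
import java.util.List;

// Minimal sketch of the Lambda Architecture combination step:
// a batch view over all historical data plus a real-time view
// over the new data, merged into one up-to-date result.
public class HybridCombiner {
    // Batch layer: precomputed view over all historical data (high latency)
    static long batchCount(List<Integer> allData) {
        return allData.size();
    }

    // Real-time layer: incremental view over data arrived since the last batch run
    static long streamCount(List<Integer> newData) {
        return newData.size();
    }

    // Combination: merge both partial results into the final answer
    static long combinedCount(List<Integer> allData, List<Integer> newData) {
        return batchCount(allData) + streamCount(newData);
    }
}
```

For a count, merging is simple addition; operations such as averages need the batch and real-time layers to keep enough partial state (e.g. sum and count) for the merge to stay exact.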

22 Processing paradigms. Batch processing: large amounts of static data, scalable solution (Volume). Real-time processing: computing streaming data, low latency (Velocity). Hybrid computation: Lambda Architecture (Volume + Velocity). Timeline: 2003 inception, 2006 1st generation, 2010 2nd generation, 2014 3rd generation.

23 Processing pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

24 Agenda Big Data processing Lambdoop framework Lambdoop ecosystem Case studies Conclusions

25 What is Lambdoop? Open-source framework. Software abstraction layer over open-source technologies: Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident, Avro, Redis. Common patterns and operations (aggregation, filtering, statistics…) already implemented; no MapReduce-like process. The same single API for the three processing paradigms: batch processing similar to Pig / Cascading; real-time processing using built-in functions, easier than Trident; hybrid computation model transparent to the developer.

26 Why Lambdoop? Building a batch processing application requires MapReduce development, other Hadoop-related tools (Sqoop, ZooKeeper, HCatalog…) and storage systems (HBase, MongoDB, HDFS, Cassandra…). Real-time processing requires streaming computing (S4, Storm, Samza), unbounded input (Flume, Scribe) and temporal data stores (in-memory, Kafka, Kestrel).

27 Why Lambdoop? Building a hybrid computation system (Lambda Architecture) requires: the application logic has to be defined in two different systems using different frameworks; data must be serialized consistently and kept in sync between the two systems; the developer is responsible for reading, writing and managing two data storage systems, performing the final combination and serving the final updated results.

28

29

30 Why Lambdoop? “One of the most interesting areas of future work is high-level abstractions that map to a batch processing component and a real-time processing component. There's no reason why you shouldn't have the conciseness of a declarative language with the robustness of the batch/real-time architecture.” - Nathan Marz - “Lambda Architecture is an implementation challenge. In many real-world situations a stumbling block for switching to a Lambda Architecture lies with a scalable batch processing layer. Technologies like Hadoop (…) are there but there is a shortage of people with the expertise to leverage them.” - Rajat Jain -

31 Lambdoop: Static data / Streaming data → Data → Operation → Workflow → Data

32 Lambdoop Batch Real-Time Hybrid

33 Data Input Information is represented as Data objects. Types: StaticData, StreamingData. Every Data object has a Schema describing the Data fields (types, nullables, keys…). A Data object is composed of Datasets.

34 Data Input Dataset: a Data object is formed by one or more Datasets. All Datasets of a Data object share the same Schema. Datasets are formed by Register objects; a Register is composed of RegisterFields.
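The hierarchy above can be pictured as a few nested types. This is a hypothetical sketch using Java 16+ records; these shapes are assumptions for illustration, not the actual Lambdoop classes.

```java
import java.util.List;

// Illustrative sketch of the described hierarchy:
// Data -> Dataset(s) -> Register(s) -> RegisterField(s).
record RegisterField(String name, Object value) {}        // one named field of a record
record Register(List<RegisterField> fields) {}            // one record (row)
record Dataset(List<Register> registers) {}               // registers sharing one Schema
record Data(String schemaName, List<Dataset> datasets) {} // StaticData / StreamingData
```

A Data object for the air-quality example would then hold one or more Datasets of Registers whose RegisterFields match the declared Schema (Station, SO2, NO, …).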

35 Data Input Schema: very similar to Avro definition schemas. Allows defining the input data’s structure: fields, types, nullables… JSON format.

Example data (air-quality records), columns: Station, Title, Lat., Lon., Date, SO2, NO, CO, PM10, O3, dd, vv, TMP, HR, PRB.

{
    "type": "csv",
    "name": "AirQuality records",
    "fieldSeparator": ";",
    "PK": "",
    "header": "true",
    "fields": [
        {"name": "Station", "type": "string", "index": 0},
        {"name": "Title", "type": "string", "index": 1, "nullable": "true"},
        {"name": "Lat.", "type": "double", "index": 2, "nullable": "true"},
        {"name": "Long.", "type": "double", "index": 3, "nullable": "true"},
        {"name": "PRB", "type": "double", "index": 20, "nullable": "true"}
    ]
}

36 Data Input Importing data into Lambdoop. Loaders: import information from multiple sources and store it in HDFS as Data objects. Producers: get streaming data and represent it as Data objects. Heterogeneous sources; information is serialized into Avro format.

37 Data Input Static data example: importing an air-quality dataset from local logs to HDFS with a Loader. The schema’s path is files/csv/Air_quality_schema.

// Read schema from a file
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);

38 Data Input Streaming data example: reading streaming sensor data from a TCP port with a Producer. Weather stations emit messages to port 8080; the schema’s path is files/csv/Air_quality_schema.

int port = 8080;
// Read schema
String schema = readSchemaFile(schema_file);
Producer producer = new TCPProducer("AirQualityListener", refresh, port, schema);
// Create the Data object
Data data = new StreamingData(producer);

39 Data Input Extensibility: users can implement their own data loaders/producers. Extend the Loader/Producer interface, read data from the original source, then get and serialize the information (Avro format) according to the Schema.

40 Operations Unitary actions to process data. An Operation takes Data as input, processes it and produces another Data object as output. Types of operations: Aggregation: produces a single value per Dataset. Filter: output data has the same schema as input data. Group: produces several Datasets, grouping registers together. Projection: changes the Data schema but preserves the records and their values. Join: combines different Data objects.

41 Operations Aggregation (1): Count, Average, Sum, MinValue, MaxValue, Mode. Aggregation (2): Skewness, Z-Test, Stderror, Variance, Covariance. Filter: Limit, TopN, BottomN, Max, Min. Group: RollUp, Cube, N-Til. Projection: Select, Frequency, Variation. Join: Inner Join, Left Join, Right Join, Outer Join.

42 Operations Extensibility (User-Defined Operations): new operations can be defined by implementing a set of interfaces. OperationFactory: factory used by the framework to get the batch, streaming and hybrid operation implementations when needed. BatchOperation: provides the MapReduce logic to process the input Data. StreamingOperation: provides Storm/Trident-based functions to process streaming registers. HybridOperation: provides the merging logic between streaming and batch results.
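A user-defined operation built against these four interfaces might look like the sketch below. The slide does not show the real signatures, so these minimal shapes, and the toy Count operation, are assumptions for illustration only.

```java
import java.util.List;

// Hedged sketch of the four interfaces named above; real signatures unknown.
interface BatchOperation {                 // MapReduce-style logic over static data
    long processBatch(List<String> records);
}
interface StreamingOperation {             // per-register logic over streaming data
    long processRegister(long state, String record);
}
interface HybridOperation {                // merges batch and streaming partial results
    long merge(long batchResult, long streamingResult);
}
interface OperationFactory {               // hands the framework each implementation
    BatchOperation batch();
    StreamingOperation streaming();
    HybridOperation hybrid();
}

// A toy user-defined Count operation implemented against those interfaces
class CountFactory implements OperationFactory {
    public BatchOperation batch()         { return records -> records.size(); }
    public StreamingOperation streaming() { return (state, record) -> state + 1; }
    public HybridOperation hybrid()       { return (b, s) -> b + s; }
}
```

The point of the factory is that the developer writes the operation once and the framework picks the batch, streaming or hybrid variant depending on the workflow type.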

43 Operations User Defined Operation interfaces

44 Workflows Sequence of connected Operations. A Workflow manages tasks and resources (check-points) in order to produce an output from input data and a set of Operations. BatchWorkflow: runs a set of operations on StaticData input and produces new StaticData as output. StreamingWorkflow: operates on StreamingData to produce another StreamingData. HybridWorkflow: combines static and streaming data to produce complete and updated results (StreamingData). Workflow connections: the output Data of one Workflow can feed another, so Workflows can be chained.

45 Workflows

// Batch processing example
String schema = readSchemaFile(schema_file);
Loader loader = new CSVLoader("AQ.avro", uri, schema);
Data input = new StaticData(loader);
Workflow wf = new BatchWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 45"));
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
Data output = wf.getResults();

46 Workflows

// Real-time processing example
Producer producer = new TCPPortProducer("QAtest", schema, config);
Data input = new StreamingData(producer);
Workflow wf = new StreamingWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("Estación Av. Castilla"));
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(filter);
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
    …
}

47 Workflows

// Hybrid computation example
Producer producer = new PortProducer("catest", schema1, config);
StreamingData streamInput = new StreamingData(producer);
Loader loader = new CSVLoader("AQ.avro", uri, schema2);
StaticData batchInput = new StaticData(loader);
Data input = new HybridData(streamInput, batchInput);
Workflow wf = new HybridWorkflow(input);
// Add a filter operation
Filter filter = new Filter(new RegisterField("Title"), ConditionType.EQUAL, new StaticValue("street 34"));
wf.addOperation(filter);
// Calculate the SO2 average on the filtered input data
Avg avg = new Avg(new RegisterField("SO2"));
wf.addOperation(avg);
// Run the workflow
wf.run();
// Get the results
while (!stop) {
    Data output = wf.getResults();
}

48 Results exploitation: Data → Operations (Filter, RollUp, StdError, Avg, Select, Cube, Variance, Join, …) → VISUALIZATION / EXPORT (CSV, JSON, …) / ALARM SYSTEM

49 Results exploitation Visualization

/* Produce from Twitter */
TwitterProducer producer = new TwitterProducer(…);
Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
/* Get results from the workflow */
Data results = wf.getResults();
/* Show results; set the dashboard refresh */
Dashboard d = new Dashboard(config);
d.addChart(LambdoopChart.createBarChart(results, new RegisterField("count"), "Tweets count"));

50 Results exploitation Visualization

51 Results exploitation Visualization

52 Results exploitation Export (CSV, JSON, …)

Data data = new StaticData(loader);
Workflow wf = new BatchWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
/* Get results from the workflow */
Data results = wf.getResults();
/* Export results */
Exporter.asCSV(results, file);
// Other exporters: MongoExport(results, conf); PostgresExport(results, conf);

53 Results exploitation Alarms

Data data = new StreamingData(producer);
StreamingWorkflow wf = new StreamingWorkflow(data);
/* Add operations to the workflow */
wf.addOperation(new Count());
/* Get results from the workflow */
Data results = wf.getResults();
/* Set an alarm. condition: true/false check (e.g. time or a certain value);
   action: what to execute (e.g. show results, send an email) */
AlarmFactory.setAlert(results, condition, action);

54 Agenda Big Data processing Lambdoop framework Lambdoop ecosystem Case studies Conclusions

55 Change configurations and easily manage the cluster. Friendly tools for monitoring the health of the cluster. Wizard-driven Lambdoop installation of new nodes.

56 Visual editor for defining workflows and scheduling tasks. Plugin for Eclipse. Visual elements for: input sources, Loaders, Operations, operation parameters, RegisterFields, static values, visualization elements. Generates workflow code. XML import/export. Scheduling of workflows.

57 Tool for working with messy big data, cleaning it and transforming it. Import data in different formats. Explore datasets. Apply advanced cell transformations. Refine inconsistencies. Filter and partition your big data.

58 Agenda Big Data processing Lambdoop framework Lambdoop ecosystem Case studies Conclusions

59 Social Awareness Based Emergency Situation Solver. Objective: to create event assessment and decision-making support tools that improve speed and efficiency when facing emergency situations. Exploit the information available in social networks to complement data about emergency situations. Real-time processing.

60 Alert detection Locations Information “Attached” resources (photo, video, links, …)

61 Static stations and mobile sensors in Asturias sending streaming data. Historical data of more than 10 years. Monitoring, trend identification, predictions. Batch processing + real-time processing + hybrid computation.

62 Quantum Mechanics / Molecular Dynamics: computer simulation of the physical movements of microscopic elements. Large amounts of data arriving as streams at each time-step. Real-time interaction (queries, visual exploration) during the simulation. Data analytics on the whole dataset. Real-time processing + batch processing + hybrid computation.

63 Agenda Big Data processing Lambdoop framework Lambdoop ecosystem Case studies Conclusions

64 Conclusions Big Data is not only batch processing. Implementing a Lambda Architecture is not trivial. Lambdoop: Big Data made easy. High abstraction layer for all processing models. All steps in the data processing pipeline. The same Java API for all programming paradigms. Extensible.

65 Conclusions Roadmap. Now: release an early version of the Lambdoop framework as open source; get feedback from the community; increase the set of built-in functions. Next: move all components to YARN; stable versions of the Lambdoop ecosystem; models (Mahout, Jubatus, Samoa, R). Beyond: configurable processing engines (Spark, S4, Samza, …); configurable data stores (Cassandra, MongoDB, ElephantDB, VoltDB, …).

66 If you want to stay tuned about Lambdoop, follow @ruben_casado @datadopter @treelogic