1 Kamanja in Action: Driving Value through Continuous Decisioning. Customer Advisory Council, April 2016
2 Agenda: Developments in Big Data | Open source in the bigger picture; Lambda Architecture & Kamanja | Why it is important & how it applies to you; Enabling Faster, Better Analytics | Modelling and Kamanja; Continuous Decisioning in Action | Live Demo of Kamanja; Future of Continuous Decisioning | Kamanja Architecture & Technology Roadmap; Working Together | Feedback & Innovation for Kamanja Use Cases
3 BIG DATA LANDSCAPE (2016): Open source technologies are central in the areas of greatest change in the big data ecosystem
4 The explosion of complexity creates pressure to evolve and innovate. Enterprises are racing to capitalize on three major advancements in the data space: Big Data (i.e. massive, inexpensive storage and distributed computing); real-time processing; data science. With the explosion of tools within each area, effectively integrating these technologies to drive business value remains a significant challenge. Open source technologies play a critical role in the evolving data ecosystem across each of these three areas.
5 Why Open Source. Open source benefits: cost, quality, security, freedom (leverage a robust community; higher quality of code; no vendor lock-in; control over data and code). Cost: it’s free, and it runs on commodity hardware. Quality: Linus’s law: given enough eyeballs, all bugs are shallow. Security: the more people looking for vulnerabilities, the more likely they will be found and fixed; the opposite of the “security through obscurity” argument. Freedom: no vendor lock-in, no upgrade treadmill.
6 Introducing Lambda Architecture and Kamanja
7 Old ways of thinking are entrenched in the traditional decisioning architecture. Two distinct, unlinked data processing channels exist in traditional decisioning: (1) processing of events through a real-time decision engine, potentially with access to an offline data store; (2) an asynchronous offline process where decision models are constructed and optimized.
8 The core framework of Lambda Architecture is powerful but has fundamental limitations. It enables advanced, real-time analytics by processing big data in batch and in real time, in parallel. Its fundamental approach is to provide views of the data that combine the best aspects of batch processing and real-time processing. Limitations in input/output and model implementation inhibit direct extension to many classes of applications, including continuous decisioning.
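As an illustration of the batch-plus-real-time idea on this slide, here is a minimal Python sketch of the generic Lambda pattern: a batch view recomputed from the full dataset, a speed view updated per event, and a query that merges the two. It is not Kamanja code; the names `build_batch_view`, `SpeedView`, and `query` and the per-account totals are assumptions made for the example.

```python
from collections import Counter


def build_batch_view(events):
    """Batch layer: recompute per-key totals from the full historical dataset."""
    view = Counter()
    for key, amount in events:
        view[key] += amount
    return view


class SpeedView:
    """Speed layer: incremental totals for events the batch view has not absorbed yet."""

    def __init__(self):
        self.totals = Counter()

    def update(self, key, amount):
        self.totals[key] += amount


def query(batch_view, speed_view, key):
    """Serving layer: merge the (stale but complete) batch view with the recent speed view."""
    return batch_view.get(key, 0) + speed_view.totals.get(key, 0)


# Hypothetical data: historical events go through the batch layer,
# a newly arrived event only through the speed layer.
batch = build_batch_view([("acct-1", 10), ("acct-2", 5), ("acct-1", 2)])
speed = SpeedView()
speed.update("acct-1", 3)
print(query(batch, speed, "acct-1"))
```

The merged query only reads views; nothing in this basic pattern feeds a decision or a model update back into the pipeline, which is the kind of gap the slide's last point refers to.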
9 The Extended Lambda Architecture is critical to enable continuous decisioning functionality. (1) Decisioning is applied to all data immediately upon availability. (2) Decisioning leverages all available data, including data stored in other layers. (3) Enhancements to the decisioning process are enabled through continuous feedback of data and model updates. (4) Actions may include triggering the start of other processes or sending alerts to a case management system. (5) Standard case management reports and workflow are augmented with advanced data visualization, drill-through capabilities, and search. (6) Models may be built and tested using all available data and a variety of tools, then quickly and easily deployed into production.
10 Lambda Architecture with Kamanja: Leveraging Open Source Big Data Technologies
11 Why is continuous decisioning important? Continuous decisioning is critical when: a decision must be made in real time; decisions should be based upon incoming event data and multiple sources of stored data; changes to stored data should immediately impact decision-making; model creation is complicated and requires access to many data points; models should adaptively evolve to optimize a decision’s performance. USE CASES: Fraud Risk Analysis Customer Contact Cyber Crime Telephony Interception Customer churn/retention Security & Compliance Audit & Governance Marketing Real-Time Offer
12 LigaData launched and champions Kamanja, an open source continuous decisioning platform hardened for enterprise reliability requirements, scalable to IoT-level data volumes, and enabling low-latency use cases. QUICK STATS (Building a Best-in-Class Decisioning Engine): more than 40,000 man-hours invested to date; 116,000 lines of code already written; 18 releases. COMPLEMENTARY TECHNOLOGIES
13 Modelling and Kamanja
14 Enabling Faster, Better Analytics: Modelling Approach on Kamanja. Data mining modellers use many tools: languages with rich, powerful libraries (R, Python); data mining software packages (SAS Enterprise Miner, Salford Systems, RapidMiner, KNIME, SPSS; 18 produce PMML). They default to a “combination of many models” (wisdom of the jury). Kamanja provides one process for productionizing: models can go into production in hours vs. weeks (focus on training); it is easier for a team to switch between software and algorithms; it is easier to hire (not limited to being an “X shop”); it helps deal with the expected shortage of ~1 million data scientists.
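The "combination of many models" idea can be sketched as a simple majority vote. This is only an illustration under stated assumptions: the three rule functions below are hypothetical stand-ins for models that might come from different tools, and `jury_decision` is an invented name, not a Kamanja API.

```python
from collections import Counter


def jury_decision(models, record):
    """Ask every model for a label and return the majority label with its vote share."""
    if not models:
        raise ValueError("at least one model is required")
    votes = [model(record) for model in models]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical stand-in models; in practice each could be produced by a different tool.
def high_amount(record):
    return "review" if record["amount"] > 1000 else "approve"


def new_account(record):
    return "review" if record["account_age_days"] < 30 else "approve"


def foreign_country(record):
    return "review" if record["country"] != record["home_country"] else "approve"


record = {"amount": 5000, "account_age_days": 10, "country": "US", "home_country": "US"}
label, share = jury_decision([high_amount, new_account, foreign_country], record)
print(label, round(share, 2))
```

An odd number of models avoids ties in a two-label vote; weighted voting or averaging of scores are common alternatives.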
15 Kamanja Supports a Diverse Modelling Toolset: Matrix of Vendors and Algorithms for Continuous Decisioning
16 Modelling Approach on Kamanja: Model Management. Requirement: need to manage tens to tens of thousands of models: per data segment (customer, network section, product area); per business strategy within segment (cross-sell, attrition, fraud, ...); per deployment within segment (risk mitigation type, media channel). Models have a NORMAL LIFECYCLE: the training data captures one snapshot of the universe; broad behavior normally drifts over time (e.g. bull vs. bear market); update by refreshing model training with more current data; use A/B testing to transition from the old to the refreshed model.
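One common way to run the A/B transition mentioned on this slide is to assign each entity to a model variant by hashing its identifier, so the same customer always sees the same model while the refreshed model's share is ramped up. The sketch below assumes that approach; `assign_variant` and the `cust-N` identifiers are hypothetical, and the slides do not say how Kamanja itself splits traffic.

```python
import hashlib


def assign_variant(entity_id: str, refreshed_share: float) -> str:
    """Deterministically route an entity to the 'current' or the 'refreshed' model.

    refreshed_share is the fraction of entities (0.0 to 1.0) sent to the refreshed model.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "refreshed" if bucket < refreshed_share else "current"


# Ramp-up example: start at 20% of customers on the refreshed model.
ids = [f"cust-{i}" for i in range(10000)]
on_refreshed = sum(assign_variant(i, 0.2) == "refreshed" for i in ids)
print(on_refreshed / len(ids))
```

Because the bucket depends only on the identifier, raising `refreshed_share` moves additional entities to the refreshed model without moving anyone back to the old one.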
17 Kamanja Demo
18 Continuous Decisioning in Action: Live Kamanja Demo. To address these challenges, LigaData is creating a user interface for Kamanja that allows: easy deployment of new models; the ability to monitor throughput and performance intuitively and flexibly; easy filtering and drill-downs.
19 Kamanja Architecture & Roadmap
20 Kamanja Architecture
21 Kamanja | Current Feature Set. Enterprise Readiness: basic statistics; metadata change audits; dual-role security (admin vs. non-admin). Performance & Scalability: leverage Big Data stack; parallel processing of models & messages; compiling to JVM; DAG. Models: languages (Java, Scala, PMML, JSON); data mining tools (2 PMML producers validated: R, KNIME). Integrations & Interoperability: databases; logs; flat files; social media data such as Twitter; streaming data sources (Kafka, MQ); NoSQL DBs; applications; reporting tools; HDFS. Ease of Use: simplified installation process; auto migration from older versions; support for evolving Hadoop stack; developer utilities (PMML test, clean utility, JSON validation); UI prototype.
22 The Future of Continuous Decisioning | Near Term Priorities for Kamanja. Enterprise Readiness | Help clients meet their security, audit (compliance), and resource efficiency needs; Increased Performance & Scalability | Enable SLA compliance for applications; Extending Model Support | Support models from popular data mining tools as well as models developed in Python; Expanded Integrations and Interoperability | Support standard transports such as Flume, UDP and HTTP as well as standard data formats such as Avro; Increased Ease of Use | Enable more efficient development & testing of models and develop an intuitive web UI to support efficient model management, development and administration
23 The Future of Continuous Decisioning | Focus of Development for Enterprise Readiness. Existing Features: basic statistics; metadata change audits; dual-role security (admin vs. non-admin). Planned Features: multi-tenancy; encryption & tokenization; security auditing/data lineage; integration with popular monitoring tools; integration with resource managers (e.g. YARN). Rationale: enable resource sharing across Kamanja installations as well as other Big Data installations; meet enterprise security, audit and monitoring requirements; meet uptime requirements of Kamanja-based mission-critical applications.
24 Focus of development to enhance performance and scalability. Key Developments: optimized DAG; distributed and hierarchical cache utilization; support for logical partitions; SLA aware. Rationale: enable dynamic scaling by decoupling parallel processing from physical partitioning; support SLA-critical applications by providing priority-based execution; dynamically adjust execution pipelines to optimize performance; optimize performance of Kamanja on Hadoop storage.
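To make "decoupling parallel processing from physical partitioning" concrete, the following Python sketch shows one generic way to do it: keys hash into a fixed number of logical partitions, and a separate, changeable table maps logical partitions to workers. This is an assumption-laden illustration, not Kamanja's design; `NUM_LOGICAL_PARTITIONS`, `logical_partition`, `assign_partitions`, and `route` are all hypothetical names.

```python
import zlib

# Hypothetical fixed count, chosen larger than any expected number of workers.
NUM_LOGICAL_PARTITIONS = 64


def logical_partition(key: str) -> int:
    """Map a message key to a logical partition; independent of cluster size."""
    return zlib.crc32(key.encode("utf-8")) % NUM_LOGICAL_PARTITIONS


def assign_partitions(workers):
    """Spread logical partitions over the workers currently available (round-robin)."""
    return {p: workers[p % len(workers)] for p in range(NUM_LOGICAL_PARTITIONS)}


def route(key: str, assignment) -> str:
    """Find the worker that currently owns the key's logical partition."""
    return assignment[logical_partition(key)]


# Scaling from two to four workers only changes the assignment table;
# every key stays in the same logical partition.
two_workers = assign_partitions(["w1", "w2"])
four_workers = assign_partitions(["w1", "w2", "w3", "w4"])
print(logical_partition("cust-42"), route("cust-42", two_workers), route("cust-42", four_workers))
```

Keeping the key-to-partition mapping fixed means per-key state and ordering can move with a partition as a unit when workers are added or removed.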
25 Focus of development to expand the range of integration and interoperability. Key Developments: inputs (HTTP/UDP end points, Flume, Avro format); outputs (Avro format, Graph DB, Elasticsearch). Rationale: support a wider range of data sources and transports used in the enterprise; consume and produce standard formats to enable better interoperability across systems; provide the ability to integrate with advanced analytical tools to populate data/decisions in real time.
26 Focus of development to improve enterprise readiness. Key Developments: multi-tenancy; encryption & tokenization; security auditing/data lineage; integration with popular monitoring tools; integration with resource managers (e.g. YARN). Rationale: enable resource sharing across Kamanja installations as well as other Big Data installations; meet enterprise security, audit and monitoring requirements; meet uptime requirements of Kamanja-based mission-critical applications.
27 Customer Feedback
29 Supplemental Materials
30 Kamanja Roadmap - Current State: currently available features. Performance/Scalability: leverage Big Data stack; parallel processing of models & messages; compiling to JVM; DAG. Models: languages (Java, Scala, PMML, JSON); data mining tools (2 PMML producers validated: R, KNIME). Enterprise Readiness: basic statistics; metadata change audits; dual-role security (admin vs. non-admin). Integrations & Interoperability: databases; logs; flat files; social media data such as Twitter; streaming data sources (Kafka, MQ); NoSQL DBs; applications; reporting tools; HDFS. Ease of Use: simplified installation process; auto migration from older versions; support for changing Hadoop stack; developer utilities (PMML test, clean utility, JSON validation); UI prototype.
31 Kamanja Roadmap - Future State: features targeted over the next 6 months. Performance/Scalability: leverage Big Data stack; parallel processing of models & messages; compiling to JVM; DAG; optimized DAG; distributed and hierarchical cache utilization; support for logical partitions; SLA aware. Models: languages (Java, Scala, PMML, JSON, Python); data mining tools (9 PMML producers validated: R, KNIME, RapidMiner, SAS Enterprise Miner, Spark MLlib, IBM SPSS, Salford Systems, TIBCO, Angoss). Enterprise Readiness: basic statistics; metadata change audits; dual-role security (admin vs. non-admin); multi-tenancy; encryption & tokenization; security auditing/data lineage; monitoring; integration with popular monitoring tools; integration with resource managers (e.g. YARN); no-shutdown upgrades; support for multiple storages (e.g. Cassandra & HBase; HBase & Oracle). Integrations & Interoperability: databases; logs; flat files; social media data such as Twitter; streaming data sources (Kafka, MQ); NoSQL DBs; applications; reporting tools; HDFS; Avro format; HTTP/UDP end points; Flume; Graph DB; Elasticsearch. Ease of Use: simplified installation process; auto migration from older versions; support for changing Hadoop stack; developer utilities (PMML test, clean utility, JSON validation); UI prototype complete; model management; administration/monitoring; rule/model development; IDE integration; model testing & validation.
32 The Future of Continuous Decisioning | Focus of Development for Models. Existing Features: languages (Java, Scala, PMML, JSON); data mining tools (2 PMML producers validated: R, KNIME). Planned Features: additional languages (Python); additional data mining tools (9 PMML producers validated, adding RapidMiner, SAS Enterprise Miner, Spark MLlib, IBM SPSS, Salford Systems, TIBCO, Angoss). Rationale: support a commonly used language for developing custom data mining models; consume models produced by widely used data mining tools to reduce adoption barriers.
33 The Future of Continuous Decisioning | Focus of Development for Ease of Use. Existing Features: simplified installation process; auto migration from older versions; support for changing Hadoop stack; developer utilities (PMML test, clean utility, JSON validation); UI prototype complete. Planned Features: IDE integration; model testing & validation; model management; administration/monitoring; rule/model development. Rationale: simplify the model change management process for system admins and power users; enable easier management and monitoring of production systems; reduce model development complexity by allowing developers to use standard tools, and reduce time to market.
34 The Future of Continuous Decisioning | Focus of Development for Integrations & Interoperability. Existing Features: databases; logs; data warehouses; flat files; social media data such as Twitter; streaming data sources (Kafka, MQ); NoSQL DBs; files; applications; reporting tools; HDFS. Planned Features: Avro format; HTTP/UDP end points; Flume; Graph DB; Elasticsearch. Rationale: support a wider range of data sources and transports used in the enterprise; consume and produce standard formats to enable better interoperability across systems; provide the ability to integrate with advanced analytical tools to populate data/decisions in real time.
35 The Future of Continuous Decisioning | Focus of Development for Performance & Scalability. Existing Features: leverage Big Data stack; parallel processing of models & messages; compiling to JVM; DAG. Planned Features: optimized DAG; distributed and hierarchical cache utilization; support for logical partitions; SLA aware. Rationale: enable dynamic scaling by decoupling parallel processing from physical partitioning; support SLA-critical applications by providing priority-based execution; dynamically adjust execution pipelines to optimize performance; optimize performance of Kamanja on Hadoop storage.
36 Focus of development to increase ease of use. Key Developments: Production (UI for Model Management; UI for Administration/Monitoring); Development (UI for Rule/Model Development; IDE integration; model testing & validation). Rationale: simplify the model change management process for system admins and power users; enable easier management and monitoring of production systems; reduce model development complexity by allowing developers to use standard tools, and reduce time to market.
37 Focus of development to expand model support. Key Developments: languages (Python); data mining tools (additional PMML producers validated: RapidMiner, SAS Enterprise Miner, Spark MLlib, IBM SPSS, Salford Systems, TIBCO, Angoss). Rationale: support a commonly used language for developing custom data mining models; consume models produced by widely used data mining tools to reduce adoption barriers.
38 Kamanja Roadmap | Overview & Timeline. Timeline columns: Q2'16, Q3'16, Q4'16, Q1'17. Tracks: Ease of Use; Administration; Model Support; Performance/Security; Adapters; Model Management. Items, in source order: Ease of Use - UI; Model Adapters - incorporate new features & validate compatibility - prereqs; validate PMML producers & model types (where possible); model validation - trial run, test before committing & validating; multi-tenancy: basic - process (input, output, storage)/client isolation - 1.4; adding new tenants - 1.4; resource isolation - Q2; SLA - Q4; statistics for accounting purpose - Q3; general security; role-based security for API access; model security; integrate with Flume - data flow smoothly; Resource Manager integration (YARN, etc.) - Q3.
39 Kamanja Roadmap | Near Term Release Objectives. Columns: – April; Release 1.5 – June; Release 1.6 – July; Backlog Priority 1; Backlog Priority 2; Backlog Priority 3. Rows, with items in source order:
Ease of Use: no shutdown changes (configuration only); new sample models & tutorials; reduced package size; support metadata backward compatibility; provide tool for configuration JSON file validation; upgrade cluster without shutdown; validate compatibility with new versions of Zookeeper, Kafka (0.9 API including Kerberos authentication), Scala, etc. and incorporate new features; validate compatibility with new versions of Zookeeper, Kafka, Scala, etc. and incorporate new features; enhanced error and exception handling of adapters; IDE integration for manual model development; support data mining tools that don’t produce PMML, such as R scripts; integrate with Flume.
Administration: basic multi-tenancy (process & client isolation); message-level tracking; failure notification as message; additional APIs to expose monitoring data; tool to purge container data; multi-tenancy (add new tenant & resource isolation); UI for Administration; internal memory usage statistics; enable dynamic resource management; Resource Manager integration (YARN, etc.); multi-tenant accounting support.
Model Support: new message structure and unification; transformations; data formats (JSON, CSV); execution DAG; identify incompatibilities with various PMML producers (e.g. SAS); concepts; data formats (Avro); Python support; identify incompatibilities with various PMML producers; custom models developed in DSL (develop DSL support).
Performance/Security: basic cache coherency; advanced cache coherency; hierarchical cache; logical partitions; investigate authentication and authorization support for Kamanja; role-based security for API access; model security; implement WAL; field-level encryption and tokenization; cost-aware model execution in cloud environments; SLA-aware execution; statistics-based dynamic sizing; native code generation to leapfrog performance.
Adapters: message bindings; Smart File Adapter; HDFS Adapter; JDBC Storage Adapter; support more than one storage adapter; Elastic connector.
Model Management: associate models to events/messages; UI for Model Management; reporting & audit support of data/decisions; develop new models based on templates & global parameters; Rule/Model Builder UI; model validation; A/B testing.
40 The Journey to Continuous Decisioning: Big Data for Detection and Investigation. Taking action at the soonest possible moment, based on all incoming events and historical data, leveraging the most sophisticated predictive models. Continuous Decisioning; Near Time Alerting; real-time alerts; fully integrated with workflow; ability to learn, iterate models easily; Data Analysis; streamlined data pipeline; custom alerts; Data Management; retrospective analysis; canned reports; retain the logs; meet compliance requirements.
41 Continuous Decisioning: Kamanja. Correlate Historical Reference (Data Lake, Scoring Models); Real-time Ingestion (Structured / Unstructured); Notification (Visualization, Alert, Case Mgmt and Action); Event Decisioning (Business Rules, Pattern Analysis).
42 LigaData transforms how enterprises leverage their data using open source Big Data technologies. WHO WE ARE: founded by former Yahoo executives; led data technology innovation at Yahoo; grew a $3 billion business by detecting signals in Web data; 40+ patents in data technologies. WHAT WE DO: take our deep data experience and focus on the challenges of the financial services industry; implement continuous decisioning for threat detection and compliance on a robust open source technology stack.
43 Open source technologies are causing seismic shifts in the big data ecosystem. Need input into positioning of tech in the areas below: Big Data; Real-Time Processing; Data Science. Technologies named: R, Python, TensorFlow.
44 Kamanja | Current Feature Set (OPTION 2). Ease of Use: simplified installation process; auto migration from older versions; support for changing Hadoop stack; developer utilities (PMML test, clean utility, JSON validation); UI prototype. Integrations & Interoperability: databases; logs; flat files; social media data such as Twitter; streaming data sources (Kafka, MQ); NoSQL DBs; applications; reporting tools; HDFS. Models: languages (Java, Scala, PMML, JSON); data mining tools (2 PMML producers validated: R, KNIME). Enterprise Readiness: basic statistics; metadata change audits; dual-role security (admin vs. non-admin). Performance & Scalability: leverage Big Data stack; parallel processing of models & messages; compiling to JVM; DAG.