1 Use Cases for Governing Hadoop
2 About BIAS Corporation
  Who We Are…
  Founded in 2000
  Distinguished Oracle Leader
  Technology Momentum Award
  Portal Blazer Award
  Titan Awards
  Excellence in Innovation Award
  Management Team is Ex-Oracle
  250 U.S. employees & contractors, 100 India employees, averaging 10+ years of Oracle experience
  Inc. 500|5000 Fastest Growing Private Company in the U.S. for the 7th Time
  Voted Best Place to Work in Atlanta for the 3rd year
  Top 10 Healthiest Workplace in Atlanta Business Chronicle
  33 Oracle Specializations spanning the entire stack
3 About the Speaker
  Kenton Troy Davis
  Senior Director & Enterprise Architect, BIAS Corporation
  Patented work in database security and IoT
  Oracle Alumnus
  Statistician before being labeled a Data Scientist
4 Big Data Growth
  Projected to be Largest Oracle Big Data Appliance Implementation at a Bank
  [Chart, 2014 to 2017: 18 Node POC, 66 Nodes, 192 Nodes]
  2-3x growth currently projected
5 Some real-world use cases at the Bank
  Increase the data points used to profile a customer (WCV 360)
  Use analytics to derive real time offers (RTO) for customers banking at branches
  Minimize the data sprawl and establish a single source of truth (SSOT)
  Replace legacy data warehouses used for transactional inputs
  Establish better management around how data is being consumed and by whom
  Achieve all of the above with a scalable, lower-cost platform that aggregates storage
6 Storage Projections
  Volume by data source (TB):
    Social Media: 23.00
    IT Operational Data: 11.50
    Documentation, Images, Cheque Images (ECM): 57.50
    Third Party Data Sources (700 Sources); Reference/Bureau Quarterly: 50.60
    Bureau: 8.05
  Total Volume (TB):
    2015: 142.57
    2016: 323.94
    2017: 505.31
    2018+: 695.75
  Assumed consistent growth, uncompressed estimates, not including HDFS replication
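The totals on this slide exclude HDFS replication, so the raw disk footprint is larger than the figures shown. A rough sketch of that arithmetic follows. It assumes the HDFS default replication factor of 3 (dfs.replication); the deck does not say which factor the Bank actually used, and the function name is illustrative.

```python
# Logical totals (TB) from the slide: uncompressed, before HDFS replication.
LOGICAL_TB = {"2015": 142.57, "2016": 323.94, "2017": 505.31, "2018+": 695.75}

# Assumption: HDFS default replication factor (dfs.replication = 3).
REPLICATION_FACTOR = 3


def raw_capacity_tb(logical_tb, replication=REPLICATION_FACTOR):
    """Return the raw disk (TB) needed to store `replication` copies of the data."""
    return round(logical_tb * replication, 2)


raw = {year: raw_capacity_tb(tb) for year, tb in LOGICAL_TB.items()}

for year, tb in raw.items():
    print(f"{year}: {tb} TB raw")
```

Compression would pull these numbers back down, so treat the result as an upper bound on data blocks only, not a sizing recommendation.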
7 Data Lake Architecture at the Bank
8 Data Wrangling Challenge
  What happens after the POCs actually work?
  What happens when internal adoption of Hadoop occurs faster than anticipated?
  Prevent the Data Lake from becoming a Data Swamp
  Encourage consumers to collaborate via a shared data catalog
  Focus even more on data cleansing and preparation - HDFS schema-on-read encourages naïve ingestion
  Glue the Apache ecosystem and vendor tools together by linking governance to enterprise security
  ‘Operationalize’ all of the above
9 Data Governance Whiteboard
  [Diagram labels: Ingestion, Consumption, Zone 1, Zone 2, Zone 3]
  Introspection, Discovery
  Business Catalog, Metadata, Data Lifecycle Management
  Compliance: Authentication, Authorization, Auditing; Encryption / Masking
10 Data Governance – Introspection and Discovery
  [Diagram labels: Ingestion, Consumption]
  Introspection
    Automate the parsing of unstructured and semi-structured feeds
    Infer data types
    Detect sensitive data (e.g. PII, PCI, HIPAA)
    Enable data stewards to interact with and adjust the process
  Discovery
    Search with faceted navigation
    Enable sandbox, ad-hoc queries
    Customize dashboards and reports
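Two of the introspection steps on this slide, inferring data types and detecting sensitive data, can be sketched in a few lines. This is an illustration only, not the tooling used at the Bank: it flags candidate card numbers (PANs) with a digit-run regex plus the Luhn checksum, and infers a column type by trying progressively looser casts. All function names are hypothetical.

```python
import re

# Candidate PANs: standalone runs of 13 to 19 digits (illustrative pattern).
PAN_CANDIDATE = re.compile(r"\b\d{13,19}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pans(text):
    """Return digit runs in `text` that look like card numbers."""
    return [m.group() for m in PAN_CANDIDATE.finditer(text) if luhn_valid(m.group())]


def infer_type(values):
    """Infer 'int', 'float', or 'string' for a column of raw string values."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    return "string"
```

A Luhn match is only a hint that a field may hold a PAN. In the workflow the slide describes, a data steward would still confirm or adjust the resulting tag.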
11 Data Governance – Smart Data Cataloging
  [Diagram labels: Business Catalog, Metadata, Data Lifecycle Management]
  Track lineage
  Audit and report
  Maintain chain of custody
  Tag PII and PAN data
  Base tags on resource, location, or time
  Better to use graph databases here
  Aggregated search is crucial
  Tags should trigger encryption, masking, and access rules
  Time-based usage tracking important for subpoenas and SOX enforcement
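The point that tags should trigger encryption, masking, and access rules can be illustrated with a small lookup. This is a minimal sketch under stated assumptions: the tag names, control names, and the mapping itself are hypothetical and do not reflect the configuration of any particular catalog product.

```python
# Hypothetical mapping from catalog tags to the controls they should trigger.
TAG_POLICIES = {
    "PAN": {"encrypt_at_rest", "mask_on_read"},
    "PII": {"mask_on_read", "restrict_to_need_to_know"},
}


def controls_for(tags):
    """Return the union of controls triggered by an object's tags."""
    controls = set()
    for tag in tags:
        # Tags with no policy entry trigger no controls.
        controls |= TAG_POLICIES.get(tag, set())
    return controls


# Usage: a column tagged as both PAN and PII picks up all three controls.
print(sorted(controls_for(["PAN", "PII"])))
```

Keeping the policy keyed on tags rather than on table names means a newly ingested dataset inherits the right controls as soon as it is tagged.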
12 Lineage Tracking Example #1
13 Lineage Tracking Example #2
14 Data Governance – Compliance
  [Diagram labels: Authentication, Authorization, Auditing; Encryption / Masking]
  Develop AAA policies using both tags and resources
  Integrate with Enterprise security
    1. Active Directory and LDAP
    2. Key Management Service (KMS)
  Enforce separation of duties
15 Data Governance – Challenges
  Holistic solutions are still evolving and require plugins to various Hadoop features (e.g. HDFS abstraction is rapidly maturing beyond Hive).
  Hortonworks example:
    Apache Atlas
    Apache Falcon
    Apache Ranger
    Apache Hive
16 Data Governance – Challenges
  Data Lifecycle Management components become key to taking Hadoop into Production:
    Policies that are responsive to late data handling – tag mutation, rules customization
    Support for rolling upgrades and cleanup
    H/A support via replication
    Lineage tracking that is easily visible to auditors with drilldown and collapse
  Metadata creation and ease of use are still evolving:
    Exchange of metadata in many cases requires custom coding (e.g. REST/JSON)
    Tags against a parent object do not follow derived objects
    Need to still maintain a business taxonomy
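One challenge named on this slide is that tags on a parent object do not follow derived objects. A sketch of what such propagation could look like follows, assuming lineage is available as a simple parent-to-children mapping. The object names and the function are hypothetical and do not represent any governance tool's API.

```python
from collections import deque


def propagate_tags(lineage, tags):
    """Push each object's tags down to everything derived from it.

    lineage: dict mapping a parent object to a list of derived objects.
    tags:    dict mapping an object to its directly assigned set of tags.
    Returns a new dict of effective tags (direct plus inherited).
    """
    effective = {obj: set(t) for obj, t in tags.items()}
    queue = deque(tags)
    while queue:
        parent = queue.popleft()
        for child in lineage.get(parent, []):
            child_tags = effective.setdefault(child, set())
            missing = effective[parent] - child_tags
            if missing:
                child_tags |= missing
                queue.append(child)  # revisit descendants only when tags grew
    return effective


# Usage with hypothetical object names: a PAN tag on the raw table
# reaches the staging table and the downstream mart.
lineage = {"raw.cards": ["stage.cards"], "stage.cards": ["mart.spend"]}
result = propagate_tags(lineage, {"raw.cards": {"PAN"}})
print(result["mart.spend"])
```

Because a child is re-queued only when its tag set grows, the loop terminates even if the lineage graph contains a cycle.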
17 (Appendix) Credit Card Transaction Parties
  [Diagram labels: MasterCard Network, Merchant Acquirer, Card Issuer, Issuing Bank, Acquiring Bank, Merchant Store, Service Provider, Cardholder]
  https://www.suntrust.com/personal-banking/credit-cards
18 (Appendix) PCI-DSS Data Security Standard V3.2
  Mask the Primary Account Number (PAN) such that at most the first six digits and the last four digits are displayed.
  If a full unmasked PAN needs to be persisted, then it must be saved in encrypted form at rest.
  Documented procedures must exist for key management processes used for strong cryptography – e.g. for backup, key storage, key rotation (section 3.6 sub controls), key access, etc.
  Principle of least privilege (section 7) applies by limiting data access according to which business groups ‘need to know’.
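The first-six, last-four display rule above is simple enough to show directly. This is a minimal sketch, not the masking UDF used at the Bank: the function name, the choice of mask character, and the 13 to 19 digit length check are assumptions for illustration.

```python
def mask_pan(pan, mask_char="*"):
    """Mask a PAN so that only the first six and last four digits remain visible."""
    # Drop separators such as spaces or dashes before masking.
    digits = "".join(ch for ch in pan if ch.isdigit())
    if not 13 <= len(digits) <= 19:
        raise ValueError("expected a 13 to 19 digit PAN")
    hidden = len(digits) - 10  # everything between the first six and last four
    return digits[:6] + mask_char * hidden + digits[-4:]


# Usage with a well-known test card number.
print(mask_pan("4111 1111 1111 1111"))
```

Masking covers display only. Per the second point on the slide, a full PAN that is persisted still has to be encrypted at rest.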
19 (Appendix) Column Masking
  Assign Sentry privileges to view Java User-Defined Function (UDF)
    USE ETL_STAGE_{source_hive_database};
    CREATE VIEW PII_MASKED_EXAMPLE AS
    SELECT mask_ccn_udf(credit_card_number) AS ccn, name, balance, region
    FROM ETL_STAGE_VIEW_{source_hive_database}.{Table}
    WHERE state = 'VA';
20 (Appendix) HiveServer2 Hook
  [Diagram label: MapReduce]
21 (Appendix) Apache Sentry
  [Diagram labels: Active Directory, Users, Groups, Roles, Privileges, Metadata Hub, Sentry Policy Store, Hive Warehouse]
  Actions controlled for: server, database, table, view, column, and metadata tag
22 (Appendix) Oracle Big Data SQL
  [Diagram labels: Exadata, Infiniband, BDA, Smart Scan, Hive Warehouse, External Table, Schema-on-read]
  Query Franchising
  CREATE TABLE … ORGANIZATION EXTERNAL (TYPE oracle_hive);
  Data Redaction to transform data on-the-fly (e.g. for credit-card masking): DBMS_REDACT.ADD_POLICY
  Virtual Private Database context predicates for row-level security: DBMS_RLS.ADD_POLICY
24 --Talk about how there are pockets of resistance to change in legacy architectures & legacy mindsets, but there are also groups (like Corp Marketing & Fraud) that are primed to be early adopters of Big Data, and how we have tried to empower those users early in our adoption
25 Contact Us Kenton Davis On LinkedIn