Pravega Storage Reimagined for a Streaming World

1 Pravega Storage Reimagined for a Streaming WorldSrikant...
Author: Arleen Todd
0 downloads 0 Views

1 Pravega Storage Reimagined for a Streaming WorldSrikanth Satya & Tom Kaitchuck Dell EMC Unstructured Storage

2 Streaming is DisruptiveHow do you shrink to zero the time it takes to turn massive volumes of data into information and action? Stateful processors born for streaming, like Apache Flink, are disrupting how we think about data computing … We think the world needs a complementary technology … to similarly disrupt storage. No problem! Just process data as it arrives, quickly, and in a continuous and infinite fashion Demands disruptive systems capabilities Ability to treat data as continuous and infinite Ability to dynamically scale ingest, storage, and processing in coordination with data arrival volume Ability to deliver accurate results processing continuously even with late arriving or out of order data

3 Introducing Pravega StreamsA new storage abstraction – a stream – for continuous and infinite data Named, durable, append-only, infinite sequence of bytes With low-latency appends to and reads from the tail of the sequence With high-throughput reads for older portions of the sequence Coordinated scaling of stream storage and stream processing Stream writes partitioned by app key Stream reads independently and automatically partitioned by arrival rate SLO Scaling protocol to allow stream processors to scale in lockstep with storage Enabling system-wide exactly once processing across multiple apps Streams are ordered and strongly consistent Chain independent streaming apps via streams Stream transactions integrate with checkpoint schemes such as the one used in Flink

4 In Place of All This … New Data Analytics Storage Batch ProcessingSlow, Accurate Messaging Storage Stateless Real-Time Processing Fast, Approximate Archive Storage Old Data

5 … Just Do This! Pravega FlinkStreaming Storage Fast, Accurate Stateful Stream Processing New & Old Data Pravega Flink Each component in the combined system – writers, streams, readers, apps – is independently, elastically, and dynamically scalable in coordination with data volume arrival rate over time. Sweet!

6 Pravega Streams + FlinkWriters scale based on app configuration; stream storage elastically and independently scales based on aggregate incoming volume of data Flink utilizes a stream’s transactional writes to extend exactly once processing across multiple, chained apps Raw Stream First Flink App Cooked Stream Next Flink App Segment Worker Segment Worker Social, IoT, … Writers Segment Worker Sink Sink Segment Worker Segment Worker Protocol coordination between streaming storage and stream processor to dynamically scale up and down the number of segments and Flink workers based on load variance over time

7 And It’s Just the Beginning …Enabling a new generation of distributed middleware reimagined as streaming infrastructure Ingest Buffer & Pub/Sub Persistent Data Structures Search Streaming Analytics Pravega Streams Cloud-Scale Storage

8 How Pravega Works Architecture & System Design

9 Pravega Architecture GoalsAll data is durable Data is replicated and persisted to disk before being acknowledged Strict ordering guarantees and exactly once semantics Across both tail and catch-up reads Client tracks read offset, Producers use transactions Lightweight, elastic, infinite, high performance Support tens of millions of streams Low (<10ms) latency writes; throughput bounded by network bandwidth Read pattern (e.g. many catch-up reads) doesn’t affect write performance Dynamic partitioning of streams based on load and throughput SLO Capacity is not bounded by the size of a single node

10 Streaming model Pravega StreamFundamental data structure is an ordered sequence of bytes Think of it as a durable socket or Unix pipe Bytes are not interpreted server side This implicitly guarantees order and non-duplication Higher layers impose further structure, e.g. message boundaries M5 M4 M3 M2 M1 Pravega Stream

11 Cartoon API public interface SegmentWriter {/** Asynchronously and atomically write data */ void write(ByteBuffer data); /** Asynchronously and atomically write the data if it can be written at the provided offset */ void write(ByteBuffer data, long atOffset); /** Asynchronously and atomically write all of the data from the provided input stream */ void write(InputStream in); } public interface SegmentReader { long fetchCurrentLength(); /** Returns the current offset */ long getOffset(); /** Sets the next offset to read from */ void setOffset(long offset); /** Read bytes from the current offset */ ByteBuffer read(int length); }

12 Idempotent Append Pravega Stream WriterAppend { } and Assign appendNumber = 7 Writer

13 Idempotent Append Pravega Stream Writer What is appendNumber? 7

14 Idempotent Append Pravega Stream WriterAppend { } and Assign: appendNumber = 7 If and only if appendNumber < 7 Writer

15 Idempotent output Pravega Stream source sink Sink

16 Idempotent output Pravega Stream source sink Sink

17 Architecture overview - WriteClient write(data1) write(data2) Commodity Server Commodity Server Commodity Server Pravega Pravega Pravega Cache Cache Cache SSD SSD SSD HDFS Bookkeeper Bookkeeper Bookkeeper SSD SSD SSD

18 Architecture overview - ReadClient read() Commodity Server Commodity Server Commodity Server Pravega Pravega Pravega Cache Cache Cache SSD SSD SSD HDFS Bookkeeper Bookkeeper Bookkeeper SSD SSD SSD

19 Architecture overview - EvictFiles in HDFS are organized by Stream Segment Read-ahead cache optimizations are employed Commodity Server Commodity Server Commodity Server Pravega Pravega Pravega HDFS Bookkeeper Bookkeeper Bookkeeper

20 Architecture overview - ReadClient read() Commodity Server Commodity Server Commodity Server Pravega Pravega Pravega Cache Cache Cache SSD SSD SSD HDFS Bookkeeper Bookkeeper Bookkeeper SSD SSD SSD

21 Architecture overview - RecoverData is read from Bookkeeper only in the case of node failure Used to reconstitute the cache on the remaining hosts Commodity Server Commodity Server Commodity Server Pravega Pravega Pravega HDFS Bookkeeper Bookkeeper Bookkeeper

22 Performance CharacteristicsFast appends to Bookkeeper Data is persisted durably to disk 3x replicated consistently <10ms Big block writes to HDFS Data is mostly cold so it can be erasure encoded and stored cheaply If data is read, the job is likely a backfill so we can use a large read-ahead A stream’s capacity is not limited by the capacity of a single machine Throughput shouldn’t be either …

23 Scaling: Segment Splitting & MergingStream S Data Keys Segment 0 Segment 1 Segment 2 Segment 6 Segment 5 3 Segment 7 2 Segment 4 Segment 3 1 t0 t1 t2 t3 t4 Time

24 Scaling: Write ParallelismPravega Producers Writers Writer Configuration ka .. kf  S Stream S Stream Segments ss0 ss1 ssn Readers Number of stream segments dynamically changes based on load and SLO 1 Segments are split and merged dynamically without manual intervention 2 Writer configurations do not change when segments are split or merged 3

25 EventWriter API /** A writer can write events to a stream. */public interface EventStreamWriter { /** Send an event to the stream. Event must appear in the stream exactly once */ AckFuture writeEvent(String routingKey, Type event); /** Start a new transaction on this stream */ Transaction beginTxn(long transactionTimeout); }

26 Scaling: Read ParallelismPravega Producers Writers Writer Configuration ka .. kf  S Stream S Stream Segments ss0 ss1 ssn Readers Readers are notified when segments are split or merged enabling reader parallelism to scale in response to the stream scaling

27 EventReader API public interface EventStreamReader extends AutoCloseable { /** Read the next event from the stream, blocking for up to timeout */ EventRead readNextEvent(long timeout); /** * Close the reader. The segments owned by this reader will automatically be * redistributed to the other readers in the group. */ void close() }

28 Conditional Append Pravega Stream Writer Writer Append {0101011}If and only if it would be appended at offset 123 Writer Writer

29 Synchronizer API /** A means to synchronize state between many processes */ public interface StateSynchronizer { /** Gets the state object currently held in memory */ StateT getState(); /** Fetch and apply all updates to bring the local state object up to date */ void fetchUpdates(); /** Creates a new update for the latest state object and applies it atomically */ void updateState(Function> updateGenerator); }

30 Transactional output Pravega Stream source sink

31 Transactional output Pravega Stream source sink

32 EventWriter and Transaction API/** A writer can write events to a stream. */ public interface EventStreamWriter { /** Send an event to the stream. Event must appear in the stream exactly once */ AckFuture writeEvent(String routingKey, Type event); /** Start a new transaction on this stream */ Transaction beginTxn(long transactionTimeout); } public interface Transaction { void writeEvent(String routingKey, Type event) throws TxnFailedException; void commit() throws TxnFailedException; void abort(); }

33 Transactions … Stream 1, Segment 1 Stream 1, Segment 1, TX-230New Item Stream 1, Segment 1 New Item Stream 1, Segment 1, TX-230 New Item

34 Transactions Append {110…101} to Segment-1-txn-1 Create txn-1Writer Pravega Create txn-1 Append {010…111} to Segment-2-txn-1 Pravega Commit txn-1 Commit txn-1 Controller

35 Transactional output Pravega Stream source sink Pravega Stream sink

36 Transactional output Pravega Stream source sink Pravega Stream sink

37 Transactional output Pravega Stream source sink Pravega Stream sink

38 Transactional output Pravega Stream source sink Pravega Stream sink

39 Pravega: Streaming Storage for AllPravega: an open source project with an open community To be Dell EMC World this May 10th Includes infinite byte stream primitive Plus an Ingest Buffer with Pub/Sub built on top of streams And Flink integration! Visit the Dell EMC booth Flink Forward to learn more Contact us at for even more information!

40 Pravega BB-8 Drawing Stop by the Dell EMC booth and enter to winWinner will be chosen after the closing Keynote Must be present to win for the latest news and information on Pravega!

41

42

43 Why a new storage system?Sink connect Source Sink connect Sink connect Sink

44 Why a new storage system?Connector Real Time Exactly once Durability Storage Capacity Notes HDFS No Yes Years Kafka  Source only Yes* (Flushed but not synced) Days Writes are replicated but may not persisted to durable media. (flush.messages=1 bounds this but is not recommended) RabbitMQ Yes* (slowly) Durability can be added with a performance hit Cassandra  Yes* (If updates are idempotent) App developers need to write custom logic to handle duplicate writes. Sockets None

45 Flink storage needs Flink Implications for storage GuaranteeExactly once Exactly once, consistency Latency Very Low Low latency writes (<10ms) Throughput High High throughput Computation model Streaming Streaming model Overhead of fault tolerance mechanism Low Fast recovery Long retention Flow control Natural Data can backlog Capacity not bounded by single host Separation of application logic from fault tolerance Yes Re-reading data provides consistent results License Apache 2.0 Open Source and linkable

46 Shared config public class SharedConfig { public V getProperty(K key); public V putPropertyIfAbsent(K key, V value); public boolean removeProperty(K key, V oldValue); public boolean replaceProperty(K key, V oldValue, V newValue); } https://github.com/pravega/pravega/blob/87b978e2a75e8b e5021e55b255970f8/clients/streaming/src/test/java/com/emc/pravega/state/examples/MembershipSynchronizer.java

47 Smart Workload DistributionPravega Segment Container Segment Container Segment Container ss0 ss1 ss2 ss3 ss4 The hot segment is automatically “split,” and the “child” segments are re-distributed across the cluster relieving the hot spot while maximizing utilization of the cluster’s available IOPs capacity

48 Architecture Pravega Streaming Service Streaming Storage SystemMessaging Apps Real-Time / Batch / Interactive Predictive Analytics Stream Processors: Spark, Flink, … Other Apps & Middleware Streaming Storage System Pravega Design Innovations Zero-Touch Dynamic Scaling Automatically scale read/write parallelism based on load and SLO No service interruptions No manual reconfiguration of clients No manual reconfiguration of service resources Smart Workload Distribution No need to over-provision servers for peak load I/O Path Isolation For tail writes For tail reads For catch-up reads Tiering for “Infinite Streams” Transactions For “Exactly Once” Stream Abstraction Pravega Streaming Service Cache (Rocks) Cloud Scale Storage (HDFS) High-Throughput High-Scale, Low-Cost Low-Latency Storage Auto-Tiering Apache Bookkeeper

49 Pravega Optimizations for Stream ProcessorsDynamically split input stream into parallel logs: infinite sequence, low-latency, durable, re-playable with auto-tiering from hot to cold storage. 1 Support streaming write COMMIT operation to extend Exactly Once processing semantics across multiple, chained applications 3 Stream Processor App State App Logic Worker Output Stream (Pravega) Processor 2nd App Segment Sink Input Stream (Pravega) Segment Social, IoT Producers Worker Segment Coordinate via protocol between streaming storage and streaming engine to systematically scale up and down the number of logs and source workers based on load variance over time 2 Memory-Speed Storage

50 Comparing Pravega and Kafka Design PointsUnlike Kafka, Pravega is designed to be a durable and permanent storage system Quality Pravega Goal Kafka Design Point Data Durability Replicated and persisted to disk before ACK Replicated but not persisted to disk before ACK Strict Ordering Consistent ordering on tail and catch-up reads Messages may get reordered Exactly Once Producers can use transactions for atomicity Messages may get duplicated Scale Tens of millions of streams per cluster Thousands of topics per cluster Elastic Dynamic partitioning of streams based on load and SLO Statically configured partitions Size Log size is not bounded by the capacity of any single node Partition size is bounded by capacity of filesystem on its hosting node Transparently migrate/retrieve data from Tier 2 storage for older parts of the log External ETL required to move data to Tier 2 storage; no access to data via Kafka once moved Performance Low (<10ms) latency durable writes; throughput bounded by network bandwidth Low-latency achieved only by reducing replication/reliability parameters Read pattern (e.g. many catch-up readers) does not affect write performance Read patterns adversely affects write performance due to reliance on OS filesystem cache

51 Attributes Connector Streaming Exactly once DurabilityStorage Capacity HDFS No Yes Years Kafka  Source only Yes* (Flushed but not synced) Days Pravega Yes: Byte oriented and event oriented Yes. With either idempotent producers, or transactions Yes. Always flushed and synced, with low latency. As much as you can fit in your HDFS cluster.

52 Architecture overview - WriteUpdate Metadata / cache 3 ACK 4 Check Metadata 1 Metadata + Cache Record data to Log 2 Apache BookKeeper

53 Architecture overview - ReadCheck Metadata 1 Metadata + Cache Pull from cache 2.a Pull from HDFS 2.b Apache BookKeeper

54 Architecture overview - EvictMetadata + Cache Write contents from cache to HDFS 1 Mark data for removal 2 Apache BookKeeper

55 Architecture overview - RecoverMetadata + Cache Re-populate metadata/cache from Bookkeeper 2 Take ownership of BK Ledger 1 Apache BookKeeper