Spark Streaming: Fault-tolerant Streaming Computation at Scale
Matei Zaharia, UC Berkeley
September 25, 2013, 11:00 – 12:00 in the Visualization Studio
Many “big data” applications need to act on data arriving in real time. Running these applications at ever-larger scales requires parallel execution platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-streams), that overcomes these challenges. D-streams support a parallel recovery mechanism that improves efficiency over the traditional replication and upstream backup schemes in streaming databases, and also handles stragglers. We show that D-streams can support a rich set of streaming operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, the D-stream model can seamlessly be composed with batch and interactive query models for clusters (e.g. MapReduce), enabling rich applications that combine these modes. We have implemented D-streams in Spark Streaming, an extension to the Spark cluster computing framework.
Matei Zaharia finishing his PhD at UC Berkeley, where he worked with Scott Shenker and Ion Stoica on topics in large-scale data processing and cloud computing. After Berkeley, he will be starting an assistant professor position at MIT. During his PhD, Matei has also been an active open source contributor, becoming a committer on the Apache Hadoop project and starting the Mesos and Spark projects.