
Stream Processing Fundamentals

  • Why streaming? We often find ourselves performing a long computation but want something of value while the computation is still running, e.g. a report about customer activity or a continuously updated machine learning model.
  • In 2012, the project incorporated Spark Streaming and its DStreams API. DStreams enables stream processing using high-level functional operators like map and reduce.
  • Hundreds of organizations now use DStreams in production for large real-time applications, often processing terabytes of data per hour. Much like the Resilient Distributed Dataset (RDD) API, however, the DStreams API is based on relatively low-level operations on Java/Python objects that limit opportunities for higher-level optimization.
  • Thus, in 2016 Spark added Structured Streaming, built directly on DataFrames, which supports both rich optimizations and significantly simpler integration with other DataFrame and Dataset code.
  • If you are interested in DStreams, many other books cover that API, including several dedicated books on Spark Streaming only, such as Learning Spark Streaming by Francois Garillot and Gerard Maas (O’Reilly, 2017).
  • Much as with RDDs versus DataFrames, however, Structured Streaming offers a superset of the majority of the functionality of DStreams, and will often perform better due to code generation and the Catalyst optimizer.

What is Stream Processing?

  • the act of continuously incorporating new data to compute a result. In stream processing, the input data is unbounded and has no beginning or end; it simply forms a series of events that arrive at the stream processing system (e.g. credit card transactions, clicks on a website, readings from IoT devices)
  • Users/applications can compute various queries over this stream of events, outputting multiple versions of the result as the stream runs, or keeping the result up to date in an external “sink” system such as a key-value store.
  • In batch processing, by contrast, the computation runs on a fixed-size (static) input dataset and the result is computed once.
  • Although stream and batch sound like different strategies, in practice they are often employed together, e.g. streaming applications often need to join input data against a dataset written periodically by a batch job, and the output of streaming jobs is often files or tables that are queried in batch jobs.
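
To make this concrete, below is a minimal sketch of a Structured Streaming query in PySpark. The socket source on localhost:9999 and the word-count logic are illustrative assumptions (start a source with `nc -lk 9999` to try it); the point is that the input is unbounded and the engine keeps emitting updated versions of the result to a sink.

```python
# Minimal sketch of a streaming query: unbounded input, continuously updated result.
# Assumes a text source on localhost:9999 (e.g. started with `nc -lk 9999`).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-intro").getOrCreate()

# readStream returns an unbounded DataFrame; new lines keep arriving.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame operations used in batch jobs define the streaming computation.
word_counts = (lines
               .select(explode(split(lines.value, " ")).alias("word"))
               .groupBy("word")
               .count())

# Each trigger outputs an updated version of the result to the console "sink".
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```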

Stream Processing Use Cases

Notifications and Alerting

  • given a series of events, a notification or alert should be triggered if a certain event or sequence of events occurs.
  • this is not limited to autonomous or pre-programmed decision making; rather, it is about notifying a (typically human) counterpart of some action to be taken on the fly

Real-time reporting

  • Real-time dashboarding
  • monitoring some resource, load, uptime, etc.

Incremental ETL

  • reduce the latency companies must endure while getting information into a data warehouse.
  • Spark batch jobs are often used for ETL workloads. Using Structured Streaming, these jobs can incorporate new data within seconds, enabling users to query it faster downstream.
  • NOTE: here data needs to be processed exactly once and in a fault-tolerant manner.
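
A rough sketch of what incremental ETL can look like with Structured Streaming; the input path, schema, and warehouse/checkpoint locations are placeholders. The checkpoint plus the file sink is what gives the exactly-once, fault-tolerant behaviour mentioned above.

```python
# Sketch of incremental ETL: continuously land incoming JSON files as Parquet.
# Paths and schema below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("amount", DoubleType())
          .add("time", TimestampType()))

# New files dropped into the directory are picked up as they arrive.
events = (spark.readStream
          .schema(schema)
          .json("/data/incoming/events/"))

# Checkpointing plus the file sink provides exactly-once output, so each record
# lands in the warehouse table exactly once despite machine failures.
query = (events.writeStream
         .format("parquet")
         .option("path", "/warehouse/events/")
         .option("checkpointLocation", "/checkpoints/events/")
         .start())
```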

Update data to serve in real time

  • compute data that gets served interactively by another application
  • e.g. the output of a web analytics product such as Google Analytics might continuously track visits to each page and use a streaming system to keep these counts up to date

Real-time decision making

  • analyzing new inputs and responding to them automatically using business logic. An example is credit card transaction fraud detection.

Online machine learning

  • training a model on a combination of streaming and historical data from multiple users.

Advantages of Stream Processing

  • enables lower latency - when an application needs to respond quickly (in minutes, seconds, or milliseconds)
  • more efficient at updating a result than repeated batch jobs, because the engine automatically incrementalizes the computation.

Challenges of Stream Processing

  • Assume the following output events from a sensor:
{value: 1, time: "2017-04-07T00:00:00"}
{value: 2, time: "2017-04-07T01:00:00"}
{value: 5, time: "2017-04-07T02:00:00"}
{value: 10, time: "2017-04-07T01:30:00"}
{value: 7, time: "2017-04-07T03:00:00"}
  • Notice how the last record arrived out of order and later than the others. Responding to a single specific value (5) is much easier than responding to a specific sequence of values in the stream (2 -> 5 -> 10).
  • Solving the above problem requires the stream processor to keep a memory of the past (state). If the sensor outputs millions of records, that state can become very large; a sketch of a stateful streaming aggregation follows the list below.
  • To summarize, the challenges of stream processing include:
    • Processing out-of-order data based on application timestamps (also called event time)
    • Maintaining large amounts of state
    • Supporting high-data throughput
    • Processing each event exactly once despite machine failures
    • Handling load imbalance and stragglers
    • Responding to events at low latency
    • Joining with external data in other storage systems
    • Determining how to update output sinks as new events arrive
    • Writing data transactionally to output systems
    • Updating your application’s business logic at runtime
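
As a small illustration of the state and fault-tolerance challenges, here is a sketch of a running aggregation over the sensor stream shown above (the input path and output mode are illustrative choices). The engine must keep one row of state per key across all events seen so far and checkpoint it so failures neither lose nor double-count records.

```python
# Sketch: a running aggregate over the sensor stream above.
# The engine keeps state (one row per key) for all data seen so far and
# checkpoints it, so that failures do not lose or double-count events.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, TimestampType

spark = SparkSession.builder.appName("stateful-count").getOrCreate()

schema = (StructType()
          .add("value", IntegerType())
          .add("time", TimestampType()))

sensor = (spark.readStream
          .schema(schema)
          .json("/data/sensor/"))               # illustrative input path

running = sensor.groupBy("value").count()       # state grows with distinct keys

query = (running.writeStream
         .outputMode("update")                  # emit only changed keys each trigger
         .format("console")
         .option("checkpointLocation", "/checkpoints/sensor-count/")
         .start())
```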

Stream Processing Design Points

Record-at-a-Time vs Declarative APIs

  • the simplest way to design a streaming API is to pass each event to the application and let it react using custom logic; this approach is useful when the application wants full control over data processing.
  • however, the downside is that most of the complications listed above become the application’s problem: you need to maintain state, respond to failures, and so on.
  • Newer streaming systems provide declarative APIs, where the application specifies what to compute, but not how to compute it in response to each new event or how to recover from failures. e.g. Spark’s original DStreams library works this way, offering map, reduce, and filter operations on streams.
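
A sketch of the declarative style using Spark’s built-in rate source for test data (the filter and aggregation are arbitrary examples): the application only states what to compute, while the engine figures out how to update the result per event and how to recover state after a failure.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("declarative-api").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Declarative: specify what to compute (a filter and an aggregate);
# the engine decides how to update state per event and how to recover from failures.
evens = (stream
         .filter(expr("value % 2 = 0"))
         .groupBy()
         .count())

(evens.writeStream
 .outputMode("complete")
 .format("console")
 .start())
```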

Event Time vs Processing Time

For systems with declarative APIs, a second design question is whether the system natively supports event time. Event time is the idea of processing data based on the timestamps inserted into each record at the source, as opposed to the time when the record is received by the streaming system (processing time).

  • When using event time, records may arrive late and out of order (e.g. due to network latency). If the system derives important information or patterns from the order of records, this must be handled carefully or you may be in big trouble.
  • If the application only processes local events whose arrival order matches their occurrence, it may not need sophisticated event-time processing.

Because of this, many declarative systems, including Structured Streaming, have “native” support for event time integrated into all their APIs, so that these concerns can be handled automatically across your whole program.
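
For example, with the sensor records shown earlier, a sketch of event-time processing in Structured Streaming might look like the following (the input path and the window/watermark sizes are illustrative). The grouping key is the timestamp embedded in each record, and the watermark bounds how late out-of-order records may arrive so the engine can eventually drop old state.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window
from pyspark.sql.types import StructType, IntegerType, TimestampType

spark = SparkSession.builder.appName("event-time").getOrCreate()

schema = (StructType()
          .add("value", IntegerType())
          .add("time", TimestampType()))        # event time, embedded in each record

sensor = spark.readStream.schema(schema).json("/data/sensor/")   # placeholder path

# Group by the timestamp *inside* the record (event time), not arrival time.
# The watermark tells the engine how late data may arrive, so out-of-order
# records are folded into the right window and old state can be dropped.
hourly = (sensor
          .withWatermark("time", "2 hours")
          .groupBy(window("time", "1 hour"))
          .count())

(hourly.writeStream
 .outputMode("update")
 .format("console")
 .option("checkpointLocation", "/checkpoints/hourly/")
 .start())
```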

Continuous vs Micro-Batch Execution

Latency determines the usage of one over the other. Continuous execution provides lower latency (at least when the input rate is low), while micro-batching adds some latency because each batch must be collected before it is processed.

As the input rate scales up, we are better off using micro-batching to amortize the overhead of per-record processing.

Micro-batch systems wait to accumulate small batches of input data (e.g. 500 ms worth), then process each batch in parallel on a cluster of machines; they can often achieve high throughput per node thanks to the optimizations of batch systems.
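
A sketch of how the two execution modes are selected in Structured Streaming via triggers (values are illustrative; continuous processing is an experimental mode that only supports a subset of queries and sinks):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("triggers").getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Micro-batch execution: accumulate ~500 ms of input, then process it as a batch.
micro = (stream.writeStream
         .trigger(processingTime="500 milliseconds")
         .format("console")
         .start())

# Continuous execution (experimental): process records as they arrive,
# checkpointing progress every second, for lower latency at low input rates.
continuous = (stream.writeStream
              .trigger(continuous="1 second")
              .format("console")
              .start())
```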

Spark’s Streaming APIs

Spark has two APIs for streaming

The DStream API

  • the original streaming API of Spark, first released in 2012. Many companies use it in production. Interaction with RDD code, such as joins with static datasets, is also natively supported in Spark Streaming.
  • DStreams has several limitations, though:
    • it is based purely on Java/Python objects and functions, as opposed to the richer concept of structured tables in DataFrames and Datasets, which limits the engine’s opportunity to perform optimizations
    • the API is purely based on processing time; to handle event-time operations, applications need to implement them on their own
    • DStreams can only operate in a micro-batch fashion and exposes the duration of micro-batches in some parts of its API, making it difficult to support alternative execution modes
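
For reference, a minimal sketch of the legacy DStreams API in Python (assuming the same localhost:9999 text source as earlier): note how it works on RDDs of plain objects and bakes the micro-batch duration into the API.

```python
# Sketch of the legacy DStreams API (processing-time based, micro-batch only;
# deprecated in recent Spark versions in favor of Structured Streaming).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstreams-wordcount")
ssc = StreamingContext(sc, batchDuration=1)    # micro-batch interval is part of the API

lines = ssc.socketTextStream("localhost", 9999)

# Functional operators over RDDs of Java/Python objects -- no table/schema concept,
# so the engine cannot apply Catalyst-style optimizations.
counts = (lines.flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

counts.pprint()
ssc.start()
ssc.awaitTermination()
```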

Structured Streaming

  • a higher-level streaming API built on Spark’s Structured APIs
  • available in all the languages those APIs support
  • native support for event-time data; all of its windowing operators automatically support it
  • Structured Streaming doesn’t use a separate API from DataFrames.
  • As another example, Structured Streaming can output data to standard sinks usable by Spark SQL, such as Parquet tables, making it easy to query your stream state from another Spark application.
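
A sketch of that unification: the same DataFrame logic (the paths, schema, and column names below are placeholders) can run as a batch job or as a streaming job, with the streaming version writing to a Parquet sink that other Spark applications can query with ordinary Spark SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("unified-api").getOrCreate()

schema = (StructType()
          .add("page", StringType())
          .add("status", IntegerType()))

def successful_visits(df):
    # The same DataFrame logic works on batch and streaming inputs alike.
    return df.filter("status = 200").select("page")

# Batch version: read a static snapshot once.
batch = successful_visits(spark.read.schema(schema).json("/data/visits/"))

# Streaming version of the identical logic, written to a Parquet sink
# that other Spark applications can query with ordinary Spark SQL.
stream = successful_visits(spark.readStream.schema(schema).json("/data/visits/"))
(stream.writeStream
 .format("parquet")
 .option("path", "/warehouse/visits/")
 .option("checkpointLocation", "/checkpoints/visits/")
 .start())
```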