What is unified processing?
Unified processing is an approach to designing big data architectures where the same engine handles both batch and streaming workloads, rather than having separate batch and stream processing systems.
This simplifies the architecture: there is no need to merge results from, or coordinate between, distinct platforms, and a single workload can draw on both batch and stream processing capabilities.
Unified processing is a key enabler of the Kappa architecture pattern. Modern stream processors have adopted unified processing capabilities to replace specialized batch processing engines used in Lambda architecture.
What does unified processing do? How does it work?
In a unified system, a single engine such as Flink or Spark uses the same execution runtime, APIs, and storage layer whether the input is bounded batch data or an unbounded stream.
The engine exposes common abstractions that cover both cases, so the same job logic can be pointed at a stream or at a batch view of the data.
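The core idea can be illustrated without any engine at all. The sketch below is a hedged, minimal analogy in plain Python (not a real Flink or Spark API): one transformation is written once against an iterable abstraction, then applied unchanged to a bounded collection (batch) and to an unbounded generator (stream).

```python
from itertools import islice
from typing import Iterable, Iterator


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """One transformation definition; the input may be bounded or unbounded."""
    total = 0
    for value in events:
        total += value
        yield total  # emit incrementally, stream-style


# Batch view: a bounded input, fully materialized at once.
batch_result = list(running_totals([1, 2, 3, 4]))  # [1, 3, 6, 10]


def sensor_stream() -> Iterator[int]:
    """Stand-in for an unbounded source: never terminates."""
    n = 0
    while True:
        n += 1
        yield n


# Streaming view: same transformation, consumed incrementally.
stream_result = list(islice(running_totals(sensor_stream()), 4))  # [1, 3, 6, 10]
```

Real unified engines apply the same principle at scale: the engine treats a batch as a special case of a stream (one that happens to end), so a single job definition serves both modes.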
Why is it important? Where is it used?
Unified processing reduces complexity compared to hybrid approaches. It improves developer productivity and makes operational management easier.
Use cases include web and mobile analytics, data pipelines, IoT, fraud detection, and other applications that need flexibility between batch and real-time processing. Unified engines can replace the dual pipelines of Lambda architecture and underpin Kappa-style designs.
FAQ
How does unified processing contrast with lambda architecture?
Lambda architecture maintains separate batch and speed layers whose results must be kept consistent and merged at query time. Unified processing eliminates that duplication: there is one code path and one way to process and query all data.
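The duplication problem can be sketched in a few lines of plain Python (hypothetical function names, not any real framework's API): Lambda requires two implementations of the same logic that must stay in sync, while the unified version needs only one.

```python
# Lambda-style: two separate implementations of the "same" computation.
def batch_layer(history: list[int]) -> int:
    return sum(history)              # periodically recomputed over all history


def speed_layer(state: int, event: int) -> int:
    return state + event             # incremental; must mirror batch_layer's logic


# Unified-style: one implementation serves both roles.
def total(events: list[int], state: int = 0) -> int:
    for event in events:
        state += event
    return state


# The single definition reproduces both layers.
assert total([1, 2, 3]) == batch_layer([1, 2, 3])
assert total([3], state=total([1, 2])) == speed_layer(total([1, 2]), 3)
```

In a real system the risk is larger than this toy suggests: the two Lambda layers are often written in different frameworks, so subtle semantic drift between them is a common source of incorrect merged results.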
What are some key unified processing technologies?
Examples include Apache Flink, Apache Spark Structured Streaming, Amazon Kinesis Data Analytics, Google Cloud Dataflow and Azure Stream Analytics.
What are key benefits of unified processing?
Benefits include simplified development, no context switching between batch and streaming semantics, reduced operational complexity, and the freedom to treat the same data as a stream or as a batch.
What are potential downsides of unified processing?
Potential downsides include relative immaturity compared to specialized engines, possible performance or cost penalties from generality, and limitations in very high-scale stream processing scenarios.
When is unified processing appropriate?
Unified approaches excel when:
- Flexibility between batch and stream processing is needed
- Reduced operational complexity is desired
- Tight integration between streaming and historical data is required