Discover how we're achieving our vision of unifying data in real time by exploring our team's thoughts, ideas, and experiences.
Windowing queries in stream processing play a pivotal role in handling time-series data. This post shows how to use streaming-friendly window functions in queries using only ANSI SQL, and why ordering matters for getting correct, efficient results on streaming datasets.
The Sliding Window Hash Join (SWHJ) algorithm joins potentially infinite streams while preserving their order. It builds hash tables incrementally and stores only the build-side rows that fall within a sliding window, so streams can be processed efficiently without materializing all the data.
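As a rough illustration of the idea above, here is a minimal Python sketch of a hash join that keeps only build-side rows inside a time window. It is a simplification and not DataFusion's implementation: the function name `sliding_window_hash_join`, the `(side, ts, key, value)` event layout, and the rule that a probe row matches only build rows at most `window` time units older than it are all assumptions made for this example.

```python
from collections import defaultdict, deque


def sliding_window_hash_join(events, window):
    """Join probe rows against recent build rows with the same key.

    `events` is an iterable of (side, ts, key, value) tuples in
    non-decreasing `ts` order, where side is "build" or "probe".
    (This input shape is hypothetical, chosen for illustration.)
    """
    table = defaultdict(deque)  # key -> build rows (ts, value), oldest first
    arrival = deque()           # (ts, key) of build rows in arrival order

    for side, ts, key, value in events:
        # Evict build rows that have fallen out of the sliding window.
        while arrival and arrival[0][0] < ts - window:
            _, old_key = arrival.popleft()
            table[old_key].popleft()
            if not table[old_key]:
                del table[old_key]

        if side == "build":
            # Extend the hash table incrementally as build rows arrive.
            table[key].append((ts, value))
            arrival.append((ts, key))
        else:
            # Probe: emit matches in the order the build rows arrived.
            for build_ts, build_value in table.get(key, ()):
                yield (key, build_ts, build_value, ts, value)
```

Because rows are evicted as soon as they leave the window, memory use is bounded by the number of build rows per window rather than by the length of the stream.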
The Count-Min Sketch uses hash functions to map streamed items into a 2D counter array. As the stream is processed, each item is hashed once per row and the corresponding counters are incremented; an item's frequency is then estimated by taking the minimum of its counters across all rows.
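The update and query steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the class name `CountMinSketch`, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are illustrative choices, not taken from the post.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width   # counters per row
        self.depth = depth   # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum is the
        # tightest available estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )
```

A wider table lowers the chance of collisions and more rows lower the chance that every counter for an item is inflated, at the cost of more memory.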
Sliding window joins for stream processing bring DataFusion a step closer to unified data processing. Find out how to join streams efficiently with less memory and how to buffer both join sides intelligently.
On September 27, 2024, the first Apache DataFusion Meetup in Europe took place in Belgrade, bringing together nearly 70 attendees. The event was held at the Microsoft office, where speakers showcased their work and shared insights on how they are utilizing DataFusion in various projects.
Apache DataFusion has been elevated to a Top-Level Project by the Apache Software Foundation, underscoring its maturity and essential role in data processing. This recognition reflects DataFusion's rapid growth, robust performance, and active community engagement.
Our CEO Ozan recently joined an episode of the Streaming Caffeine podcast — Streaming Caffeine E10: Ozan from Synnada, about Arrow Datafusion, Rust, Databases, SQL, AI — to discuss our perspective on DataFusion and the future of data infrastructure.
This post explores how pioneering teams at Airbnb, Uber, and Apache Arrow overcame the data chasm, followed by an introduction to the Lean Data Stack paradigm as a way to build durable, economical, and flexible data systems.
The data ecosystem is rapidly expanding and fragmenting, posing integration challenges industry-wide. Many companies fall into a "data chasm", needing to abruptly scale their tools from 2-4 to 15-20, exacerbating complexity. Some organizations pioneered methodologies to cross this chasm and extract value. How can others navigate this data chasm?
This blog post explores the AI/ML landscape, comparing it to a gold rush where the focus is on providing "specialized electricity" in the form of computing, storage, and networking resources.
The world of AI and data is undergoing a rapid transformation. Enabling technologies are maturing to a level where we should be able to deploy action-capable, autonomous intelligent agents at scale. But what will it take to make this a reality?