Distributed Execution

Updated on: September 17, 2024

What is distributed execution?

Distributed execution is running a database query in parallel across multiple servers or nodes. It allows scaling out query processing over clustered commodity infrastructure.

Databases designed for distributed execution split query stages across nodes holding subsets of partitioned data, coordinating parallelism for faster results.

Distributed execution works together with query execution, parallel execution within a node, and partitioning strategies to optimize performance and scalability across large datasets and clusters. Distributed query engines handle node communication, failure recovery, and other aspects of coordinated execution across nodes.

How does it work?

In a distributed database, a query is parsed and optimized, then executed by:

  • Dispatching plan fragments to the nodes that hold the relevant data.
  • Scanning local data shards and sending subsets to aggregators.
  • Shuffling intermediate data across the cluster interconnect.
  • Combining the results that nodes return from parallel stages.

Distributed execution frameworks manage coordination, data transfers, progress tracking, and parallelism.
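The steps above can be sketched as a small single-process simulation. This is a minimal illustration under stated assumptions, not how any particular engine implements it: threads stand in for nodes, the three-node cluster size is arbitrary, and the names `scan_fragment`, `shuffle`, `final_aggregate`, and `run_query` are hypothetical. The query is assumed to be the equivalent of SELECT region, SUM(amount) GROUP BY region over rows of (order_id, region, amount).

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 3  # hypothetical cluster size


def owner(key: str, n: int) -> int:
    """Deterministically map a grouping key to an aggregator node."""
    return zlib.crc32(key.encode()) % n


def scan_fragment(shard):
    """Plan fragment run on each node: scan the local shard, pre-aggregate."""
    partial = defaultdict(int)
    for _order_id, region, amount in shard:
        partial[region] += amount
    return dict(partial)


def shuffle(partials, n):
    """Route each partial result to the node that owns its grouping key."""
    inboxes = [[] for _ in range(n)]
    for partial in partials:
        for region, subtotal in partial.items():
            inboxes[owner(region, n)].append((region, subtotal))
    return inboxes


def final_aggregate(inbox):
    """Second stage: merge the partial sums received for the owned keys."""
    totals = defaultdict(int)
    for region, subtotal in inbox:
        totals[region] += subtotal
    return dict(totals)


def run_query(rows, n=NUM_NODES):
    # Data is assumed to be partitioned by order_id, not by region,
    # so a shuffle is needed before the final aggregation.
    shards = [[] for _ in range(n)]
    for row in rows:
        shards[row[0] % n].append(row)

    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(scan_fragment, shards))   # dispatch + scan
        inboxes = shuffle(partials, n)                     # shuffle
        finals = list(pool.map(final_aggregate, inboxes))  # aggregate

    result = {}
    for node_result in finals:  # coordinator combines node outputs
        result.update(node_result)
    return result


if __name__ == "__main__":
    rows = [(1, "eu", 10), (2, "us", 5), (3, "eu", 7), (4, "apac", 3), (5, "us", 20)]
    print(run_query(rows))
```

Pre-aggregating on each node before the shuffle means only one subtotal per region per node crosses the interconnect, rather than every row.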

Why does it matter?

Distributing query execution pools the resources of many commodity servers, making scale-out, shared-nothing architectures cost-effective.

It also provides the flexibility to grow compute elastically for larger workloads. Because work is divided across nodes, each server handles only a fraction of the load, which improves performance.

FAQ

When is distributed execution suitable?

Distributed execution helps for:

  • Scaling analytic workloads across clusters.
  • Parallelizing long-running query workflows.
  • Reducing costs by using commodity hardware.
  • Lowering latency by spreading load across servers.
  • Scaling storage for massive datasets.

What are some key design considerations?

Some key design aspects for distributed execution include:

  • Handling data skew across nodes.
  • Tuning the cost of data movement across nodes.
  • Minimizing stragglers and tail latencies.
  • Detecting and working around failures.
  • Avoiding race conditions when nodes coordinate without shared memory.
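As one way to make the data skew consideration concrete, the sketch below measures how unevenly a set of keys lands on nodes under hash partitioning. It is an illustration only: the CRC32-based placement, the `node_for` and `skew_ratio` names, and the max-over-mean metric are assumptions chosen for the example, not the method of any specific database.

```python
import zlib
from collections import Counter


def node_for(key: str, num_nodes: int) -> int:
    """Hypothetical placement rule: hash the key, take it modulo node count."""
    return zlib.crc32(key.encode()) % num_nodes


def skew_ratio(keys, num_nodes: int) -> float:
    """Return the busiest node's row count divided by the mean row count.

    1.0 means perfectly even load; a value equal to num_nodes means
    every row landed on a single node.
    """
    load = Counter(node_for(k, num_nodes) for k in keys)
    sizes = [load.get(i, 0) for i in range(num_nodes)]
    mean = sum(sizes) / num_nodes
    return max(sizes) / mean if mean else 0.0


if __name__ == "__main__":
    # A single hot key sends every row to the same node.
    print(skew_ratio(["hot_customer"] * 90, num_nodes=3))
    # Many distinct keys spread out more evenly.
    print(skew_ratio([f"customer_{i}" for i in range(90)], num_nodes=3))
```

A ratio well above 1.0 suggests that one node will become a straggler for queries grouped or joined on that key.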

What are examples of distributed databases?

Some popular distributed databases utilizing clustered execution are:

  • Amazon Aurora - MySQL and PostgreSQL compatible relational database.
  • Citus - PostgreSQL-based distributed RDBMS.
  • CockroachDB - Scalable, survivable distributed SQL database.
  • Greenplum - MPP analytic database based on PostgreSQL.
  • Apache HAWQ - Hadoop-based distributed SQL query engine.


Related Entries

Query Execution

Query execution is the process of carrying out the actual steps to retrieve results for a database query as per the generated execution plan.

Parallel Execution

Parallel execution refers to techniques for speeding up database query processing by leveraging multiple CPUs, servers, or resources concurrently.

Partitioning

Database partitioning refers to splitting large tables into smaller, independent pieces called partitions stored across different filegroups, drives or nodes.
