Data cardinality is a measure of the uniqueness of values in a dataset, column, or field. Low-cardinality data has a small number of distinct values; high-cardinality data has many.
Cardinality, along with data volume, distribution, and other statistical properties, has a significant impact on choices like data compression, indexing, and partitioning, and ultimately on query performance.
For example, in columnar formats, low-cardinality columns benefit from dictionary and run-length encoding, while high-cardinality columns see little gain from either and are usually better served by general-purpose compression.
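A minimal sketch of both encodings in plain Python (illustrative only; columnar formats such as Parquet implement these natively and far more efficiently):

```python
def dictionary_encode(values):
    """Map each distinct value to a small integer code.

    Effective when cardinality is low: the dictionary stays small
    and each value is stored as a compact integer.
    """
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs.

    Effective when data has long runs, e.g. a low-cardinality
    column stored in sorted order.
    """
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

states = ["CA", "CA", "CA", "NY", "NY", "TX"]
print(dictionary_encode(states))  # ({'CA': 0, 'NY': 1, 'TX': 2}, [0, 0, 0, 1, 1, 2])
print(run_length_encode(states))  # [['CA', 3], ['NY', 2], ['TX', 1]]
```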
Estimating cardinality helps database optimizers choose good query plans, for example join orders, join algorithms, and access paths. Columnar analytics engines like Apache Arrow DataFusion leverage cardinality statistics when planning queries over large datasets.
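To see why the estimates matter, consider a toy planning decision (hypothetical code, not any engine's actual API): a hash join should build its in-memory table on the smaller input, and a bad cardinality estimate flips that choice.

```python
def choose_build_side(est_left_rows, est_right_rows):
    """Toy hash-join planning decision: build the in-memory hash
    table on the side the optimizer estimates to be smaller."""
    return "left" if est_left_rows <= est_right_rows else "right"

# Accurate estimate: build on the small 10k-row side.
print(choose_build_side(10_000, 5_000_000))      # left
# A badly overestimated left side flips the choice, and the engine
# would hash 5 million rows instead of 10 thousand.
print(choose_build_side(50_000_000, 5_000_000))  # right
```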
Low cardinality enables high compression ratios, reduces index size, and improves query engine performance through better predicate filtering and aggregation. High-cardinality data is harder to compress, index, and process efficiently.
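A quick way to see the compression effect is to run a general-purpose compressor over synthetic low- and high-cardinality data (illustrative Python using the standard-library zlib module; the data is invented for the example):

```python
import random
import zlib

random.seed(0)
n = 100_000

# Low cardinality: five distinct tokens repeated many times.
low = "".join(random.choice(["AA", "BB", "CC", "DD", "EE"]) for _ in range(n))
# High cardinality: effectively unique random hex values, same total size.
high = "".join(f"{random.getrandbits(32):08x}" for _ in range(n // 4))

for name, data in [("low", low), ("high", high)]:
    raw = data.encode()
    ratio = len(raw) / len(zlib.compress(raw))
    print(f"{name}-cardinality compression ratio: {ratio:.1f}x")
```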
Data architectures optimize storage and access methods based on cardinality. For example, sort and partition keys typically lead with low-cardinality columns; sorting or partitioning on high-cardinality columns is avoided because it produces many small, poorly compressible groups.
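A toy sketch of the partitioning trade-off, with invented data: a low-cardinality key yields a few well-filled partitions, while a high-cardinality key scatters rows into one tiny partition each.

```python
from collections import defaultdict

rows = [
    {"user_id": f"u{i}", "country": ["US", "DE", "JP"][i % 3]}
    for i in range(9)
]

def partition_by(rows, key):
    """Group rows into partitions keyed by one column's value."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

print(len(partition_by(rows, "country")))  # 3 partitions, 3 rows each
print(len(partition_by(rows, "user_id")))  # 9 partitions, 1 row each
```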
Understanding data cardinality is key to properly configuring and optimizing databases. It provides critical input for physical data design, covering storage formats, compression, indexing, partitioning, materialized aggregates, and more, all aimed at maximizing performance. All major database systems optimize heavily based on collected or user-specified cardinality statistics.
Cardinality is measured by analyzing the schema, sampling actual values, tracking distinct counts during query execution, and collecting feedback statistics from prior queries.
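As an illustration of tracking distinct counts cheaply, here is a minimal K-Minimum-Values (KMV) estimator in Python. The function name and parameters are ours for the example; production engines typically use related probabilistic sketches such as HyperLogLog:

```python
import hashlib
import heapq

def kmv_estimate(values, k=256):
    """K-Minimum-Values sketch: estimate the number of distinct
    values from the k smallest normalized hashes seen."""
    heap = []     # max-heap (negated values) of the k smallest distinct hashes
    seen = set()  # hashes currently held in the heap
    for v in values:
        digest = hashlib.md5(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # hash mapped to [0, 1)
        if h in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            seen.add(h)
        elif h < -heap[0]:  # smaller than the current k-th smallest hash
            seen.discard(-heapq.heappushpop(heap, -h))
            seen.add(h)
    if len(heap) < k:
        return len(heap)       # fewer than k distinct values seen: exact count
    return (k - 1) / -heap[0]  # standard KMV estimator

values = (i % 10_000 for i in range(100_000))  # 10,000 distinct values
print(round(kmv_estimate(values)))             # roughly 10,000 (within a few %)
```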
Low-cardinality examples: state, status, boolean, country, title.
High-cardinality examples: email, name, GUID, number, timestamp.
Low-cardinality dimensions allow better aggregation and pre-summarization. High-cardinality facts are harder to pre-aggregate without excessive storage overhead.
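For example, a rollup over two low-cardinality dimensions collapses many fact rows into few aggregate rows (a minimal sketch with invented data):

```python
from collections import defaultdict

# Fact rows: (country, status, amount). Both dimensions are
# low cardinality, so the rollup stays small.
facts = [
    ("US", "shipped", 10.0), ("US", "shipped", 5.0),
    ("US", "pending", 7.5), ("DE", "shipped", 3.0),
    ("DE", "shipped", 8.0), ("DE", "pending", 2.0),
]

rollup = defaultdict(float)
for country, status, amount in facts:
    rollup[(country, status)] += amount

# 6 fact rows summarize to 4 aggregate rows; with millions of facts
# the reduction would be far more dramatic.
for key, total in sorted(rollup.items()):
    print(key, total)
```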
Estimating cardinality is hard in the presence of skewed distributions, correlated columns, and multi-column combinations. Bad estimates lead to poor query plans, so adaptive query processing and machine-learning-based estimators are increasingly used to compensate.
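One common failure mode is the independence assumption: multiplying per-column distinct counts overestimates the combined cardinality when columns are correlated. A small illustration in Python (the data is invented for the example):

```python
# City determines country, so the columns are perfectly correlated.
cities = ["Paris", "Berlin", "Tokyo", "Osaka"]
countries = {"Paris": "FR", "Berlin": "DE", "Tokyo": "JP", "Osaka": "JP"}

rows = [(city, countries[city]) for city in cities * 1000]

card_city = len({c for c, _ in rows})             # 4
card_country = len({c for _, c in rows})          # 3
independence_estimate = card_city * card_country  # 12 combinations assumed
actual = len(set(rows))                           # 4: city implies country

print(independence_estimate, actual)  # 12 vs 4: a 3x overestimate
```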