Glossary

The Synnada glossary explains key terms and concepts in data science, machine learning, AI, and analytics. Learn about popular ML algorithms, data engineering, statistics, and more from our comprehensive tech glossary.

Algorithms/Data Structures

Columnar Memory Format

Columnar memory format stores data in columns rather than rows, enabling better compression and scan-friendly reads optimized for analytical queries.

Read more ->
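
To make the layout concrete, here is a minimal Rust sketch contrasting a row-oriented record with a column-oriented batch; the struct and field names are illustrative only.

```rust
// Row-oriented layout: all fields of one record are adjacent in memory.
#[allow(dead_code)]
struct RowRecord {
    id: u64,
    price: f64,
    quantity: u32,
}

// Column-oriented layout: all values of one column are adjacent, so a
// scan touches only the bytes it needs and compresses well.
#[allow(dead_code)]
struct ColumnBatch {
    ids: Vec<u64>,
    prices: Vec<f64>,
    quantities: Vec<u32>,
}

fn total_revenue(batch: &ColumnBatch) -> f64 {
    // Reads two dense arrays; the `ids` column is never loaded.
    batch
        .prices
        .iter()
        .zip(&batch.quantities)
        .map(|(p, q)| *p * (*q as f64))
        .sum()
}

fn main() {
    let batch = ColumnBatch {
        ids: vec![1, 2, 3],
        prices: vec![9.99, 4.50, 2.00],
        quantities: vec![2, 1, 10],
    };
    println!("revenue = {}", total_revenue(&batch));
}
```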

Data Cardinality

Data cardinality refers to the uniqueness of data values in a particular column or dataset, which significantly affects data storage, processing, and query performance.

Read more ->
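
A quick way to see cardinality is to count distinct values. A minimal Rust sketch using exact counting with a HashSet (large systems often estimate instead, e.g. with HyperLogLog):

```rust
use std::collections::HashSet;

// Cardinality = number of distinct values in a column.
fn cardinality<T: std::hash::Hash + Eq>(column: &[T]) -> usize {
    column.iter().collect::<HashSet<_>>().len()
}

fn main() {
    let country = ["US", "DE", "US", "TR", "DE"]; // low cardinality
    let user_id = [101, 102, 103, 104, 105]; // high cardinality (all unique)
    println!("country: {}", cardinality(&country)); // 3
    println!("user_id: {}", cardinality(&user_id)); // 5
}
```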

Skip List

A skip list is a probabilistic data structure that provides fast search and insertion over an ordered sequence by using a hierarchy of linked lists to skip over elements.

Read more ->

B-tree

A B-tree is a self-balancing tree data structure optimized for fast indexed key lookups and writes on disk-based storage.

Read more ->
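
Rust's standard library ships a B-tree-backed map, so the two access patterns B-trees excel at can be shown directly:

```rust
use std::collections::BTreeMap;

fn main() {
    // BTreeMap is backed by a B-tree: keys stay sorted, and lookups,
    // inserts, and ordered range scans are all O(log n).
    let mut index: BTreeMap<u32, &str> = BTreeMap::new();
    index.insert(42, "order-42");
    index.insert(7, "order-7");
    index.insert(19, "order-19");

    // Point lookup by key.
    assert_eq!(index.get(&19), Some(&"order-19"));

    // Range scan in key order -- the access pattern B-trees make cheap.
    for (k, v) in index.range(10..=42) {
        println!("{k} -> {v}");
    }
}
```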

Distributed Hash Table

A distributed hash table (DHT) is a decentralized distributed system that partitions a key space across nodes and uses hash functions to assign ownership and locate data.

Read more ->
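
A minimal Rust sketch of the underlying idea, hash-based key partitioning; real DHTs such as Chord or Kademlia layer ring or XOR-distance routing on top of this:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Simplest key-space partitioning: hash the key, take it modulo the
// node count. Every node can locate any key without a central index.
fn owner(key: &str, num_nodes: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_nodes
}

fn main() {
    for key in ["user:1", "user:2", "user:3"] {
        println!("{key} -> node {}", owner(key, 4));
    }
    // Caveat: changing num_nodes remaps almost every key, which is
    // exactly what consistent hashing (next entry) is designed to avoid.
}
```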

Consistent Hashing

Consistent hashing is a distributed hash technique that minimizes redistribution of keys when servers are added or removed, used in systems needing scalability and high availability.

Read more ->
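
A minimal Rust sketch of a hash ring with virtual nodes; the node names, hasher choice, and virtual-node count are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut h = DefaultHasher::new();
    t.hash(&mut h);
    h.finish()
}

// Ring positions -> node names. Each node appears at several
// positions ("virtual nodes") to even out the key distribution.
struct Ring {
    positions: BTreeMap<u64, String>,
}

impl Ring {
    fn new(nodes: &[&str], vnodes: u32) -> Self {
        let mut positions = BTreeMap::new();
        for node in nodes {
            for i in 0..vnodes {
                positions.insert(hash_of(&format!("{node}#{i}")), node.to_string());
            }
        }
        Ring { positions }
    }

    // A key is owned by the first node at or after its hash,
    // wrapping around to the start of the ring. Adding or removing
    // a node only moves the keys adjacent to its positions.
    fn owner(&self, key: &str) -> &str {
        let h = hash_of(&key);
        self.positions
            .range(h..)
            .next()
            .or_else(|| self.positions.iter().next())
            .map(|(_, node)| node.as_str())
            .unwrap()
    }
}

fn main() {
    let ring = Ring::new(&["a", "b", "c"], 16);
    for key in ["alpha", "beta", "gamma"] {
        println!("{key} -> {}", ring.owner(key));
    }
}
```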

Bloom Filter

A Bloom filter is a probabilistic data structure used to test set membership that is space-efficient compared to storing the full set.

Read more ->
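
A minimal Rust sketch; real implementations use packed bitsets and size the bit count and hash count from a target false-positive rate, both arbitrary here:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// A Bloom filter sets k bit positions per item; membership tests can
// yield false positives but never false negatives.
struct BloomFilter {
    bits: Vec<bool>,
    k: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, k: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], k }
    }

    // Derive k positions by seeding one hasher k different ways.
    fn positions<'a, T: Hash>(&'a self, item: &'a T) -> impl Iterator<Item = usize> + 'a {
        (0..self.k).map(move |seed| {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            item.hash(&mut h);
            (h.finish() as usize) % self.bits.len()
        })
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let positions: Vec<usize> = self.positions(item).collect();
        for p in positions {
            self.bits[p] = true;
        }
    }

    fn maybe_contains<T: Hash>(&self, item: &T) -> bool {
        self.positions(item).all(|p| self.bits[p])
    }
}

fn main() {
    let mut bf = BloomFilter::new(1024, 3);
    bf.insert(&"apple");
    assert!(bf.maybe_contains(&"apple"));   // definitely inserted
    assert!(!bf.maybe_contains(&"banana")); // almost certainly absent
}
```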

FNV Hash

The FNV hash is a fast, simple non-cryptographic hash function that combines multiplication by a large prime with XOR operations to achieve good distribution.

Read more ->
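
The 64-bit FNV-1a variant is short enough to show in full:

```rust
// FNV-1a, 64-bit: XOR each byte into the state, then multiply by the
// FNV prime. Wrapping multiplication gives the implicit mod 2^64.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a(data: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn main() {
    println!("{:016x}", fnv1a(b"hello"));
}
```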

MurmurHash

MurmurHash is a family of fast non-cryptographic hash functions optimized for hash tables and CPU cache performance.

Read more ->
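
The full algorithm is longer, but its well-known 64-bit finalizer (fmix64, from MurmurHash3) illustrates the multiply-and-xorshift style in a few lines:

```rust
// The 64-bit finalizer ("fmix64") from MurmurHash3: alternating
// XOR-shifts and multiplications that avalanche the input bits.
fn fmix64(mut h: u64) -> u64 {
    h ^= h >> 33;
    h = h.wrapping_mul(0xff51_afd7_ed55_8ccd);
    h ^= h >> 33;
    h = h.wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    h ^= h >> 33;
    h
}

fn main() {
    // Flipping one input bit flips roughly half of the output bits.
    println!("{:016x}", fmix64(1));
    println!("{:016x}", fmix64(2));
}
```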

xxHash

xxHash is an extremely fast non-cryptographic hash algorithm focused on speed and efficiency for checksums and hash tables.

Read more ->

Interval Arithmetic

Interval arithmetic is a method of computing with sets of numbers rather than single values, representing uncertainty in calculations and accounting for rounding errors.

Read more ->
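
A minimal Rust sketch of interval addition and multiplication; production interval libraries additionally round endpoints outward so results stay sound under floating-point rounding error:

```rust
#[derive(Clone, Copy, Debug)]
struct Interval {
    lo: f64,
    hi: f64,
}

impl Interval {
    fn new(lo: f64, hi: f64) -> Self {
        Interval { lo, hi }
    }

    // [a,b] + [c,d] = [a+c, b+d]
    fn add(self, other: Interval) -> Interval {
        Interval::new(self.lo + other.lo, self.hi + other.hi)
    }

    // [a,b] * [c,d] = [min of endpoint products, max of endpoint
    // products], since signs can flip which endpoint is extreme.
    fn mul(self, other: Interval) -> Interval {
        let products = [
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        ];
        Interval::new(
            products.iter().copied().fold(f64::INFINITY, f64::min),
            products.iter().copied().fold(f64::NEG_INFINITY, f64::max),
        )
    }
}

fn main() {
    let x = Interval::new(1.0, 2.0);
    let y = Interval::new(-3.0, 4.0);
    println!("x + y = {:?}", x.add(y)); // [-2, 6]
    println!("x * y = {:?}", x.mul(y)); // [-6, 8]
}
```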

Probabilistic Data Structures

Probabilistic data structures are space- and time-efficient data structures that use randomized algorithms to provide approximate answers to queries with bounded, tunable error guarantees.

Read more ->

CAP Theorem

The CAP theorem states that a distributed data system can provide at most two of three guarantees: consistency, availability, and partition tolerance.

Read more ->

Hash Functions

Hash functions are algorithms that map data of arbitrary size to fixed-size values called hashes in a deterministic, one-way manner for purposes like data integrity and database lookup.

Read more ->

Collision Resistance

Collision resistance is the property of a cryptographic hash function that makes it computationally infeasible to find two different inputs mapping to the same output hash.

Read more ->

Count-Min Sketch

A Count-Min Sketch is a probabilistic data structure that estimates item frequencies in data streams using sublinear space.

Read more ->
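
A minimal Rust sketch with an arbitrary depth and width; real deployments size both from target error bounds:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// d rows of w counters; each row uses a differently-seeded hash.
// Estimates never undercount: they equal true count + collision noise.
struct CountMinSketch {
    counters: Vec<Vec<u64>>, // d x w
}

impl CountMinSketch {
    fn new(depth: usize, width: usize) -> Self {
        CountMinSketch { counters: vec![vec![0; width]; depth] }
    }

    fn index<T: Hash>(&self, row: usize, item: &T) -> usize {
        let mut h = DefaultHasher::new();
        row.hash(&mut h);
        item.hash(&mut h);
        (h.finish() as usize) % self.counters[row].len()
    }

    fn add<T: Hash>(&mut self, item: &T) {
        for row in 0..self.counters.len() {
            let col = self.index(row, item);
            self.counters[row][col] += 1;
        }
    }

    // Take the minimum across rows: the row with the least collision
    // noise gives the tightest (still one-sided) estimate.
    fn estimate<T: Hash>(&self, item: &T) -> u64 {
        (0..self.counters.len())
            .map(|row| self.counters[row][self.index(row, item)])
            .min()
            .unwrap_or(0)
    }
}

fn main() {
    let mut cms = CountMinSketch::new(4, 256);
    for _ in 0..5 {
        cms.add(&"error");
    }
    cms.add(&"warn");
    println!("error ~= {}", cms.estimate(&"error")); // >= 5
    println!("warn  ~= {}", cms.estimate(&"warn"));  // >= 1
}
```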

Data Pruning

Data pruning refers to database techniques that eliminate irrelevant data during query processing to minimize resource usage and improve performance.

Read more ->
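
One common form is min/max ("zone map") pruning, sketched minimally in Rust below; the chunk layout and predicate are illustrative:

```rust
// Per-chunk min/max statistics ("zone maps") let a scan skip chunks
// that cannot possibly satisfy a predicate.
struct Chunk {
    min: i64,
    max: i64,
    values: Vec<i64>,
}

// Count rows with lo <= value <= hi, reading only chunks whose
// [min, max] range overlaps the predicate's range.
fn count_between(chunks: &[Chunk], lo: i64, hi: i64) -> usize {
    chunks
        .iter()
        .filter(|c| c.max >= lo && c.min <= hi) // prune non-overlapping chunks
        .map(|c| c.values.iter().filter(|&&v| lo <= v && v <= hi).count())
        .sum()
}

fn main() {
    let chunks = vec![
        Chunk { min: 0, max: 9, values: (0..10).collect() },
        Chunk { min: 10, max: 19, values: (10..20).collect() },
        Chunk { min: 20, max: 29, values: (20..30).collect() }, // skipped entirely
    ];
    println!("{}", count_between(&chunks, 5, 15)); // 11
}
```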

Data Processing

ETL Data Processing

ETL (Extract, Transform, Load) data processing refers to the steps used to collect data from various sources, cleanse and transform it, and load it into a destination system or database.

Read more ->

DataFrame

A DataFrame is a two-dimensional tabular data structure with labeled columns and rows, used for data manipulation and analysis in data science and machine learning workflows.

Read more ->

Distributed Tracing

Distributed tracing is a method used to profile and monitor complex distributed systems by instrumenting apps to log timing data across components, letting operators analyze bottlenecks and failures.

Read more ->

Incremental Processing

Incremental processing involves continuously processing and updating results as new data arrives, avoiding having to recompute results from scratch each time.

Read more ->
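
A minimal Rust sketch with a running mean: each arriving value does O(1) work instead of triggering a full recomputation over the history:

```rust
// Maintain an aggregate incrementally: O(1) work per new value,
// instead of rescanning the whole history on every update.
struct RunningMean {
    count: u64,
    sum: f64,
}

impl RunningMean {
    fn new() -> Self {
        RunningMean { count: 0, sum: 0.0 }
    }

    fn update(&mut self, value: f64) {
        self.count += 1;
        self.sum += value;
    }

    fn mean(&self) -> f64 {
        self.sum / self.count as f64
    }
}

fn main() {
    let mut avg = RunningMean::new();
    for v in [10.0, 20.0, 60.0] {
        avg.update(v); // result stays current as each value arrives
        println!("mean so far: {}", avg.mean());
    }
}
```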

Online Analytical Processing (OLAP)

Online analytical processing (OLAP) refers to the technology that enables complex multidimensional analytical queries on aggregated, historical data for business intelligence and reporting.

Read more ->

Lambda Architecture

Lambda architecture is a big data processing pattern that combines batch and real-time stream processing to get the benefits of both high throughput and low-latency querying.

Read more ->

Unified Processing

Unified processing refers to data pipeline architectures that handle batch and real-time processing using a single processing engine, avoiding the complexities of hybrid systems.

Read more ->

Batch Processing

Batch processing is the execution of a series of programs or jobs on accumulated batches of data without user interaction, enabling efficient processing of high data volumes.

Read more ->

Kappa Architecture

Kappa architecture is a big data processing pattern that uses stream processing for both real-time and historical analytics, avoiding the complexity of hybrid stream and batch processing.

Read more ->

Query Execution

Outer Joins

An outer join returns all rows from one or both tables in a join operation, including those without matching rows in the other table. It preserves rows even when no related matches exist.

Read more ->

Inner Joins

An inner join is a type of join operation used in relational databases to combine rows from two tables based on a common column, returning only the rows that have matching values in both tables.

Read more ->
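
A minimal Rust sketch covering both this entry and the outer-join entry above, using the hash-join strategy (build a hash table on one side, probe it with the other); the tables are toy data:

```rust
use std::collections::HashMap;

fn main() {
    let users = [(1, "alice"), (2, "bob"), (3, "carol")];
    let orders = [(101, 1), (102, 1), (103, 3)]; // (order_id, user_id)

    // Build side: user_id -> name.
    let by_id: HashMap<i32, &str> = users.iter().copied().collect();

    // Inner join: emit only orders whose user_id matches a user.
    for (order_id, user_id) in orders {
        if let Some(name) = by_id.get(&user_id) {
            println!("inner: order {order_id} by {name}");
        }
    }

    // Left outer join (users LEFT JOIN orders): every user appears,
    // with a NULL-like None when no order matches ("bob" here).
    for (user_id, name) in users {
        let user_orders: Vec<i32> = orders
            .iter()
            .filter(|(_, uid)| *uid == user_id)
            .map(|(oid, _)| *oid)
            .collect();
        if user_orders.is_empty() {
            println!("outer: {name} -> None");
        } else {
            println!("outer: {name} -> {user_orders:?}");
        }
    }
}
```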

User Defined Functions (UDF)

A user-defined function (UDF) is a programming construct that allows developers to create custom functions in a database, query language or programming framework to extend built-in functionality.

Read more ->

Execution Framework

An execution framework is a distributed system that automates resource allocation, scheduling, fault tolerance, and the execution of large-scale computational jobs.

Read more ->

Memory Management

Memory management refers to the allocation, deallocation and organization of computer memory resources for running programs and processes efficiently.

Read more ->

Query Optimization

Query optimization involves rewriting and transforming database queries to execute more efficiently by performing cost analysis to find faster query plans.

Read more ->

Partitioning

Database partitioning refers to splitting large tables into smaller, independent pieces called partitions stored across different filegroups, drives or nodes.

Read more ->

Distributed Execution

Distributed execution refers to techniques to execute database queries efficiently across clustered servers or nodes, dividing work to utilize parallel resources.

Read more ->

Parallel Execution

Parallel execution refers to techniques for speeding up database query processing by leveraging multiple CPUs, servers, or resources concurrently.

Read more ->
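
A minimal Rust sketch of the idea: partition the input and aggregate the partitions on separate threads, then combine the partial results:

```rust
use std::thread;

// Split the data into chunks and aggregate each chunk on its own
// thread -- the core idea behind parallel query execution.
fn parallel_sum(data: &[i64], num_threads: usize) -> i64 {
    // Ceiling division so every element lands in some chunk.
    let chunk_size = (data.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<i64>()))
            .collect();
        // Combine the per-thread partial sums.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<i64> = (1..=1_000_000).collect();
    println!("{}", parallel_sum(&data, 4)); // 500000500000
}
```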

Query Execution

Query execution is the process of carrying out the actual steps to retrieve results for a database query as per the generated execution plan.

Read more ->

Data Storage and Sources

SQL Compatibility

SQL compatibility refers to the degree to which a database or analytics system supports the SQL query language standard, enabling the use of standard SQL syntax and features.

Read more ->

Graph Database

A graph database stores data in a graph structure with nodes, edges and properties to represent and query relationships between connected data entities.

Read more ->

Key-value Store

A key-value store is a type of NoSQL database optimized for storing, retrieving and managing associative arrays of key-value pairs.

Read more ->

Data Warehouse

A data warehouse is a centralized data management system designed to enable business reporting, analytics, and data insights.

Read more ->

Message Broker

A message broker is a software system that facilitates communication between distributed applications and services by transferring messages in a reliable and scalable manner.

Read more ->

Time-series Database (TSDB)

A time-series database (TSDB) is a database engineered and optimized for handling time-series data, where each data point contains a timestamp.

Read more ->

Relational Database

A relational database is a type of database that stores and provides access to data organized in tables with defined relationships between entities.

Read more ->

Data Processing Engine

A data processing engine is a distributed software system designed for high-performance data transformation, analytics, and machine learning workloads on large volumes of data.

Read more ->

Data Lake

A data lake is a scalable data repository that stores vast amounts of raw data in its native formats until needed.

Read more ->

Document Store

A document store is a database that manages collections of JSON, XML, or other hierarchical document formats, providing querying and indexing on document contents.

Read more ->

Spatial Database

A spatial database is a database optimized to store, query and manipulate geographic information system (GIS) data like location coordinates, topology, and associated attributes.

Read more ->

RDF Store

An RDF store is a graph database optimized for storing and querying RDF triple data to represent facts and relationships.

Read more ->

Data Orchestrator

A data orchestrator is a middleware tool that facilitates the automation of data flows between diverse systems such as data storage systems (e.g. databases), data processing engines (e.g. analytics engines) and APIs (e.g. SaaS platforms for data enrichment).

Read more ->

Vector Database

A vector database is designed to efficiently store and query vector representations (embeddings) of data for applications like search, recommendations, and AI.

Read more ->
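
A minimal Rust sketch of the core query, nearest neighbor by cosine similarity, over toy embeddings; production systems use approximate indexes (e.g. HNSW) instead of this linear scan:

```rust
// Cosine similarity between two vectors of equal dimension.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 3-dimensional embeddings keyed by document id.
    let index = [
        ("doc-a", vec![0.9, 0.1, 0.0]),
        ("doc-b", vec![0.0, 1.0, 0.2]),
        ("doc-c", vec![0.8, 0.2, 0.1]),
    ];
    let query = vec![1.0, 0.0, 0.0];

    // Brute-force scan: return the most similar stored vector.
    let best = index
        .iter()
        .max_by(|(_, a), (_, b)| {
            cosine_similarity(a, &query).total_cmp(&cosine_similarity(b, &query))
        })
        .unwrap();
    println!("nearest: {}", best.0); // doc-a
}
```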

Search Engine (Database)

A search engine database is designed to store, index, and query full text content to enable fast text search and retrieval.

Read more ->

Core Tech

Apache Arrow DataFusion

Apache DataFusion is an extensible, high-performance data processing framework in Rust, designed to efficiently execute analytical queries on large datasets. It utilizes the Apache Arrow in-memory data format.

Read more ->
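
A minimal sketch of querying a CSV file with SQL through DataFusion, assuming the datafusion and tokio crates; the file name and schema are hypothetical, and API details vary across DataFusion versions:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a CSV file as a table, then query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_csv("orders", "orders.csv", CsvReadOptions::new())
        .await?;

    let df = ctx
        .sql("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
        .await?;
    df.show().await?; // executes the plan over Arrow record batches

    Ok(())
}
```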

Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data, specifying a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations.

Read more ->
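
A minimal sketch using the Rust arrow crate: building a nullable Int32 column and applying a compute kernel to it (crate APIs vary by version):

```rust
use arrow::array::{Array, Int32Array};
use arrow::compute;

fn main() {
    // An Arrow array is a contiguous columnar buffer plus a validity
    // bitmap for nulls.
    let a = Int32Array::from(vec![Some(1), None, Some(3)]);
    assert_eq!(a.len(), 3);
    assert_eq!(a.null_count(), 1);

    // Compute kernels operate on whole columns at once.
    println!("sum = {:?}", compute::sum(&a)); // Some(4)
}
```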
