Why Columnar Databases Are the OGs in Analytical Tasks

aibunny
4 min readApr 5, 2024

--

Working as an analytical engineer in fintech, I’m faced with the challenge of building models and using data engineering architectures that allow real time analytics. To achieve all this I have to use use the best database available that can support stream processing and with the least query time possible even for the most complex queries. Through out my research I have come across multiple real time analytical databases such as clickhouse, kinetica, rockset, parade db and hydra but one thing stands out for all of them and is that they’re all columnar databases.

This begs the question, what are columnar databases? and how are they different from traditional row based databases such as MYSQL and “Just Postgres” (I’ll come back to this later on in one of blogs).

The Row vs. Column Dilemma: Understanding the Divide.

To actually show the technical difference in this two database types let’s get to understand of how row based databases work.

Row-oriented databases : this kind of databases store data row by row, with each row containing all the fields for a single record, similar to how data is arranged in a spreadsheet. Each row represents a single record or entry, and each column represents a specific attribute or field of that record.

Look at this spread sheet with data about three books.

Let’s visualize how data for three books might be stored in a row based db :

| book_id | title  | author   | genre   | publication_year | price |
|---------|--------|----------|---------|------------------|-------|
| 1 | Book A | Author A | Fiction | 2000 | 20.00 |
| 2 | Book B | Author B | Romance | 2015 | 15.00 |
| 3 | Book C | Author C | Mystery | 2020 | 25.00 |

Columnar database: this databases store data in columns instead of rows, with each column containing all the values for a specific field across multiple records.

Let’s visualize how data for three books might be stored in a column based db :

| book_id | 1       | 2       | 3       |
| title | Book A | Book B | Book C |
| author | Author A| Author B| Author C|
| genre | Fiction | Romance | Mystery |
| pub_year| 2000 | 2015 | 2020 |
| price | 20.00 | 15.00 | 25.00 |

You will notice that in the table representation, all the values for the book_id column are stored together, followed by all the values for the title column, then the author column, and so on. This is the columnar storage format, where data is stored column by column, rather than row by row as in traditional row-oriented databases.

Advantages of this approach.

1. Efficient Data Storage:

For example, in the table of books, one column stores all the titles, another stores all the authors, and so forth. This structure allows for highly efficient storage and retrieval of data, especially for analytical queries that often access specific columns rather than entire rows.

2. Optimized Query Performance:

Analytical queries typically involve aggregating or analyzing specific attributes across large datasets. Columnar databases excel in this scenario because they can swiftly access and process only the columns relevant to the query. Unlike row-based databases, which may need to scan entire rows even for simple attribute-based queries, columnar databases can quickly skip irrelevant data, leading to significantly faster query performance.

Example:

Looking at this SQL query


SELECT SUM(price)
FROM books
WHERE publication_year > 2010;

Though the syntax is the same in both the row based databases and columnar databases. This is what happens:

Row-Based Database Query:

  • In a row-based database, the query engine needs to scan through the entire rows of the books table to evaluate the condition publication_year > 2010.
  • Even though we are only interested in the price column, the database engine must read and process all columns for each row that meets the condition.
  • This can result in higher I/O overhead and longer query execution times, especially for large datasets.

Columnar Database Query:

  • In a columnar database, the query engine can efficiently access and process only the price column for instances where publication_year > 2010.
  • Since the data is physically stored by column, the database engine can quickly skip over irrelevant columns, such as title, author, genre, etc., during query execution.
  • As a result, the query performance is significantly faster compared to the row-based approach, especially for analytical queries involving large datasets.

3. Data Compression and Encoding:

Columnar databases employ sophisticated techniques such as data compression and encoding to further enhance storage efficiency and query performance. By compressing similar data values within a column and using encoding schemes tailored to specific data types, columnar databases minimize storage footprint and accelerate data access. This results in reduced storage costs and faster query execution, making them ideal for handling large volumes of analytical data.

4. Parallel Processing:

Another ace up the sleeve of columnar databases is their ability to leverage parallel processing capabilities. Analytical workloads often involve complex queries that can benefit from parallel execution across multiple CPU cores or distributed nodes. Columnar databases are designed to exploit parallelism efficiently, enabling them to tackle high-concurrency analytical tasks with ease and agility.

5. Scalability and Flexibility:

Scalability is a cornerstone of modern data infrastructure, and columnar databases are built to scale seamlessly with growing data volumes and user demands. Whether it’s scaling vertically by adding more resources to a single server or horizontally by distributing data across multiple nodes in a cluster, columnar databases offer flexibility and adaptability to evolving analytical needs.

--

--

No responses yet