Hydra: Open-Source Database Management System
Today, I’d like to introduce you to Hydra, a versatile open-source DBMS. In this article, I will explain how to use Hydra and break down its basic concepts, including loading data and running queries.
So, What Is Hydra?
Hydra is an open-source analytic DBMS that you can find on GitHub. It is based on PostgreSQL, a very popular transactional DBMS, which makes it more versatile than other DBMSs for data analytics. One of the fundamental differences between an analytic and a transactional DBMS is how data is laid out on disk: the former stores each column's values in separate files, while the latter stores whole records together in a single file. And you'll see that with Hydra, things are a bit more interesting.
Hydra’s creators offer cloud solutions for all tastes, depending on your needs.
However, you can also build a DBMS from the GitHub source code. For this article, I took advantage of a 14-day free trial of the cloud platform.
Data Storage Features
Like many analytic DBMSs, Hydra features columnar data storage. This approach allows for better data compression, since each file contains values of only one type. Compression not only reduces the space required but also lets you read the data more efficiently. Analytical queries typically read only a few fields from a table; with columnar storage, unneeded column files are not read at all, unlike with the row-based approach to data storage.
At the same time, for transactional tasks, when you need to read one record in full and replace the values of some fields, row storage is the most efficient: that is, when table records are stored in the file one by one. The interesting thing is that in Hydra, you can choose which of the two options to use. The fact that this DBMS was developed based on PostgreSQL makes it a full-fledged transactional DBMS. At the same time, the ability to store tables in columns adds to Hydra’s versatility.
A Hydra-based database can contain tables with different data storage options, each for its task. For example, the main table with customer actions can store data line by line and be a regular PostgreSQL table, and next to it, on the same server, there can be the same table with columnar storage used by analysts with their heavy queries. Combining different types of tables in one query can be amazing!
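As a quick illustration (a minimal sketch; the table and column names here are hypothetical), a single query can combine the operational row-storage table with an aggregate computed over its columnar copy, just like any two regular tables:

-- Join the row-storage table used for transactions
-- with an aggregate over its columnar copy
SELECT a.customer_id, a.action_time, s.actions_per_customer
FROM customer_actions AS a
INNER JOIN (
    SELECT customer_id, COUNT(*) AS actions_per_customer
    FROM customer_actions_col
    GROUP BY customer_id
) AS s ON s.customer_id = a.customer_id;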
Let’s create a table with column and row storage for a dataset of homes sold in the UK from 1995 till now.
This is what the script looks like for a standard PostgreSQL table with row storage. The same script also works in Hydra.
CREATE TABLE uk_price_paid_row (
    price INT,
    date DATE,
    postcode1 TEXT,
    postcode2 TEXT,
    type SMALLINT,
    is_new SMALLINT,
    duration SMALLINT,
    addr1 TEXT,
    addr2 TEXT,
    street TEXT,
    locality TEXT,
    town TEXT,
    district TEXT,
    county TEXT
);
For the table to store data by columns, you only need to add one clause at the end of the script:
CREATE TABLE uk_price_paid_col (
    price INT,
    date DATE,
    postcode1 TEXT,
    postcode2 TEXT,
    type SMALLINT,
    is_new SMALLINT,
    duration SMALLINT,
    addr1 TEXT,
    addr2 TEXT,
    street TEXT,
    locality TEXT,
    town TEXT,
    district TEXT,
    county TEXT
) USING columnar;
Now one table can serve as the primary table for data storage and transactional load, while the other can be used for analytical tasks.
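Since both tables have identical structure, keeping the analytical copy up to date can be as simple as a periodic INSERT ... SELECT (a sketch, assuming the row table is the source of truth):

-- Copy the accumulated rows into the columnar table
INSERT INTO uk_price_paid_col
SELECT * FROM uk_price_paid_row;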
Transactional tasks are usually referred to as OLTP (Online Transaction Processing), while analytical tasks are characterized as OLAP (Online Analytical Processing). Usually, a DBMS is chosen to match one workload or the other, but Hydra stands out because it is designed for both. The combination of the two workload types is known as HTAP (Hybrid Transactional/Analytical Processing), and this is exactly the kind of workload Hydra was designed for.
Partitioning
Another fascinating feature of Hydra tables is partitioning: splitting a table into several logically independent chunks. Partitioning divides all table entries into groups, which is very convenient for working with a large number of rows at once.
In Hydra, partitioning is conceptually similar to table partitioning in PostgreSQL. When creating a table, you must specify the field to partition by, as well as the type of partitioning; you can learn more in the PostgreSQL documentation. Continuing our previous example, I'm going to partition by the house-sale date, and the script will look like this:
CREATE TABLE uk_price_paid (
    price INT,
    date DATE,
    postcode1 TEXT,
    postcode2 TEXT,
    type SMALLINT,
    is_new SMALLINT,
    duration SMALLINT,
    addr1 TEXT,
    addr2 TEXT,
    street TEXT,
    locality TEXT,
    town TEXT,
    district TEXT,
    county TEXT
) PARTITION BY RANGE(date);
I intentionally didn’t specify the type of table data storage in the script, but more on that later. Each partition will store data for a certain period of date field values. And for this example, I’d like to store data for each year in a different partition. To achieve this, I need to create each partition manually, specifying the date field boundaries; this is a partitioning script for 1995 records:
CREATE TABLE uk_1995 PARTITION OF uk_price_paid
    FOR VALUES FROM ('1995-01-01') TO ('1996-01-01');
The partition-creation script explicitly says that we have created a table that is a partition of the original table. This is another “killer feature” of the Hydra DBMS. Since each partition is a separate table, you can specify a different way to store data for each partition. In the example above, the partition is created with a row-storage type, but you can use the columnar-storage method too. The script for such a query will be as follows:
CREATE TABLE uk_1995 PARTITION OF uk_price_paid
    FOR VALUES FROM ('1995-01-01') TO ('1996-01-01')
    USING columnar;
For the uk_price_paid table, I can create a partition for each year: the current year's partition with row storage and the rest with columnar storage, as shown in the sketch below. I suggest creating the latest partition with a row type, assuming the load on this data will be predominantly transactional. The data for the remaining years will not change; it's needed only to build analytical reports. That also means you don't need to specify the storage method in the parent table, because each partition will have its own.
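A fragment of such a layout might look like this (a sketch for two adjacent years; here 2023 stands in for the current year):

-- Historical year: columnar storage for analytical queries
CREATE TABLE uk_2022 PARTITION OF uk_price_paid
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01')
    USING columnar;

-- Current year: row storage for the transactional load
CREATE TABLE uk_2023 PARTITION OF uk_price_paid
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');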
Hydra's flexibility is very convenient for combining the two workload types, OLTP and OLAP, which is often difficult to achieve in classic columnar DBMSs.
Updating and Deleting Data
It is recommended to partition tables by workload, because updating and deleting records in a table with columnar storage is less efficient.
Currently, data deletion works as follows: when you delete rows with a DELETE query, the records are marked as deleted and will be skipped by subsequent SELECT queries. The actual deletion occurs when you run VACUUM FULL, the garbage-collection command well known to PostgreSQL users. However, this method is costly: running the “garbage collector” in FULL mode completely locks the table.
Updating records in a table with columnar storage is similar to deleting. The UPDATE query marks each updated row as deleted (just like when you delete data with the DELETE query) and adds new versions of records at the end. In essence, an UPDATE query is a DELETE query followed by an INSERT query.
For example, suppose we want to update the price in one record of the table. Hydra would mark the entire existing row as deleted and append a copy of it with the new value of the price field.
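In terms of queries, this behavior looks roughly as follows (a sketch; the WHERE condition and values are illustrative):

-- What we write:
UPDATE uk_price_paid_col
SET price = 250000
WHERE postcode1 = 'SW1A' AND date = '2020-06-01';

-- Inside columnar storage, the old row version is only marked
-- as deleted and a new version is appended. The space is
-- reclaimed later, at the cost of a full table lock:
VACUUM FULL uk_price_paid_col;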
It's also worth keeping in mind the isolation level with which records are deleted and updated. By default, each query runs in its own transaction with the READ COMMITTED isolation level. If this isolation level is not suitable for your update-and-deletion scenario, change the default isolation level or explicitly run the queries that change or delete data inside a transaction with the isolation level you need.
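For instance, to run an update under a stricter isolation level than the default, you can wrap it in an explicit transaction using standard PostgreSQL syntax:

BEGIN ISOLATION LEVEL REPEATABLE READ;

UPDATE uk_price_paid_col
SET price = price * 101 / 100   -- illustrative bulk correction
WHERE date >= '2023-01-01';

COMMIT;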
It's better to delete and update data in tables or partitions created with row storage, and only then move the data to columnar storage.
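One possible way to implement this with partitions is to rebuild last year's partition with columnar storage once the year is over (a sketch; the partition names follow the earlier examples, and this is just one way to do the swap):

-- Detach the now-immutable partition from the parent table
ALTER TABLE uk_price_paid DETACH PARTITION uk_2022;

-- Recreate it with columnar storage and copy the data over
CREATE TABLE uk_2022_col (LIKE uk_2022) USING columnar;
INSERT INTO uk_2022_col SELECT * FROM uk_2022;

-- Attach the columnar copy in place of the old partition
ALTER TABLE uk_price_paid ATTACH PARTITION uk_2022_col
    FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');
DROP TABLE uk_2022;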
How to Run Analytical Queries
Column storage is not the only way to optimize the execution of analytical queries. For each query, the resources of several processor cores are used simultaneously, executing the query in parallel.
Let’s study this algorithm with a simple query to our table uk_price_paid, which was created with columnar storage:
SELECT
    DATE_PART('year', date) AS year,
    ROUND(AVG(price)) AS avg_price
FROM uk_price_paid
GROUP BY year
ORDER BY year ASC;
The query determines the average property value for each year. If the table has been preliminarily partitioned by year, the AVG aggregate function will be calculated simultaneously for each partition in parallel.
The data of each column is divided into vectors for calculation, and each thread processes a different vector. Since the table has been partitioned by year, the computation of each column will likely first be divided between threads by partition. If Hydra has more threads than partitions, the column files will be split into even more vectors, further increasing the parallelism of query execution. This query-execution algorithm greatly reduces response time.
To compare: in PostgreSQL, such a query would be executed in a single thread, which would read all rows (since PostgreSQL stores data by rows) and sequentially calculate the arithmetic mean for each year.
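If you want to see how a particular query is parallelized, the standard EXPLAIN command is available in Hydra just as in PostgreSQL (the exact plan output depends on your data, hardware, and settings):

EXPLAIN (ANALYZE)
SELECT
    DATE_PART('year', date) AS year,
    ROUND(AVG(price)) AS avg_price
FROM uk_price_paid
GROUP BY year
ORDER BY year ASC;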
Data Sources
Hydra also has tools to integrate with other DBMSs like MySQL and PostgreSQL. These DBMSs can serve as data sources for Hydra tables. Data stored in S3 and Google Sheets can also be used as sources.
Each method has its own integration instructions, but they are all very similar because they are implemented using wrapper modules for external data sources; for example, the postgres_fdw module is used to integrate with PostgreSQL.
I won’t dive into the integration instructions in this post, as each method is well described in the documentation.
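For reference, here is roughly what the PostgreSQL integration looks like with postgres_fdw (a sketch; the server name, host, database, and credentials are placeholders):

CREATE EXTENSION postgres_fdw;

CREATE SERVER source_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'source.example.com', port '5432', dbname 'sales');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER source_pg
    OPTIONS (user 'reader', password 'secret');

-- Make the remote tables available locally as foreign tables
IMPORT FOREIGN SCHEMA public
    FROM SERVER source_pg INTO public;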
Query Testing
Let’s finally see what Hydra does in action. I will put its performance to the test on UK home sales data since 1995. You’ve already seen the structure of the table. The table will be partitioned by year, with each partition storing data in columns. The table contains 28 million records.
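One straightforward way to load such a dataset is PostgreSQL's COPY command (the file path here is a placeholder):

COPY uk_price_paid
FROM '/path/to/uk_price_paid.csv'
WITH (FORMAT csv, HEADER true);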
Well, let’s look at the queries themselves!
1) Calculate the total number of listings for all time.
SELECT COUNT(*)
FROM uk_price_paid;
2) Find the average price for each year.
SELECT
    DATE_PART('year', date) AS year,
    ROUND(AVG(price)) AS avg_price
FROM uk_price_paid
GROUP BY year
ORDER BY year ASC;
3) Find the average price for each year in London.
SELECT
    DATE_PART('year', date) AS year,
    ROUND(AVG(price)) AS avg_price
FROM uk_price_paid
WHERE town = 'LONDON'
GROUP BY year
ORDER BY year ASC;
4) Find the top 100 town and district combinations by average price for sales since 2020, considering only combinations with at least 100 listings.
SELECT
    town,
    district,
    COUNT(*) AS c,
    ROUND(AVG(price)) AS avg_price
FROM uk_price_paid
WHERE date >= '2020-01-01'
GROUP BY town, district
HAVING COUNT(*) >= 100
ORDER BY avg_price DESC
LIMIT 100;
5) Show all combinations of house type and duration, with the number of listings for each, in descending order of their total listing price (for sales since 2010).
SELECT
    t.name AS type,
    d.name AS duration,
    COUNT(*) AS c,
    ROUND(SUM(price)) AS sum_price
FROM uk_price_paid AS upp
INNER JOIN type AS t ON upp.type = t.id
INNER JOIN duration AS d ON upp.duration = d.id
WHERE date >= '2010-01-01'
GROUP BY t.name, d.name
HAVING COUNT(*) >= 100
ORDER BY sum_price DESC;
6) Show information about the streets with the most expensive and cheapest housing.
SELECT
    price,
    date,
    postcode1,
    t.name AS type,
    d.name AS duration,
    street,
    locality,
    town,
    district,
    county
FROM uk_price_paid AS upp
INNER JOIN type AS t ON upp.type = t.id
INNER JOIN duration AS d ON upp.duration = d.id
WHERE street IN (
    (SELECT street
     FROM uk_price_paid
     ORDER BY price DESC
     LIMIT 1)
    UNION ALL
    (SELECT street
     FROM uk_price_paid
     ORDER BY price ASC
     LIMIT 1)
)
ORDER BY price DESC;
The query run times are given in the table below:
The results are impressive since I ran these queries on the weakest hardware configuration for the database.
Conclusion
The Hydra DBMS is a great solution for a data warehouse. Its support of native PostgreSQL dialect will help many implement Hydra into their projects quickly, and columnar data storage will help run complex analytical queries efficiently. At the same time, the ability to create regular PostgreSQL tables puts Hydra on par with the most commonly used DBMSs for OLTP.
Hydra is a budding project that has taken all the advantages of PostgreSQL and added the columnar way of storing table data and parallel execution of queries. I’m looking forward to seeing Hydra’s developers introduce new functionality to perform more complex analytical queries.