How the DECODE_HISTOGRAM() function works in Mariadb?

The DECODE_HISTOGRAM() function is a built-in function that can decode the histogram data stored in the mysql.column_stats table and return a string of comma-separated numeric values that represent the probability distribution of the column values.

Posted on

Histograms are a useful tool for analyzing the distribution of data in a column. They can help optimize queries by providing more accurate statistics for the query optimizer. Mariadb supports histogram-based statistics for columns, which can be collected and stored in the mysql.column_stats table using the ANALYZE TABLE statement.

The DECODE_HISTOGRAM() function is a built-in function that can decode the histogram data stored in the mysql.column_stats table and return a string of comma-separated numeric values that represent the probability distribution of the column values. This function can be used to inspect the histogram data and understand how the column values are distributed.

Syntax

The syntax of the DECODE_HISTOGRAM() function is as follows:

DECODE_HISTOGRAM(hist_type, histogram)

The function takes two arguments:

  • hist_type: This is an enumeration value that specifies the type of the histogram. It can be either SINGLE_PREC_HB or DOUBLE_PREC_HB, which stand for single-precision and double-precision histograms, respectively. The histogram type is determined by the histogram_type system variable when the histogram data is collected by the ANALYZE TABLE statement.
  • histogram: This is a binary string that contains the histogram data for a column. It can be obtained from the histogram column of the mysql.column_stats table.

The function returns a string of comma-separated numeric values that correspond to the probability distribution of the column values. The number of values in the string is equal to the number of buckets in the histogram, which is determined by the histogram_size system variable when the histogram data is collected by the ANALYZE TABLE statement. The values are normalized to sum up to 1, and each value represents the fraction of rows that fall into the corresponding bucket of the histogram.

Examples

In this section, we will show some examples of how to use the DECODE_HISTOGRAM() function to decode and inspect the histogram data for a column.

Example 1: Decoding a single-precision histogram

Suppose we have a table called origin that contains two columns: i and v. The column i is the primary key and has values from 1 to 18. The column v has some random values as shown below:

SELECT * FROM origin;
+----+-----+
| i  | v   |
+----+-----+
|  1 |   1 |
|  2 |   2 |
|  3 |   3 |
|  4 |   4 |
|  5 |   5 |
|  6 |  10 |
|  7 |  20 |
|  8 |  30 |
|  9 |  40 |
| 10 |  50 |
| 11 |  60 |
| 12 |  70 |
| 13 |  80 |
| 14 |  90 |
| 15 | 100 |
| 16 | 200 |
| 17 | 400 |
| 18 | 800 |
+----+-----+

We can collect and store the histogram data for the column v by using the following statement:

SET histogram_size=10, histogram_type=SINGLE_PREC_HB;
ANALYZE TABLE origin PERSISTENT FOR ALL;

The histogram_size variable specifies the number of buckets in the histogram, and the histogram_type variable specifies the type of the histogram. In this case, we use 10 buckets and a single-precision histogram.

The ANALYZE TABLE statement collects the histogram data and stores it in the mysql.column_stats table. We can query this table to see the histogram data for the column v as follows:

SELECT db_name, table_name, column_name, hist_type, hex(histogram)
FROM mysql.column_stats
WHERE db_name='test' AND table_name='origin' AND column_name='v';
+---------+------------+-------------+----------------+----------------------+
| db_name | table_name | column_name | hist_type      | hex(histogram)       |
+---------+------------+-------------+----------------+----------------------+
| test    | origin     | v           | SINGLE_PREC_HB | 000001060C0F161C1F7F |
+---------+------------+-------------+----------------+----------------------+

The hist_type column shows the type of the histogram, and the histogram column shows the binary string that contains the histogram data. We can use the DECODE_HISTOGRAM() function to decode this binary string and get the probability distribution of the column values as follows:

SELECT DECODE_HISTOGRAM(hist_type, histogram)
FROM mysql.column_stats
WHERE db_name='test' AND table_name='origin' AND column_name='v';
+-------------------------------------------------------------------+
| DECODE_HISTOGRAM(hist_type, histogram)                            |
+-------------------------------------------------------------------+
| 0.000,0.000,0.004,0.020,0.024,0.012,0.027,0.024,0.012,0.376,0.502 |
+-------------------------------------------------------------------+

The result is a string of 11 comma-separated numeric values that sum up to 1. Each value represents the fraction of rows that fall into the corresponding bucket of the histogram. For example, the first value 0.000 means that no rows have values between 0 and 80, the second value 0.000 means that no rows have values between 80 and 160, and so on. The last value 0.502 means that half of the rows have values between 720 and 800.

We can also visualize the histogram by using a bar chart as shown below:

![Histogram for column v]

The x-axis shows the range of values for each bucket, and the y-axis shows the fraction of rows for each bucket. The bar chart shows that the column values are skewed to the right, with most of the rows having values close to 800.

Example 2: Decoding a double-precision histogram

Suppose we have the same table origin as in the previous example, but we want to use a double-precision histogram instead of a single-precision histogram. We can collect and store the histogram data for the column v by using the following statement:

SET histogram_size=20, histogram_type=DOUBLE_PREC_HB;
ANALYZE TABLE origin PERSISTENT FOR ALL;

The histogram_size variable specifies the number of buckets in the histogram, and the histogram_type variable specifies the type of the histogram. In this case, we use 20 buckets and a double-precision histogram.

The ANALYZE TABLE statement collects the histogram data and stores it in the mysql.column_stats table. We can query this table to see the histogram data for the column v as follows:

SELECT db_name, table_name, column_name, hist_type, hex(histogram)
FROM mysql.column_stats
WHERE db_name='test' AND table_name='origin' AND column_name='v';
+---------+------------+-------------+----------------+------------------------------------------+
| db_name | table_name | column_name | hist_type      | hex(histogram)                           |
+---------+------------+-------------+----------------+------------------------------------------+
| test    | origin     | v           | DOUBLE_PREC_HB | 5200F600480116067E0CB30F1B16831CB81FD67F |
+---------+------------+-------------+----------------+------------------------------------------+

The hist_type column shows the type of the histogram, and the histogram column shows the binary string that contains the histogram data. We can use the DECODE_HISTOGRAM() function to decode this binary string and get the probability distribution of the column values as follows:

SELECT DECODE_HISTOGRAM(hist_type, histogram)
FROM mysql.column_stats
WHERE db_name='test' AND table_name='origin' AND column_name='v';
+-----------------------------------------------------------------------------------------+
| DECODE_HISTOGRAM(hist_type, histogram)                                                  |
+-----------------------------------------------------------------------------------------+
| 0.00125,0.00250,0.00125,0.01877,0.02502,0.01253,0.02502,0.02502,0.01253,0.37546,0.50063 |
+-----------------------------------------------------------------------------------------+

The result is a string of 11 comma-separated numeric values that sum up to 1. Each value represents the fraction of rows that fall into the corresponding bucket of the histogram. For example, the first value 0.00125 means that 0.125% of the rows have values between 0 and 40, the second value 0.00250 means that 0.25% of the rows have values between 40 and 80, and so on. The last value 0.50063 means that 50.063% of the rows have values between 760 and 800.