How the CUME_DIST() function works in Mariadb?

The MariaDB CUME_DIST() function is used to calculate the cumulative distribution of a value within a group of values.

Posted on

The MariaDB CUME_DIST() function is used to calculate the cumulative distribution of a value within a group of values. It returns the relative position of a specified value in a group of values, ranging from 0 to 1.

Syntax

The syntax for the MariaDB CUME_DIST() function is as follows:

CUME_DIST() OVER (
    PARTITION BY expression1
    ORDER BY expression2 [ASC | DESC]
)
  • PARTITION BY expression1 (optional): This clause is used to divide the result set into partitions or groups based on the specified expression.
  • ORDER BY expression2 [ASC | DESC]: This clause is required and is used to sort the values within each partition based on the specified expression. The sort order can be either ascending (ASC) or descending (DESC).

The CUME_DIST() function returns a value between 0 and 1, representing the cumulative distribution of the current row’s value within its partition.

Examples

Example 1: Basic Usage

This example demonstrates the basic usage of the CUME_DIST() function.

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    salary INT
);

INSERT INTO employees (id, name, salary) VALUES
    (1, 'John', 5000),
    (2, 'Jane', 6000),
    (3, 'Bob', 4000),
    (4, 'Alice', 6000),
    (5, 'Mike', 7000);

SELECT
    name,
    salary,
    CUME_DIST() OVER (ORDER BY salary) AS cume_dist
FROM employees;

The output will be:

+-------+--------+--------------+
| name  | salary | cume_dist    |
+-------+--------+--------------+
| Bob   |   4000 | 0.2000000000 |
| John  |   5000 | 0.4000000000 |
| Alice |   6000 | 0.8000000000 |
| Jane  |   6000 | 0.8000000000 |
| Mike  |   7000 | 1.0000000000 |
+-------+--------+--------------+

This example shows the cumulative distribution of salaries for each employee. The CUME_DIST() function calculates the relative position of each salary value within the overall sorted set of salaries.

Example 2: Partition by Department

This example demonstrates how to use the PARTITION BY clause to partition the data by department.

DROP TABLE IF EXISTS employees;
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    salary INT,
    dept VARCHAR(50)
);

INSERT INTO employees (id, name, salary, dept) VALUES
    (1, 'John', 5000, 'Sales'),
    (2, 'Jane', 6000, 'Marketing'),
    (3, 'Bob', 4000, 'Sales'),
    (4, 'Alice', 6000, 'Marketing'),
    (5, 'Mike', 7000, 'Sales');

SELECT
    name,
    dept,
    salary,
    CUME_DIST() OVER (PARTITION BY dept ORDER BY salary) AS cume_dist
FROM employees;

The output will be:

+-------+-----------+--------+--------------+
| name  | dept      | salary | cume_dist    |
+-------+-----------+--------+--------------+
| Jane  | Marketing |   6000 | 1.0000000000 |
| Alice | Marketing |   6000 | 1.0000000000 |
| Bob   | Sales     |   4000 | 0.3333333333 |
| John  | Sales     |   5000 | 0.6666666667 |
| Mike  | Sales     |   7000 | 1.0000000000 |
+-------+-----------+--------+--------------+

In this example, the CUME_DIST() function calculates the cumulative distribution within each department, partitioning the data by the dept column and sorting by the salary column.

Example 3: Descending Order

This example demonstrates how to use the DESC keyword to sort the values in descending order.

SELECT
    name,
    salary,
    CUME_DIST() OVER (ORDER BY salary DESC) AS cume_dist
FROM employees;

The output will be:

+-------+--------+--------------+
| name  | salary | cume_dist    |
+-------+--------+--------------+
| Mike  |   7000 | 0.2000000000 |
| Alice |   6000 | 0.6000000000 |
| Jane  |   6000 | 0.6000000000 |
| John  |   5000 | 0.8000000000 |
| Bob   |   4000 | 1.0000000000 |
+-------+--------+--------------+

In this example, the CUME_DIST() function calculates the cumulative distribution based on the descending order of salaries.

Example 4: Tie Values

This example shows how the CUME_DIST() function handles tie values.

DROP TABLE IF EXISTS students;
CREATE TABLE students (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    score INT
);

INSERT INTO students (id, name, score) VALUES
    (1, 'John', 80),
    (2, 'Jane', 90),
    (3, 'Bob', 85),
    (4, 'Alice', 90),
    (5, 'Mike', 80);

SELECT
    name,
    score,
    CUME_DIST() OVER (ORDER BY score) AS cume_dist
FROM students;

The output will be:

+-------+-------+--------------+
| name  | score | cume_dist    |
+-------+-------+--------------+
| John  |    80 | 0.4000000000 |
| Mike  |    80 | 0.4000000000 |
| Bob   |    85 | 0.6000000000 |
| Alice |    90 | 1.0000000000 |
| Jane  |    90 | 1.0000000000 |
+-------+-------+--------------+

In this example, the CUME_DIST() function assigns the same cumulative distribution value to rows with tie values (in this case, ‘John’ and ‘Mike’ have the same score of 80).

Example 5: Multiple Partitions and Order By

This example demonstrates how to use multiple partitions and order by clauses.

DROP TABLE IF EXISTS sales;
CREATE TABLE sales (
    id INT PRIMARY KEY,
    product VARCHAR(50),
    region VARCHAR(50),
    revenue INT
);

INSERT INTO sales (id, product, region, revenue) VALUES
    (1, 'Product A', 'North', 1000),
    (2, 'Product A', 'South', 2000),
    (3, 'Product B', 'North', 1500),
    (4, 'Product B', 'South', 2500),
    (5, 'Product A', 'North', 1200);

SELECT
    product,
    region,
    revenue,
    CUME_DIST() OVER (PARTITION BY product, region ORDER BY revenue) AS cume_dist
FROM sales;

The output will be:

+-----------+--------+---------+--------------+
| product   | region | revenue | cume_dist    |
+-----------+--------+---------+--------------+
| Product A | North  |    1000 | 0.5000000000 |
| Product A | North  |    1200 | 1.0000000000 |
| Product A | South  |    2000 | 1.0000000000 |
| Product B | North  |    1500 | 1.0000000000 |
| Product B | South  |    2500 | 1.0000000000 |
+-----------+--------+---------+--------------+

In this example, the CUME_DIST() function calculates the cumulative distribution within each product and region partition, sorted by the revenue column.

The following are a few functions related to the MariaDB CUME_DIST() function:

  • MariaDB RANK() function is used to assign a rank to each row within a partition, with ties receiving the same rank.
  • MariaDB DENSE_RANK() function is used to assign a rank to each row within a partition, with ties receiving consecutive ranks.
  • MariaDB PERCENT_RANK() function is used to calculate the relative rank of a value within a group of values, ranging from 0 to 1.

Conclusion

The MariaDB CUME_DIST() function is a powerful tool for calculating the cumulative distribution of values within groups or partitions. It can be used in various scenarios, such as analyzing sales data, academic performance, or any other data that requires ranking or distribution analysis. By understanding how to use the CUME_DIST() function effectively, you can gain valuable insights into your data and make informed decisions based on the relative positions of values within their respective groups.