Cogs and Levers A blog full of technical stuff

Window Function DENSE_RANK

This article is an extension of the window function, to follow up on each of the window functions.

The DENSE_RANK function will use the ORDER BY expression to determine a rank of a value amongst groups. If a PARTITION BY is present, this is also included in the reset of groups.

The important difference between RANK and DENSE_RANK is that RANK will use to total number of rows to nominate it’s highest rank value; whereas DENSE_RANK will only use the number of ranks calculated, and the highest rank.

Take the following for example:

SELECT sale_id, quantity, unit_cost,
   DENSE_RANK() OVER (ORDER BY quantity DESC),
   RANK() OVER (ORDER BY quantity DESC)
FROM sales
ORDER BY quantity, sale_id;

The sale_id, quantity, and unit_cost are retrieved and ranked by the quantity value. Both RANK and DENSE_RANK are used in this example to illustrate the difference in value distribution (along the number line).

As there is a change in quantity, a new rank value is assigned:

sale_id quantity unit_cost dense_rank rank
1 1 26 4 11
3 1 156 4 11
4 1 96 4 11
7 1 156 4 11
9 1 20 4 11
11 1 26 4 11
13 1 156 4 11
14 1 96 4 11
17 1 156 4 11
19 1 20 4 11
5 2 96 3 7
8 2 26 3 7
15 2 96 3 7
18 2 26 3 7
2 3 20 2 3
10 3 156 2 3
12 3 20 2 3
20 3 156 2 3
6 5 26 1 1
16 5 26 1 1

Window Function CUME_DIST

This article is an extension of the window function, to follow up on each of the window functions.

The CUME_DIST function calculates the cumulative distribution of a value, for the data that’s seen.

An example of calculating the cumulative distribution of each salesperson, and the amount of value that they’ve sold would look like this:

SELECT salesperson, quantity::decimal * unit_cost,
	   CUME_DIST() OVER (
         PARTITION BY salesperson
         ORDER BY (quantity::decimal * unit_cost)
       ) as distr
FROM public.sales
ORDER BY salesperson;

As sales people have sold the same amount in different sales, you can see that the cumulative distribution value makes accumulative jumps.

salesperson value distr
Bob 26 0.25
Bob 52 0.5
Bob 192 1.0
Bob 192 1.0
John 52 0.125
John 60 0.25
John 96 0.5
John 96 0.5
John 156 0.875
John 156 0.875
John 156 0.875
John 468 1.0
June 20 0.3333333333333333
June 156 0.6666666666666666
June 468 1.0
Sally 20 0.2
Sally 26 0.4
Sally 60 0.6
Sally 130 1.0
Sally 130 1.0

Taking an interesting piece of this window, we focus in on John.

salesperson value distr
John 52 0.125
John 60 0.25
John 96 0.5
John 96 0.5
John 156 0.875
John 156 0.875
John 156 0.875
John 468 1.0

The repeated value amounts don’t seem to shift the distribution value (because of its cumulativity). By the time the next amount is present, there’s a significant jump. This is a property of this set because the data is ordered by the value column.

Window Function COUNT

This article is an extension of the window function, to follow up on each of the window functions.

The COUNT function will counts each row that contains non-NULL values for the given expression.

An example usage over the test data set would be:

SELECT sale_date, salesperson,
	   COUNT(*) OVER (
         PARTITION BY sale_date
         ORDER BY sale_id
         ROWS UNBOUNDED PRECEDING
       ) as sale_number
FROM public.sales
ORDER BY sale_date;

We list the sale_date and salesperson out for every sale on record. The COUNT partitions by the date of the sale made. This makes the sale_number tell us what number the sale was for that given date.

sale_date salesperson sales_count
2018-05-02 Bob 1
2018-05-13 Sally 1
2018-05-13 June 2
2018-05-14 John 1
2018-05-25 Bob 1
2018-05-25 Sally 2
2018-05-26 John 1
2018-05-27 John 1
2018-05-27 June 2
2018-05-27 June 3
2018-06-02 Sally 1
2018-06-03 John 1
2018-06-03 John 2
2018-06-12 John 1
2018-06-13 Bob 1
2018-06-13 Sally 2
2018-06-15 John 1
2018-06-24 Bob 1
2018-06-24 Sally 2
2018-06-29 John 1

Taking 2018-05-27, we can see that John got the first sale, June got the second and third.

Window Function AVG

This article is an extension of the window function, to follow up on each of the window functions.

The AVG function takes the arithmetic mean of the input that it sees.

An example usage over the test data set would be:

SELECT salesperson, sale_date, (quantity * unit_cost),
	   avg(quantity * unit_cost) OVER (
         PARTITION BY salesperson
       ) as month_avg
FROM public.sales
ORDER BY salesperson, sale_date DESC;

This takes the average cost value (quantity * unit_cost), but personalises the value by salesperson. Each salesperson is going to have their average calculated in isolation because of PARTITION BY salesperson.

Range generation in Redshift

A little while ago, I’d written an article on Range generation in PostgreSQL, and whilst Redshift and PostgreSQL are similar database beasts, they are quite different in terms of their query capabilities due to their different strengths.

In today’s article, we’ll walk through a couple of different solutions to range generation.

generate_series

The generate_series is a pretty serious tool in terms of achieving exactly what this article is talking about. Unfortunately, there is no analog for this function within the Redshift database management system. Who knows; maybe one day.

Until then though, we need to create our own.

The query

with digit as (
    select 0 as d union all
    select 1 union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9
),
seq as (
    select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
    from digit a
        cross join
        digit b
        cross join
        digit c
        cross join
        digit d
    order by 1
),
drange as (
    SELECT (getdate()::date - seq.num)::date       as start_date
    FROM seq
    LIMIT 20
)
SELECT start_date
FROM drange;

There’s a bit going on in here.

The process starts with the digit CTE. Is does nothing more than counts from 0, up to 9.

select 0 as d union all
select 1 union all select 2 union all select 3 union all
select 4 union all select 5 union all select 6 union all
select 7 union all select 8 union all select 9

The next CTE seq uses the digits CTE, 4 times. Each time its treated as an increasing power of 10. a handles the 10^0, b handles the 10^1, etc. This catersian effect quickly multiplies our 0-9 range up to 0-9999. If more numbers are required, we simple add another CROSS JOIN, and treat the next power of 10.

select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
from digit a
    cross join
    digit b
    cross join
    digit c
    cross join
    digit d
order by 1

That’s it as far a number generation. It’s pretty easy.

I’ve seen other developers materialise this information into a physical table, and simply join to that table also.

Extending into time

Applying this into some date ranges now, we use the seq CTE and a little date arithmatic. My example here just uses the numbers within seq to enumerate dates. Adding LIMIT to this query also prevents us from utilising all 10000 values produced by seq.

SELECT (getdate()::date - seq.num)::date       as start_date
FROM seq
LIMIT 20

And that it.