Cogs and Levers A blog full of technical stuff

Basic file IO in Perl

One of the most basic, yet most useful operations you can perform in Perl is working with files. In today’s post, I’ll walk you through a few basic patterns to get started with file IO in Perl.

open

The cornerstone of working with a file is the open function. It takes the following forms:

  • open FILEHANDLE,EXPR
  • open FILEHANDLE,MODE,EXPR
  • open FILEHANDLE,MODE,EXPR,LIST
  • open FILEHANDLE,MODE,REFERENCE
  • open FILEHANDLE

FILEHANDLE is the variable that you’ll use to reference the file.

MODE determines the type of access you’re requesting on the file:

Mode Description
< File is opened for reading
> File is opened for writing
>> File is opened for appending
+< File is opened for reading and writing
+> File is opened for reading and writing, but clobbered first
|- EXPR is run as a command whose input we pipe to (we write to the filehandle)
-| EXPR is run as a command whose output we pipe from (we read from the filehandle)
<:encoding(UTF-8) File is opened for reading and interpreted as UTF-8
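As a quick sketch of the piped forms, here we read the output of an external command through a filehandle (the ls command is just an example):

```perl
use strict;
use warnings;

# '-|' runs the command and lets us read its output line by line
open(my $ls, '-|', 'ls', '-l') or die "Can't run ls: $!";

while (my $line = <$ls>) {
  print $line;
}

close $ls;
```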

Throwing on failure

use strict;
use warnings;
 
my $filename = 'data.txt';
open(my $fh, '<:encoding(UTF-8)', $filename)
  or die "Could not open file '$filename' $!";

# TODO: work with the file here

Warning on failure

use strict;
use warnings;
 
my $filename = 'data.txt';
if (open(my $fh, '<:encoding(UTF-8)', $filename)) {
  # TODO: work with the file here
} else {
  warn "Could not open file '$filename' $!";
}

Diamond operator <>

The diamond operator is normally used within a while loop to iterate through a file, line by line:

# File is opened here into $fh

while (my $row = <$fh>) {
  chomp $row;
  print "$row\n";
}
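Putting the pieces together, a complete read sketch (assuming a file named data.txt exists) looks like this:

```perl
use strict;
use warnings;

my $filename = 'data.txt';

# open for reading, dying with the OS error if it fails
open(my $fh, '<:encoding(UTF-8)', $filename)
  or die "Could not open file '$filename' $!";

# read a line at a time with the diamond operator
while (my $row = <$fh>) {
  chomp $row;
  print "$row\n";
}

close $fh or die "Can't close file '$filename': $!";
```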

Writing with print

Writing information to a file is done with print.

# File is opened here into $fh (using >)

print $fh "This is a line of text for the file\n";
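Note there is no comma between the filehandle and the string. A complete write sketch might look like this (the filename out.txt is just an example; > will clobber an existing file):

```perl
use strict;
use warnings;

my $filename = 'out.txt';

# open for writing; '>' truncates any existing file
open(my $fh, '>', $filename)
  or die "Could not open file '$filename' $!";

# note: filehandle, then the list to print -- no comma in between
print $fh "This is a line of text for the file\n";

close $fh or die "Can't close file '$filename': $!";
```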

Finishing up with close

When you’re finished with your files, you’ll use close:

# File is opened here into $fh 
# File work --happens--

close $fh or die "Can't close file: $!"; 

These are just the simple operations for working with files in Perl.

Hashing with OpenSSL

Today’s post will be a quick tip on generating a hash using OpenSSL.

Setup your makefile

We need to reference libssl and libcrypto in our Makefile:

$(CC) $^ -o $(TARGET) -lssl -lcrypto
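For context, a minimal Makefile sketch around that link line might look like the following (the target and source names here are illustrative):

```makefile
CC     = gcc
TARGET = test
OBJS   = main.o

# link against libssl and libcrypto
$(TARGET): $(OBJS)
	$(CC) $^ -o $(TARGET) -lssl -lcrypto
```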

The code

A simple main function that will hash a simple message:

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

int main(int argc, char* argv[]) {

   SHA256_CTX ctx;
   unsigned char digest[32];
   char *msg = "hello";

   SHA256_Init(&ctx);
   SHA256_Update(&ctx, msg, strlen(msg));
   SHA256_Final(digest, &ctx);

   /* %02x pads with leading zeroes; plain %x would silently drop them */
   for (int i = 0; i < 32; i++) {
      printf("%02x", digest[i]);
   }

   printf("\n");

   return 0;
}

Testing

Running this application generates the hash of the word “hello”:

$ ./test
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

We can verify our result using sha256sum:

$ echo -n hello | sha256sum 
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  -

Actors in Scala

The actor model is a software pattern that has been developed to make concurrent programming easier by promoting a lack of shared state. From the wikipedia article:

The actor model in computer science is a mathematical model of concurrent computation that treats “actors” as the universal primitives of concurrent computation. In response to a message that it receives, an actor can: make local decisions, create more actors, send more messages, and determine how to respond to the next message received. Actors may modify private state, but can only affect each other through messages (avoiding the need for any locks).

In today’s article, I’ll show you a couple of primitive examples demonstrating the Akka framework using Scala.

Basic setup

Before starting, you’ll need to make your application depend on the Akka libraries; my build.sbt looks as follows:

name := "actor-basic"

version := "1.0"

scalaVersion := "2.12.1"

libraryDependencies ++= {

  val akkaVersion = "2.4.17"

  Seq(
    "com.typesafe.akka" %% "akka-actor" % akkaVersion
  )
}

We only needed to add akka-actor. There is a whole host of different sub-libraries, each providing its own piece of extra functionality.

Testing primes

In today’s example, we’re going to make an Actor that tests prime numbers. The code for the isPrime function below has been lifted from here. Seems to do the job nicely.

case class PotentialPrime(n: Int)

class PrimeTester extends Actor {
  def receive = {
    case PotentialPrime(n) => println(s"prime: ${isPrime(n)}")
  }

  def isPrime(n: Int) = (2 until n) forall (n % _ != 0)
}

The first class here, PotentialPrime, is a message class. It’s the class that will hold the information used as input for the Actor to do something. In this case, we’re carrying a number that could be a potential prime. This is then received by the PrimeTester actor in the receive method. You can see that we pattern match on the message type, in this case PotentialPrime, to start prime testing.

Note that this is one-way. No information is sent back to the caller or to the actor system. The information being passed, and state remains within the actor.
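One caveat with the lifted isPrime: for n of 1 (or less), the range (2 until n) is empty, so forall is vacuously true and the number is reported as prime. A guarded sketch:

```scala
// trial division, with a guard so that n <= 1 is never reported as prime
def isPrime(n: Int): Boolean =
  n > 1 && (2 until n).forall(n % _ != 0)
```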

We then set up a small system and an actor, and pass it a message:

object ActorTest extends App {
  val system = ActorSystem("actor-testing")
  val actor1 = system.actorOf(Props[PrimeTester], name="prime-tester-actor")

  actor1 ! PotentialPrime(21)

  system.terminate()
}

We create the ActorSystem and then create an actor within that system using actorOf. The ! means “fire-and-forget”. This will send a message asynchronously and return immediately. This function is also known as tell.

We run this application, and as expected:

prime: false

Finding an actor

In a system, you can also find existing actors using their path. Like a file system where you have a hierarchical system of directories and files, actors also have parent/child relationships. In the example above, we would be able to find actor1 by its path should we use the following:

val path = system / "prime-tester-actor"
val actorRef = system.actorSelection(path)

actorRef ! PotentialPrime(19)

Actors

There are some important pieces to the Actor API that will give you a much finer level of control over your actors.

You can use unhandled to define the behavior of your actor when it receives a message that did not get handled.

override def unhandled(message: Any): Unit = {
  println("Unhandled message encountered")
}

self is an ActorRef that can be used by the actor to send itself messages.

sender is the ActorRef of whoever sent the message currently being processed, and context provides an ActorContext telling you about the current message and current actor.

supervisorStrategy defines the strategy applied when a failure occurs in a child actor. It can be overridden.

preStart, preRestart, postStop and postRestart are all hook functions that you can tap into to add functionality.
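As a sketch, hooking into a couple of these looks as follows (the actor here is illustrative):

```scala
import akka.actor.Actor

class AuditedActor extends Actor {
  // runs once before the actor starts processing messages
  override def preStart(): Unit =
    println(s"${self.path} starting")

  // runs after the actor has fully stopped
  override def postStop(): Unit =
    println(s"${self.path} stopped")

  def receive = {
    case msg => println(s"received: $msg")
  }
}
```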

Feedback

Sending information back to the sender is pretty easy. It’s a matter of bundling the information you need to send, into a message and sending. Adapting the primes example above a little more, the actor code changes just slightly:

class PrimeTester extends Actor {
  def receive = {
    case PotentialPrime(n) => sender ! isPrime(n)
  }

  def isPrime(n: Int) = (2 until n) forall (n % _ != 0)
}

Rather than just printing something out now, we’re sending a message back to sender.

When we ask or ? an actor for some information back, we don’t immediately receive the result. We receive a Future that will give us the result once it’s ready. So, the calling code becomes a trivial Future example:

val system = ActorSystem("actor-testing")
val actor1 = system.actorOf(Props[PrimeTester], name="prime-tester-actor")

implicit val timeout: Timeout = Timeout(Duration.create(5, TimeUnit.SECONDS))
implicit val ec: ExecutionContext = system.dispatcher

val future = actor1 ? PotentialPrime(21)

future onComplete {
  case Success(b) => println(s"Result was ${b}")
  case Failure(e) => e.printStackTrace()
}

system.terminate()
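For reference, the ask example above assumes imports along these lines (the ? operator comes from akka.pattern.ask):

```scala
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.ExecutionContext
import scala.concurrent.duration.Duration
import java.util.concurrent.TimeUnit
import scala.util.{Failure, Success}
```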

You can simplify this further by using Await:

val system = ActorSystem("actor-testing")
val actor1 = system.actorOf(Props[PrimeTester], name="prime-tester-actor")
implicit val timeout = Timeout(Duration.create(5, TimeUnit.SECONDS))

val future = actor1 ? PotentialPrime(21)
val result = Await.result(future, timeout.duration)

println(s"Result is ${result}")

system.terminate()

That’s enough acting for today.

Starting a microservice with Scala

A microservice is an architectural pattern that allows your services to be deployed in an isolated fashion. This isolation allows each service to remain focused on the problem (and only the problem) that it’s trying to solve, as well as simplifying telemetry, instrumentation, and measurement. From Martin Fowler’s site:

The term “Microservice Architecture” has sprung up over the last few years to describe a particular way of designing software applications as suites of independently deployable services. While there is no precise definition of this architectural style, there are certain common characteristics around organization around business capability, automated deployment, intelligence in the endpoints, and decentralized control of languages and data.

If you want to learn more about microservices, seriously, check out google. They’re everywhere!

The purpose of today’s article is to stand a microservice up in Scala, to get up and running quickly.

Getting started

In a previous article, I showed you how you can create a scala project structure with a shell script. We’ll use that right now to create our project microservice-one.

$ new-scala-project microservice-one
$ cd microservice-one
$ tree
.
├── build.sbt
├── lib
├── project
├── src
│   ├── main
│   │   ├── java
│   │   ├── resources
│   │   └── scala
│   └── test
│       ├── java
│       ├── resources
│       └── scala
└── target

12 directories, 1 file

Dependencies

Now we’ll need to sort out our dependencies.

We’ll need scalatest for testing, akka and akka-http to help us make our API concurrent/parallel as well as available over HTTP. Our build.sbt file should look like this:

name := "microservice-one"
organization := "me.tuttlem"
version := "1.0.0"
scalaVersion := "2.12.1"

scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8")

libraryDependencies ++= {
  
  val akkaV       = "2.4.16"
  val akkaHttpV   = "10.0.1"
  val scalaTestV  = "3.0.1"
  
  Seq(
    "com.typesafe.akka" %% "akka-actor" % akkaV,
    "com.typesafe.akka" %% "akka-stream" % akkaV,
    "com.typesafe.akka" %% "akka-testkit" % akkaV,
    "com.typesafe.akka" %% "akka-http" % akkaHttpV,
    "com.typesafe.akka" %% "akka-http-spray-json" % akkaHttpV,
    "com.typesafe.akka" %% "akka-http-testkit" % akkaHttpV,
    "org.scalatest"     %% "scalatest" % scalaTestV % "test"
  )
  
}

Update our project now:

$ sbt update

The code

We’re going to dump everything into one file today; the main application object. All of the parts are very descriptive though and I’ll go through each one. Our microservice is going to have one route, which is a GET on /greeting. It’ll return us a simple message.

First up, we model how the message will look:

case class Greeting(message: String)

Using this case class, you’d expect messages to be returned that look like this:

{ "message": "Here is the message!" }

We tell the application how to serialize this data over the http channel using Protocols:

trait Protocols extends DefaultJsonProtocol {
  implicit val greetingFormat = jsonFormat1(Greeting.apply)
}

Now, we can put together our actual service implementation. Take a look specifically at the routing DSL that akka-http provides for route definition:

trait Service extends Protocols {
  implicit val system: ActorSystem
  implicit def executor: ExecutionContextExecutor
  implicit val materializer: Materializer

  def config: Config
  val logger: LoggingAdapter

  val routes = {
    logRequestResult("microservice-one") {
      pathPrefix("greeting") {
        get {
          complete(Greeting("Hello to you!"))
        }
      }
    }
  }

}

So, our one route here will constantly just send out “Hello to you!”.

Finally, all of this gets hosted in our main application object:

object MicroserviceOne extends App with Service {
  override implicit val system = ActorSystem()
  override implicit val executor = system.dispatcher
  override implicit val materializer = ActorMaterializer()

  override val config = ConfigFactory.load()
  override val logger = Logging(system, getClass)

  Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.port"))
}
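For completeness, this single-file example assumes imports along these lines (package paths per the akka-http 10.0.x API):

```scala
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.http.scaladsl.Http
import akka.http.scaladsl.marshallers.sprayjson.SprayJsonSupport._
import akka.http.scaladsl.server.Directives._
import akka.stream.{ActorMaterializer, Materializer}
import com.typesafe.config.{Config, ConfigFactory}
import scala.concurrent.ExecutionContextExecutor
import spray.json.DefaultJsonProtocol
```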

That’s it for the code. In the src/main/resources directory, we’ll put a application.conf file that details a few configurations for us:

akka {
	loglevel = DEBUG
}

http {
	interface = "0.0.0.0"
	port = 3000
}

Running

Let’s give it a run now.

$ sbt run 

Once SBT has finished its dirty work, you’ll be able to request your route at http://localhost:3000/greeting:

$ curl -v http://localhost:3000/greeting
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 3000 (#0)
> GET /greeting HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: akka-http/10.0.1
< Date: Thu, 16 Feb 2017 22:37:52 GMT
< Content-Type: application/json
< Content-Length: 27
< 
* Connection #0 to host localhost left intact
{"message":"Hello to you!"}

Perfect.

That’s all for today.

Create a UDF for Hive with Scala

In today’s post, I’m going to walk through the basic process of creating a user defined function for Apache Hive using Scala.

A quick but important note: I needed to use JDK 1.7 to complete the following. Using 1.8 produced errors suggesting that Hive on my distribution of Hadoop was not supported.

Setup your project

Create an sbt-based project, and start off by adding the following to your project/assembly.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

This adds the sbt-assembly plugin to your project, which allows you to bundle your Scala application up as a fat JAR. When we issue the command sbt assembly at the console, we invoke this plugin to construct the fat JAR for us.

Now we fill out the build.sbt. We need to reference an external JAR, called hive-exec. This JAR is available by itself from the maven repository. I took a copy of mine from the hive distribution installed on my server. Anyway, it lands in the project’s lib folder.

name := "hive-udf"
version := "1.0"
scalaVersion := "2.11.1"
unmanagedJars in Compile += file("./lib/hive-exec-2.1.1.jar")

Write your function

Now it’s time to actually start writing some functions. In the following module, we’re just performing some basic string manipulation with trim, toUpperCase and toLowerCase. Each of which is contained in its own class, deriving from the UDF type:

scala/StringFunctions.scala

package me.tuttlem.udf

import org.apache.hadoop.hive.ql.exec.UDF

class TrimString extends UDF {
  def evaluate(str: String): String = {
    str.trim
  }
}

class UpperCaseString extends UDF {
  def evaluate(str: String): String = {
    str.toUpperCase
  }
}

class LowerCaseString extends UDF {
  def evaluate(str: String): String = {
    str.toLowerCase
  }
}

Now that we’ve written all of the code, it’s time to compile and assemble our JAR:

$ sbt assembly

To invoke

Copying the JAR across into a place accessible to Hive is the first step here. Once that’s done, we can start up the hive shell and add it to the session:

ADD JAR /path/to/the/jar/my-udfs.jar;

Then, using the CREATE FUNCTION syntax, we can start to reference pieces of our module:

CREATE FUNCTION trim as 'me.tuttlem.udf.TrimString';
CREATE FUNCTION toUpperCase as 'me.tuttlem.udf.UpperCaseString';
CREATE FUNCTION toLowerCase as 'me.tuttlem.udf.LowerCaseString';

We can now use our functions:

hive> CREATE FUNCTION toUpperCase as 'me.tuttlem.udf.UpperCaseString';
OK
Time taken: 0.537 seconds
hive> SELECT toUpperCase('a test string');
OK
A TEST STRING
Time taken: 1.399 seconds, Fetched: 1 row(s)

hive> CREATE FUNCTION toLowerCase as 'me.tuttlem.udf.LowerCaseString';
OK
Time taken: 0.028 seconds
hive> SELECT toLowerCase('DON\'T YELL AT ME!!!');
OK
don't yell at me!!!
Time taken: 0.093 seconds, Fetched: 1 row(s)

That’s it!