Cogs and Levers A blog full of technical stuff

More Window Functions

This article is an extension of the window function, to follow up on each of the window functions.

MAX

The MAX function will retrieve the maximum value that it sees within the window.

MEDIAN

The MEDIAN function will calculate the median value for the range seen, within the window.

MIN

The MIN function will retrieve the minimum value that it sees within the window.

NTH_VALUE

Where the LAG and LEAD values are relative to the row in question, NTH_VALUE will retain the value at the literal offset specified.

NTILE

NTILE ranks rows into equally proportioned groups within the window seen by the expression.

PERCENT_RANK

Calculates the percent rank on rows seen by the window. The formula calculation is defined as:

(x - 1) / (the number of rows in the window or partition - 1)

PERCENTILE_CONT

PERCENTIL_CONT will calculate the linear interpolation between ordered values.

PERCENTILE_DISC

Returns the value with the smallest cumulative distribution value.

RATIO_TO_REPORT

Calculates a percentage where the row’s value is the divisor, and the total amount for the window is the dividend.

In this example, we’ll use RATIO_TO_REPORT to show us the percentage each sale makes over the period of a single day.

SELECT sale_date, salesperson, quantity::decimal * unit_cost,
  RATIO_TO_REPORT(quantity::decimal * unit_cost) OVER (PARTITION BY sale_date)
FROM public.sales
ORDER BY sale_date, sale_id;

RATIO_TO_REPORT(quantity::decimal * unit_cost) is what gives us the value that we’re working with in terms of ratio; the PARTITION BY sale_date then gives us the window; these ratios need to be calculated for the day.

sale_date salesperson ?column? ratio_to_report
2018-05-02 Bob 26 1.0
2018-05-13 Sally 60 0.2777777777777778
2018-05-13 June 156 0.7222222222222222
2018-05-14 John 96 1.0
2018-05-25 Bob 192 0.5962732919254659
2018-05-25 Sally 130 0.40372670807453415
2018-05-26 John 156 1.0
2018-05-27 John 52 0.0962962962962963
2018-05-27 June 20 0.037037037037037035
2018-05-27 June 468 0.8666666666666667
2018-06-02 Sally 26 1.0
2018-06-03 John 60 0.2777777777777778
2018-06-03 John 156 0.7222222222222222
2018-06-12 John 96 1.0
2018-06-13 Bob 192 0.5962732919254659
2018-06-13 Sally 130 0.40372670807453415
2018-06-15 John 156 1.0
2018-06-24 Bob 52 0.7222222222222222
2018-06-24 Sally 20 0.2777777777777778
2018-06-29 John 468 1.0

ROW_NUMBER

ROW_NUMBER is a utility function that simply gives the row an ordinal value, counting up from 1; over the window.

We count the sales for the day, by applying ROW_NUMBER over the sale_date.

SELECT sale_date, salesperson, quantity::decimal * unit_cost,
  ROW_NUMBER() OVER (PARTITION BY sale_date)
FROM public.sales
ORDER BY sale_date, sale_id;
sale_date salesperson ?column? row_number
2018-05-02 Bob 26 1
2018-05-13 Sally 60 1
2018-05-13 June 156 2
2018-05-14 John 96 1
2018-05-25 Bob 192 1
2018-05-25 Sally 130 2
2018-05-26 John 156 1
2018-05-27 John 52 1
2018-05-27 June 20 2
2018-05-27 June 468 3
2018-06-02 Sally 26 1
2018-06-03 John 60 1
2018-06-03 John 156 2
2018-06-12 John 96 1
2018-06-13 Bob 192 1
2018-06-13 Sally 130 2
2018-06-15 John 156 1
2018-06-24 Bob 52 1
2018-06-24 Sally 20 2
2018-06-29 John 468 1

STDDEV_SAMP and STDDEV_POP

STDDEV_SAMP and STDDEV_POP will find the sample and population standard deviation of the values seen in a window.

SUM

The SUM function will retrieve the accumulated sum of an expression over the defined window.

VAR_SAMP and VAR_POP

VAR_SAMP and VAR_POP will find the sample and population variance of the values seen in a window.

Window Function LISTAGG

This article is an extension of the window function, to follow up on each of the window functions.

The LISTAGG function will aggregate values that are collected within the window defined by the partition expression.

In this example, we’ll list each sale in the database; adding a column at the end which will list all of the salespeople who made a sale on that particular day.

SELECT sale_date,
  LISTAGG(salesperson, ', ') OVER (PARTITION BY sale_date)
FROM public.sales
ORDER BY sale_date, sale_id;

This emits a dataset as follows:

sale_date listagg
2018-05-02 Bob
2018-05-13 Sally, June
2018-05-13 Sally, June
2018-05-14 John
2018-05-25 Bob, Sally
2018-05-25 Bob, Sally
2018-05-26 John
2018-05-27 John, June, June
2018-05-27 John, June, June
2018-05-27 John, June, June
2018-06-02 Sally
2018-06-03 John, John
2018-06-03 John, John
2018-06-12 John
2018-06-13 Bob, Sally
2018-06-13 Bob, Sally
2018-06-15 John
2018-06-24 Bob, Sally
2018-06-24 Bob, Sally
2018-06-29 John

Window Function LEAD

This article is an extension of the window function, to follow up on each of the window functions.

The LEAD function returns values in an offset row. The offset is given by the second argument to the function itself. The offset pushes the index below (or after) the selected row.

To demonstrate the effect of this function, we’ll take the sales_id value, and then we’ll also take a led sales_id. By adjusting the offset value, you’ll see how the leaded value ascends up the rows.

SELECT sale_id,
  LEAD(sale_id, 1) OVER (ORDER BY sale_id)
FROM public.sales
ORDER BY sale_date, sale_id
LIMIT 5;
sale_id lead
1 2
2 3
3 4
4 5
5 6

Adjusting the index so that it’s now 2; the lead row is now offset by 2.

sale_id lead
1 3
2 4
3 5
4 6
5 7

Window Function LAG

This article is an extension of the window function, to follow up on each of the window functions.

The LAG function returns values in an offset row. The offset is given by the second argument to the function itself. The offset pushes the index above (or before) the selected row.

To demonstrate the effect of this function, we’ll take the sales_id value, and then we’ll also take a lagged sales_id. By adjusting the offset value, you’ll see how the lagged value slips down the rows.

SELECT sale_id,
  LAG(sale_id, 1) OVER (ORDER BY sale_id)
FROM public.sales
ORDER BY sale_date, sale_id
LIMIT 5;

Note how the first row here doesn’t have a value, as there isn’t a row above row 1.

sale_id lag
1  
2 1
3 2
4 3
5 4

Adjusting the index so that it’s now 2, youll see that both rows 1 and 2 now don’t have a value.

sale_id lag
1  
2  
3 1
4 2
5 3

Windowing functions in Redshift

Introduction

Window functions allow database developers to perform analysis over partitions of information, very quickly. Prior to window functions, developers would need to create sub-queries (or common table expressions) that would allow their windows to be created. Primary queries would then work over the sub-queries to propgate a result. This now goes away, using the OVER syntax.

In today’s article we’ll walk through the basic structure of how to use window functions.

Structure

As documented by AWS,

the standard window function syntax is:

function (expression) OVER (
[ PARTITION BY expr_list ]
[ ORDER BY order_list [ frame_clause ] ] )

where function is one of the functions described in this section and expr_list is:

expression | column_name [, expr_list ]

and order_list is:

expression | column_name [ ASC | DESC ]
[ NULLS FIRST | NULLS LAST ]
[, order_list ]

and frame_clause is:

ROWS
{ UNBOUNDED PRECEDING | unsigned_value PRECEDING | CURRENT ROW } |

{BETWEEN
{ UNBOUNDED PRECEDING | unsigned_value { PRECEDING | FOLLOWING } |
CURRENT ROW}
AND
{ UNBOUNDED FOLLOWING | unsigned_value { PRECEDING | FOLLOWING } |
CURRENT ROW }}

The function gets applied to an expression of columns that you’re extracting from your query. The amount of information that the function is applied to is controlled by PARTITION BY. How the information is ordered within the window is then controlled by the ORDER BY.

This is all text-booky kind of stuff, so let’s apply it to some actual data in the database.

Sample Data

For today’s article, lets consider a really simple sales breakdown. No relationships to other table; everything is locallised and contained in here (no that it needs to be - it just makes the example simpler).

CREATE TABLE public.sales (
  sale_id			INT PRIMARY KEY,
  sale_date			DATE NOT NULL,
  salesperson		VARCHAR(128) NOT NULL,
  quantity			INTEGER,
  unit_cost			DECIMAL,
  product_code		VARCHAR(32) NOT NULL
);

I’ve filled the table with some information, spanning two months.

sale_id sale_date salesperson quantity unit_cost product_code
1 2018-05-02 Bob 1 26 VAC01
2 2018-05-13 Sally 3 20 BROOM01
3 2018-05-13 June 1 156 MOW01
4 2018-05-14 John 1 96 TRIM01
5 2018-05-25 Bob 2 96 TRIM01
6 2018-05-25 Sally 5 26 VAC01
7 2018-05-26 John 1 156 MOW01
8 2018-05-27 John 2 26 VAC01
9 2018-05-27 June 1 20 BROOM01
10 2018-05-27 June 3 156 MOW01
11 2018-06-02 Sally 1 26 VAC01
12 2018-06-03 John 3 20 BROOM01
13 2018-06-03 John 1 156 MOW01
14 2018-06-12 John 1 96 TRIM01
15 2018-06-13 Bob 2 96 TRIM01
16 2018-06-13 Sally 5 26 VAC01
17 2018-06-15 John 1 156 MOW01
18 2018-06-24 Bob 2 26 VAC01
19 2018-06-24 Sally 1 20 BROOM01
20 2018-06-29 John 3 156 MOW01

Now that we have some information to work with, let’s look at a simple example.

Simplicity

In this first example, we’ll demonstrate the monthly averages for all sales people; and compare each instance of a sale to it.

Traditionally, we could solve this issue by employing a CTE to perform our aggregation, and join to this calculated set from our main query:

WITH avg_month AS (
  SELECT date_part('month', sale_date) as month,
  		 date_part('year', sale_date) as year,
  		 avg(quantity * unit_cost) as average
  FROM public.sales
  GROUP BY date_part('month', sale_date), date_part('year', sale_date)
)
SELECT salesperson,
	   sale_date,
       quantity * unit_cost,
       avg_month.average AS month_avg
FROM public.sales
INNER JOIN avg_month ON avg_month.month = date_part('month', sale_date)
AND						avg_month.year = date_part('year', sale_date)
ORDER BY salesperson, sale_date DESC;

The output of which starts to look like this:

salesperson sale_date ?column? month_avg
Bob 2018-06-24 52 135
Bob 2018-06-13 192 135
Bob 2018-05-25 192 135
Bob 2018-05-02 26 135
John 2018-06-29 468 135
John 2018-06-15 156 135
John 2018-06-12 96 135
John 2018-06-03 156 135
John 2018-06-03 60 135
John 2018-05-27 52 135
John 2018-05-26 156 135
John 2018-05-14 96 135
June 2018-05-27 468 135
June 2018-05-27 20 135
June 2018-05-13 156 135
Sally 2018-06-24 20 135
Sally 2018-06-13 130 135
Sally 2018-06-02 26 135
Sally 2018-05-25 130 135
Sally 2018-05-13 60 135

We can simplify the query above greatly with the use of window functions.

SELECT salesperson, sale_date, (quantity * unit_cost),
	   avg(quantity * unit_cost) OVER (
         PARTITION BY date_part('month', sale_date), date_part('year', sale_date)
       ) as month_avg
FROM public.sales
ORDER BY salesperson, sale_date DESC;

This offers the same result, with a much cleaner query experience.

To look into this data a little further, and perform the same operation over the salesperson’s averages, we simply swap out the PARTITION BY expression:

SELECT salesperson, sale_date, (quantity * unit_cost),
	   avg(quantity * unit_cost) OVER (
         PARTITION BY salesperson
       ) as month_avg
FROM public.sales
ORDER BY salesperson, sale_date DESC;

This now tells us how each of the saleperson’s sales performed against their average.

Functions

With a basic understanding of the windowing function framework and how it can be applied to your queries, any of the supported functions can be use. The actual documentation of these functions can be found in the AWS documentation for Redshift, so I won’t reproduce that material here.

The list of functions are:

  • AVG
  • COUNT
  • CUME_DIST
  • DENSE_RANK
  • FIRST_VALUE
  • LAST_VALUE
  • LAG
  • LEAD
  • LISTAGG
  • MAX
  • MEDIAN
  • MIN
  • NTH_VALUE
  • NTILE
  • PERCENT_RANK
  • PERCENTILE_CONT
  • PERCENTILE_DISC
  • RANK
  • RATIO_TO_REPORT
  • ROW_NUMBER
  • STDDEV_SAMP
  • STDDEV_POP
  • SUM
  • VAR_SAMP
  • VAR_POP

In future articles I’ll cover these functions and how they can be used to perform analysis over your datasets.