This article is an extension of the window functions article, following up on each of the window functions in more detail.
The LAG function returns the value of a row that sits at a given offset before the current row. The offset is given by the second argument to the function, and it moves the index above (before) the selected row.
To demonstrate the effect of this function, we'll take the sale_id value, and then we'll also take a lagged sale_id. By adjusting the offset value, you'll see how the lagged value slips down the rows.
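As a sketch (assuming the sales table that's introduced later in this series, with the window ordered by sale_id), the query takes this shape:

    -- sale_id alongside the sale_id of the row before it
    SELECT sale_id,
           LAG(sale_id, 1) OVER (ORDER BY sale_id) AS lag
    FROM sales
    ORDER BY sale_id;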
Note how the first row here doesn’t have a value, as there isn’t a row above row 1.
| sale_id | lag |
|---------|-----|
| 1       |     |
| 2       | 1   |
| 3       | 2   |
| 4       | 3   |
| 5       | 4   |
Adjusting the offset so that it's now 2, you'll see that both rows 1 and 2 now don't have a value.
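Sketched out, only the offset argument changes:

    -- offset of 2: look two rows back
    SELECT sale_id,
           LAG(sale_id, 2) OVER (ORDER BY sale_id) AS lag
    FROM sales
    ORDER BY sale_id;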
Window functions allow database developers to perform analysis over partitions of information, very quickly. Prior to window functions, developers would need to create sub-queries (or common table expressions) just to define their windows; the primary query would then work over those sub-queries to produce a result. This all goes away with the OVER syntax.
In today's article we'll walk through the basic structure of a window function call.
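The general shape of a window function call (paraphrased here from the Redshift syntax reference; treat it as a guide rather than the exact grammar) is:

    function (expression) OVER (
        [ PARTITION BY expr_list ]
        [ ORDER BY order_list [ frame_clause ] ]
    )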
where function is one of the functions described in this section and expr_list is:
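    expression | column_name [, expr_list ]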
and order_list is:
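    expression | column_name [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, order_list ]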
and frame_clause is:
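    ROWS
    { UNBOUNDED PRECEDING | unsigned_value PRECEDING | CURRENT ROW } |
    { BETWEEN
        { UNBOUNDED PRECEDING | unsigned_value { PRECEDING | FOLLOWING } | CURRENT ROW }
      AND
        { UNBOUNDED FOLLOWING | unsigned_value { PRECEDING | FOLLOWING } | CURRENT ROW } }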
The function is applied to an expression built from the columns that you're extracting in your query. The set of rows that the function is applied to is controlled by PARTITION BY, and how those rows are ordered within the window is controlled by ORDER BY.
This is all text-booky kind of stuff, so let’s apply it to some actual data in the database.
Sample Data
For today's article, let's consider a really simple sales breakdown. There are no relationships to other tables; everything is localised and contained in here (not that it needs to be - it just makes the example simpler).
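A table definition along these lines is enough to follow along (the table name and column types are assumptions; only the column names come from the sample data below):

    -- assumed definition of the sample table
    CREATE TABLE sales (
        sale_id      INTEGER,
        sale_date    DATE,
        salesperson  VARCHAR(32),
        quantity     INTEGER,
        unit_cost    INTEGER,
        product_code VARCHAR(16)
    );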
I’ve filled the table with some information, spanning two months.
| sale_id | sale_date  | salesperson | quantity | unit_cost | product_code |
|---------|------------|-------------|----------|-----------|--------------|
| 1       | 2018-05-02 | Bob         | 1        | 26        | VAC01        |
| 2       | 2018-05-13 | Sally       | 3        | 20        | BROOM01      |
| 3       | 2018-05-13 | June        | 1        | 156       | MOW01        |
| 4       | 2018-05-14 | John        | 1        | 96        | TRIM01       |
| 5       | 2018-05-25 | Bob         | 2        | 96        | TRIM01       |
| 6       | 2018-05-25 | Sally       | 5        | 26        | VAC01        |
| 7       | 2018-05-26 | John        | 1        | 156       | MOW01        |
| 8       | 2018-05-27 | John        | 2        | 26        | VAC01        |
| 9       | 2018-05-27 | June        | 1        | 20        | BROOM01      |
| 10      | 2018-05-27 | June        | 3        | 156       | MOW01        |
| 11      | 2018-06-02 | Sally       | 1        | 26        | VAC01        |
| 12      | 2018-06-03 | John        | 3        | 20        | BROOM01      |
| 13      | 2018-06-03 | John        | 1        | 156       | MOW01        |
| 14      | 2018-06-12 | John        | 1        | 96        | TRIM01       |
| 15      | 2018-06-13 | Bob         | 2        | 96        | TRIM01       |
| 16      | 2018-06-13 | Sally       | 5        | 26        | VAC01        |
| 17      | 2018-06-15 | John        | 1        | 156       | MOW01        |
| 18      | 2018-06-24 | Bob         | 2        | 26        | VAC01        |
| 19      | 2018-06-24 | Sally       | 1        | 20        | BROOM01      |
| 20      | 2018-06-29 | John        | 3        | 156       | MOW01        |
Now that we have some information to work with, let’s look at a simple example.
Simplicity
In this first example, we'll calculate the monthly average sale value across all salespeople, and compare each individual sale against it.
Traditionally, we could solve this issue by employing a CTE to perform our aggregation, and join to this calculated set from our main query:
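As a sketch (using the assumed sales table from above), the CTE approach looks something like this:

    -- aggregate the monthly average in a CTE, then join it back to the detail rows
    WITH monthly_average AS (
        SELECT DATE_TRUNC('month', sale_date) AS sale_month,
               AVG(quantity * unit_cost)      AS month_avg
        FROM sales
        GROUP BY DATE_TRUNC('month', sale_date)
    )
    SELECT s.salesperson,
           s.sale_date,
           s.quantity * s.unit_cost,
           m.month_avg
    FROM sales s
    INNER JOIN monthly_average m
            ON DATE_TRUNC('month', s.sale_date) = m.sale_month
    ORDER BY s.salesperson, s.sale_date DESC;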
The output of which starts to look like this:
| salesperson | sale_date  | ?column? | month_avg |
|-------------|------------|----------|-----------|
| Bob         | 2018-06-24 | 52       | 135       |
| Bob         | 2018-06-13 | 192      | 135       |
| Bob         | 2018-05-25 | 192      | 135       |
| Bob         | 2018-05-02 | 26       | 135       |
| John        | 2018-06-29 | 468      | 135       |
| John        | 2018-06-15 | 156      | 135       |
| John        | 2018-06-12 | 96       | 135       |
| John        | 2018-06-03 | 156      | 135       |
| John        | 2018-06-03 | 60       | 135       |
| John        | 2018-05-27 | 52       | 135       |
| John        | 2018-05-26 | 156      | 135       |
| John        | 2018-05-14 | 96       | 135       |
| June        | 2018-05-27 | 468      | 135       |
| June        | 2018-05-27 | 20       | 135       |
| June        | 2018-05-13 | 156      | 135       |
| Sally       | 2018-06-24 | 20       | 135       |
| Sally       | 2018-06-13 | 130      | 135       |
| Sally       | 2018-06-02 | 26       | 135       |
| Sally       | 2018-05-25 | 130      | 135       |
| Sally       | 2018-05-13 | 60       | 135       |
We can simplify the query above greatly with the use of window functions.
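A sketch of the window function version (over the same assumed table):

    -- the monthly average is now computed per-row by the window
    SELECT salesperson,
           sale_date,
           quantity * unit_cost,
           AVG(quantity * unit_cost) OVER (
               PARTITION BY DATE_TRUNC('month', sale_date)
           ) AS month_avg
    FROM sales
    ORDER BY salesperson, sale_date DESC;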
This offers the same result, with a much cleaner query experience.
To look into this data a little further, and perform the same operation over the salesperson’s averages, we simply swap out the PARTITION BY expression:
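Sketched out, only the window changes:

    -- the average is now taken per salesperson rather than per month
    SELECT salesperson,
           sale_date,
           quantity * unit_cost,
           AVG(quantity * unit_cost) OVER (
               PARTITION BY salesperson
           ) AS salesperson_avg
    FROM sales
    ORDER BY salesperson, sale_date DESC;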
This now tells us how each salesperson's sales performed against their own average.
Functions
With a basic understanding of the window function framework and how it can be applied to your queries, any of the supported functions can be used. The actual documentation of these functions can be found in the AWS documentation for Redshift, so I won't reproduce that material here.
The list of functions is:
AVG
COUNT
CUME_DIST
DENSE_RANK
FIRST_VALUE
LAST_VALUE
LAG
LEAD
LISTAGG
MAX
MEDIAN
MIN
NTH_VALUE
NTILE
PERCENT_RANK
PERCENTILE_CONT
PERCENTILE_DISC
RANK
RATIO_TO_REPORT
ROW_NUMBER
STDDEV_SAMP
STDDEV_POP
SUM
VAR_SAMP
VAR_POP
In future articles I’ll cover these functions and how they can be used to perform analysis over your datasets.
This article is an extension of the window functions article, following up on each of the window functions in more detail.
The FIRST_VALUE and LAST_VALUE functions will allow you to retrieve the first and last values for a given window (respectively).
Taking the sales example for another run, we can compare every single sale for the day to see how it stacked up against the first sale of the day with the following query:
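A sketch of such a query follows (the sales table and the sale_number alias are assumptions chosen to match the output below; the frame clause is included because Redshift expects one when the window carries an ORDER BY):

    -- compare each sale's value to the first sale made on the same day
    SELECT sale_date,
           salesperson,
           quantity * unit_cost,
           FIRST_VALUE(quantity * unit_cost) OVER (
               PARTITION BY sale_date
               ORDER BY sale_id
               ROWS UNBOUNDED PRECEDING
           ) AS sale_number
    FROM sales
    ORDER BY sale_date, sale_id;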
The result of which is as you’d expect:
| sale_date  | salesperson | ?column? | sale_number |
|------------|-------------|----------|-------------|
| 2018-05-02 | Bob         | 26       | 26          |
| 2018-05-13 | Sally       | 60       | 60          |
| 2018-05-13 | June        | 156      | 60          |
| 2018-05-14 | John        | 96       | 96          |
| 2018-05-25 | Bob         | 192      | 192         |
| 2018-05-25 | Sally       | 130      | 192         |
| 2018-05-26 | John        | 156      | 156         |
| 2018-05-27 | John        | 52       | 52          |
| 2018-05-27 | June        | 20       | 52          |
| 2018-05-27 | June        | 468      | 52          |
| 2018-06-02 | Sally       | 26       | 26          |
| 2018-06-03 | John        | 60       | 60          |
| 2018-06-03 | John        | 156      | 60          |
| 2018-06-12 | John        | 96       | 96          |
| 2018-06-13 | Bob         | 192      | 192         |
| 2018-06-13 | Sally       | 130      | 192         |
| 2018-06-15 | John        | 156      | 156         |
| 2018-06-24 | Bob         | 52       | 52          |
| 2018-06-24 | Sally       | 20       | 52          |
| 2018-06-29 | John        | 468      | 468         |
This demonstrates how the first value can be drawn out of a set, per window.
Conversely, you simply use the LAST_VALUE function to get the last value; if you wanted to see how each sale stacked up against the last sale of the day:
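Sketched with the same assumptions, the frame is extended to the end of the partition so that the last row of the day is picked up:

    -- compare each sale's value to the last sale made on the same day
    SELECT sale_date,
           salesperson,
           quantity * unit_cost,
           LAST_VALUE(quantity * unit_cost) OVER (
               PARTITION BY sale_date
               ORDER BY sale_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_sale
    FROM sales
    ORDER BY sale_date, sale_id;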
This article is an extension of the window functions article, following up on each of the window functions in more detail.
The DENSE_RANK function uses the window's ORDER BY expression to determine a rank for each value amongst its group. If a PARTITION BY is present, the ranking is also reset at each partition boundary.
The important difference between RANK and DENSE_RANK is that RANK uses the total number of rows to determine its highest rank value (so ties leave gaps in the sequence), whereas DENSE_RANK only uses the number of distinct ranked values, so its highest rank is the number of ranks calculated and no gaps appear.
Take the following for example:
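A sketch of such a query (again over the assumed sales table, ranking ascending by quantity):

    -- RANK and DENSE_RANK side by side over the same ordering
    SELECT sale_id,
           quantity,
           unit_cost,
           RANK()       OVER (ORDER BY quantity) AS qty_rank,
           DENSE_RANK() OVER (ORDER BY quantity) AS qty_dense_rank
    FROM sales
    ORDER BY quantity, sale_id;

With the sample data, the ten rows with a quantity of 1 all share the first rank; the rows with a quantity of 2 then receive a RANK of 11 (ten rows precede them) but a DENSE_RANK of only 2.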
The sale_id, quantity, and unit_cost are retrieved and ranked by the quantity value. Both RANK and DENSE_RANK are used in this example to illustrate the difference in value distribution (along the number line).
As the quantity changes, a new rank value is assigned: RANK jumps ahead by the number of rows that shared the previous rank, whereas DENSE_RANK simply increments to the next integer.
This article is an extension of the window functions article, following up on each of the window functions in more detail.
The CUME_DIST function calculates the cumulative distribution of a value within its window: the fraction of rows in the window whose value is less than or equal to the current row's value.
An example that calculates, for each salesperson, the cumulative distribution of the value of each sale they've made would look like this:
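A sketch of that query (partitioned per salesperson, ordered by the value of each sale):

    -- cumulative distribution of each sale's value within a salesperson's sales
    SELECT salesperson,
           quantity * unit_cost AS value,
           CUME_DIST() OVER (
               PARTITION BY salesperson
               ORDER BY quantity * unit_cost
           ) AS distr
    FROM sales
    ORDER BY salesperson, value;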
Because salespeople have sold the same amount across different sales, you can see that the cumulative distribution value moves in jumps rather than in even steps.
| salesperson | value | distr              |
|-------------|-------|--------------------|
| Bob         | 26    | 0.25               |
| Bob         | 52    | 0.5                |
| Bob         | 192   | 1.0                |
| Bob         | 192   | 1.0                |
| John        | 52    | 0.125              |
| John        | 60    | 0.25               |
| John        | 96    | 0.5                |
| John        | 96    | 0.5                |
| John        | 156   | 0.875              |
| John        | 156   | 0.875              |
| John        | 156   | 0.875              |
| John        | 468   | 1.0                |
| June        | 20    | 0.3333333333333333 |
| June        | 156   | 0.6666666666666666 |
| June        | 468   | 1.0                |
| Sally       | 20    | 0.2                |
| Sally       | 26    | 0.4                |
| Sally       | 60    | 0.6                |
| Sally       | 130   | 1.0                |
| Sally       | 130   | 1.0                |
Taking an interesting piece of this output, we'll focus on John.
| salesperson | value | distr |
|-------------|-------|-------|
| John        | 52    | 0.125 |
| John        | 60    | 0.25  |
| John        | 96    | 0.5   |
| John        | 96    | 0.5   |
| John        | 156   | 0.875 |
| John        | 156   | 0.875 |
| John        | 156   | 0.875 |
| John        | 468   | 1.0   |
Repeated amounts don't shift the distribution value between themselves (because the distribution is cumulative, each tied row counts every row up to and including its ties). Once the next distinct amount appears, there's a significant jump. This is a property of this particular set because the data is ordered by the value column.