This article is an extension of the window functions article, following up on each of the window functions in more detail.
The DENSE_RANK function uses the ORDER BY expression to determine a rank for each value within a group. If a PARTITION BY is present, the rank value resets at the start of each group.
The important difference between RANK and DENSE_RANK is that RANK uses the total number of rows to nominate its highest rank value, whereas DENSE_RANK only uses the number of distinct ranks calculated, so its rank values are contiguous with no gaps.
The sale_id, quantity, and unit_cost are retrieved and ranked by the quantity value. Both RANK and DENSE_RANK are used in this example to illustrate the difference in how their values are distributed along the number line.
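A query along these lines (sketched here, assuming a sales table with sale_id, quantity, and unit_cost columns) produces both rankings side by side:

```sql
-- Assumed sales table: sale_id, quantity, unit_cost.
SELECT sale_id,
       quantity,
       unit_cost,
       RANK()       OVER (ORDER BY quantity) AS rank,
       DENSE_RANK() OVER (ORDER BY quantity) AS dense_rank
FROM   sales;
```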
As the quantity changes, a new rank value is assigned.
Because salespeople have sold the same amount across different sales, you can see that the cumulative distribution value makes jumps as it accumulates.
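A CUME_DIST window over each salesperson's sales produces the distribution below. This is a sketch under the assumption that value is the sale amount, quantity * unit_cost:

```sql
-- Assumed sales table: salesperson, quantity, unit_cost.
SELECT salesperson,
       quantity * unit_cost AS value,
       CUME_DIST() OVER (
         PARTITION BY salesperson
         ORDER BY quantity * unit_cost
       ) AS distr
FROM   sales;
```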
| salesperson | value | distr |
|-------------|-------|-------|
| Bob         | 26    | 0.25  |
| Bob         | 52    | 0.5   |
| Bob         | 192   | 1.0   |
| Bob         | 192   | 1.0   |
| John        | 52    | 0.125 |
| John        | 60    | 0.25  |
| John        | 96    | 0.5   |
| John        | 96    | 0.5   |
| John        | 156   | 0.875 |
| John        | 156   | 0.875 |
| John        | 156   | 0.875 |
| John        | 468   | 1.0   |
| June        | 20    | 0.3333333333333333 |
| June        | 156   | 0.6666666666666666 |
| June        | 468   | 1.0   |
| Sally       | 20    | 0.2   |
| Sally       | 26    | 0.4   |
| Sally       | 60    | 0.6   |
| Sally       | 130   | 1.0   |
| Sally       | 130   | 1.0   |
Taking an interesting piece of this window, we focus in on John.
| salesperson | value | distr |
|-------------|-------|-------|
| John        | 52    | 0.125 |
| John        | 60    | 0.25  |
| John        | 96    | 0.5   |
| John        | 96    | 0.5   |
| John        | 156   | 0.875 |
| John        | 156   | 0.875 |
| John        | 156   | 0.875 |
| John        | 468   | 1.0   |
The repeated value amounts don't shift the distribution value (because the distribution is cumulative). Once the next distinct amount appears, there's a significant jump. This is a property of this result set because the data is ordered by the value column.
We list the sale_date and salesperson for every sale on record. The COUNT partitions by the date of the sale, which makes sales_count tell us what number the sale was for that given date.
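A running COUNT within each date's partition gives us this numbering. This sketch assumes a sale_id column is available to order the sales within a day:

```sql
-- Assumed sales table: sale_id, sale_date, salesperson.
SELECT sale_date,
       salesperson,
       COUNT(*) OVER (
         PARTITION BY sale_date
         ORDER BY sale_id
         ROWS UNBOUNDED PRECEDING
       ) AS sales_count
FROM   sales;
```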
| sale_date  | salesperson | sales_count |
|------------|-------------|-------------|
| 2018-05-02 | Bob         | 1 |
| 2018-05-13 | Sally       | 1 |
| 2018-05-13 | June        | 2 |
| 2018-05-14 | John        | 1 |
| 2018-05-25 | Bob         | 1 |
| 2018-05-25 | Sally       | 2 |
| 2018-05-26 | John        | 1 |
| 2018-05-27 | John        | 1 |
| 2018-05-27 | June        | 2 |
| 2018-05-27 | June        | 3 |
| 2018-06-02 | Sally       | 1 |
| 2018-06-03 | John        | 1 |
| 2018-06-03 | John        | 2 |
| 2018-06-12 | John        | 1 |
| 2018-06-13 | Bob         | 1 |
| 2018-06-13 | Sally       | 2 |
| 2018-06-15 | John        | 1 |
| 2018-06-24 | Bob         | 1 |
| 2018-06-24 | Sally       | 2 |
| 2018-06-29 | John        | 1 |
Taking 2018-05-27, we can see that John got the first sale, and June got the second and third.
This takes the average sale value (quantity * unit_cost), but personalises the value by salesperson. Each salesperson's average is calculated in isolation because of PARTITION BY salesperson.
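A sketch of that query, again assuming the same sales table shape:

```sql
-- Assumed sales table: sale_id, salesperson, quantity, unit_cost.
SELECT sale_id,
       salesperson,
       AVG(quantity * unit_cost) OVER (
         PARTITION BY salesperson
       ) AS avg_sale_value
FROM   sales;
```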
A little while ago, I wrote an article on range generation in PostgreSQL, and whilst Redshift and PostgreSQL are similar database beasts, they are quite different in their query capabilities due to their different strengths.
In today’s article, we’ll walk through a couple of different solutions to range generation.
generate_series
The generate_series function is a pretty serious tool for achieving exactly what this article is talking about. Unfortunately, there is no analog for this function within the Redshift database management system. Who knows; maybe one day.
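For reference, this is the sort of one-liner that generate_series gives you in PostgreSQL, and which we'll have to emulate in Redshift:

```sql
-- PostgreSQL only: enumerates the integers 1 through 10.
SELECT * FROM generate_series(1, 10);
```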
The next CTE seq uses the digits CTE 4 times. Each time it's treated as an increasing power of 10: a handles the 10^0, b handles the 10^1, etc. This Cartesian effect quickly multiplies our 0-9 range up to 0-9999. If more numbers are required, we simply add another CROSS JOIN and treat it as the next power of 10.
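Put together, a sketch of the whole construction looks like this (the digits CTE is assumed to simply enumerate 0 through 9, built with plain UNION ALLs so it works in Redshift):

```sql
WITH digits AS (
  -- The values 0 through 9.
  SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL
  SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL
  SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
),
seq AS (
  -- Cross join digits four times; each alias takes a power of 10.
  SELECT a.d + (b.d * 10) + (c.d * 100) + (d.d * 1000) AS num
  FROM   digits a
  CROSS JOIN digits b
  CROSS JOIN digits c
  CROSS JOIN digits d
)
SELECT num
FROM   seq
ORDER  BY num;
```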
That's it as far as number generation goes. It's pretty easy.
I've seen other developers materialise this information into a physical table, and simply join to that table instead.
Extending into time
Applying this to some date ranges now, we use the seq CTE and a little date arithmetic. My example here just uses the numbers within seq to enumerate dates. Adding a LIMIT to this query also prevents us from utilising all 10000 values produced by seq.
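Assuming the seq CTE from the number-generation section is attached to the query, the date enumeration looks something like this (the start date here is an arbitrary choice for illustration):

```sql
SELECT DATEADD(day, num, '2018-05-01'::date) AS sale_date
FROM   seq
ORDER  BY num
LIMIT  60;  -- only enumerate 60 days, not all 10000 values
```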