A Beginner's Guide to Understanding SQL Window Functions - Part 2

Published on
21-01-2024
Author
Product Minting
Category
Interviews
https://cdn.aisys.pro/stories/a-beginners-guide-to-understanding-sql-window-functions-part-2.jpg

In the previous article, we talked about window functions — being a great tool for solving various analytical problems; they can give you access to features like advanced analytics and data manipulation without the need to write complex queries. Before you continue reading, I would recommend starting with the first part as it will help to grasp the basic idea of how window functions work in SQL.


So now, that you are familiar with the basics, let us explore more advanced SQL concepts. It may sound a bit complicated at first glance, but I will provide simple examples suitable for beginners for all the possible cases, so it will be easy to catch my drift.


Content Overview

  • Cumulative Sum
  • Ranking Window Functions
  • Use cases
  • Offset window functions
  • Key Takeaways

Cumulative Sum

We have already considered examples where over() expression either had no parameters or had a partition by parameter. Now, we'll look at the second possible parameter for over() expression — order by.


Let's request the employee ID, employee name, department, salary, and the sum of all salaries:

select
employee_id,
employee_name,
department,
salary,
sum(salary) over()
from salary


image


Now, we'll add the order by parameter to the over() expression:

select
employee_id,
employee_name,
department,
salary,
sum(salary) over(order by employee_id desc)
from salary


image


I guess we need to take a closer look at what happened here:


  1. First of all, employee_id is now sorted in descending order.


  2. In the column resulting from the application of the window function, there is now a cumulative sum.


I believe you are familiar with cumulative sum. Its essence is simple — the cumulative sum or running total means "how much so far." The definition of the cumulative sum is the sum of a given sequence that is increasing or getting bigger with more additions.


Here is what we have in our example: for the employee with the highest employee_id value, the salary is 3700, and the cumulative sum is also 3700. The second employee has a salary of 1500, and the cumulative sum is 5200. The third employee, with a salary of 2900, has a cumulative sum of 8100, and so on.


The order by parameter in the over() expression specifies the order. In the case of aggregating window functions, it determines the order for the cumulative total.


In the over() expression, both the partition by and order by attributes can be specified.

select
employee_id,
employee_name,
department,
salary,
sum(salary) over(partition by department order by employee_id desc)
from salary



image

In this case, the cumulative total will be calculated by sections.


NB! If both attributes are specified in the over() expression, the partition by always comes first, followed by the order by. For example: over(partition by department order by employee_id).


After we have discussed the cumulative sum, we need to say that perhaps it's the only type of cumulative total that is frequently used. Cumulative average and cumulative count are, on the contrary, seldom used.


Nevertheless, we will give an example of the Cumulative average calculation — it tells us the average of a series of values up to a certain point:

select
employee_id,
employee_name,
department,
salary,
avg(salary) over(order by employee_id desc)
from salary


image


Ranking Window Functions

We use the ranking window functions to determine the position of a value within a set of values. The ORDER BY expression within the OVER clause dictates the basis for ranking, with each value assigned a rank within its designated partition. When rows share identical values for the ranking criteria, they are assigned the same rank.


To see how the ranking window functions work, let's request the following columns from the salary table: employee ID, employee name, department, and salary:

select 
employee_id, 
employee_name, 
department, 
salary
from salary


image

Now, we add one more column with the window function row_number() over():

select
employee_id,
employee_name,
department,
salary,
row_number() over()
from salary


image


The window function row_number() over() has assigned numbers to the rows without changing their order. So far, this doesn't bring much value to us, does it?


But what if we want to number the rows in descending order of salary? To achieve this, we need to specify the sorting order, in other words, pass the order by parameter to the over() expression.

select
employee_id,
employee_name,
department,
salary,
row_number() over(order by salary desc)
from salary


image

We'll add the remaining ranking functions to the query for comparison:

select
employee_id,
employee_name,
department,
salary,
row_number() over(order by salary desc),
rank() over(order by salary desc),
dense_rank() over(order by salary desc),
percent_rank() over(order by salary desc),
ntile(5) over(order by salary desc)
from salary


Let's go through each ranking window function:


  1. The window function row_number() over(order by salary desc) ranks the rows in descending order of salary and assigns row numbers. Note that Annie and Tony have the same salary, but they are assigned different numbers.


  2. The window function rank() over(order by salary desc) assigns ranks in descending order of salary. It assigns the same rank for identical values, but the next value gets a new row number.


  3. The window function dense_rank() over(order by salary desc) assigns ranks in descending order of salary. It assigns the same rank for identical values.


  4. The window function percent_rank() over(order by salary desc) is the relative (percentage) rank of the current row, calculated by the formula: (rank - 1) / (total number of rows in the partition - 1).


  5. The window function ntile(5) over(order by salary desc) divides the number of rows into 5 equal parts and assigns a number to each part. The number of parts is specified inside the ntile(5) function.


NB! Unlike aggregate functions, e.g., sum(salary), ranking functions, e.g., row_number(), don't take a column inside. However, in the ntile(5) function, the number of parts is specified.

Use Cases

It's time to explore practical tasks using ranking window functions. We'll display the employee ID, employee name, department, and salary, and assign row numbers in descending order of salary.

select
employee_id,
employee_name,
department,
salary,
row_number() over(order by salary desc)
from salary


image


Sometimes, you might need to number rows in descending order of salary within departments (sections). This can be done by adding the partition by attribute to the over() expression:

select
employee_id,
employee_name,
department,
salary,
row_number() over(partition by department order by salary desc)
from salary


image

Let's make the task more challenging. We need to retain only one employee per department with the highest salary. This can be achieved using a subquery:

select *
from
(
select
employee_id,
employee_name,
department,
salary,
row_number() over(partition by department order by salary desc) as rn
from salary
) as t2
where rn = 1


image

And one more example, if we need to display three employees in each city with the highest salary, we'll do the following:

select *
from
(
select
employee_id,
employee_name,
city,
salary,
row_number() over(partition by city order by salary desc) as rn
from salary
) as t2
where rn <= 3



image

These kinds of tasks are very common, especially when you need to display a specific number of rows within sections (groups) in ascending or descending order of some attribute. In practice, I consistently use the window function row_number() over(), and, of course, dense_rank() over() as well.

Offset Window Functions

These functions allow you to return data from other rows based on their distance away from the current row. To make it more visual, let us go through the first_value(), last_value(), and nth_value() functions.

select
t1.*,
first_value(salary)over(partition by department),
last_value(salary)over(partition by department),
nth_value(salary,2)over(partition by department)
from salary as t1
order by department


image

NB! In all three window functions, it is essential to specify the partition by parameter. The order by parameter is not mandatory, but by specifying it, you can alter the order of rows within the partition. For example, in the query below, we sorted by salary within the section (department), and now the first_value is the highest salary in the section.

select
t1.*,
first_value(salary)over(partition by department order by salary decs),
last_value(salary)over(partition by department order by salary decs),
nth_value(salary,2)over(partition by department order by salary decs)
from salary as t1
order by department


The functions first_value(salary) over(partition by department) and last_value(salary) over(partition by department) display the first and last salary values within the section (department).


In its turn, the function nth_value(salary, 2) over(partition by department) shows the second salary value within the section (department). Please note that in nth_value(), an additional argument is specified – the row number within the section. In our case, the row number is 2, so the function displays the second salary value.


Apart from the above, there are also lag() and lead() functions. The lag() function is used to get value from the row that precedes the current row. The lead() function is used to get value from a row that succeeds the current row.

select
t1.*,
lag(salary)over(order by salary),
lead(salary)over(order by salary)
from salary as t1


image
As you can see, the function lag (salary) over (order by salary) shifts the salaries down by one row, and the function lead(salary) over(order by salary) shifts the salaries up by one row. Though these functions are quite similar, I find it more convenient to use lag().

NB! For these functions, it is mandatory to specify the order by parameter in the over() expression. You can also specify partitioning by using partition by, but it is not mandatory.

select
t1.*,
lag(salary)over(partition by department order by salary)
from salary as t1
order by department


image
Here, lag() performs the same function as before, but now specifically within sections (departments).


Key Takeaways

And finally, a quick overview of what we have covered today:


  • The cumulative sum represents the running total of a sequence, accumulating with each subsequent addition.


  • Ranking window functions are used to determine the position of a value within a set of values, with the order by expression specifying the basis for ranking.


  • Offset window functions include first_value(), last_value(), and nth_value(), enabling retrieval of data from other rows based on their distance from the current row. Don't forget about lag() and lead() functions. The lag() function may be handy to get value from the row that precedes the current row, while the lead() function is used to get value from a row that succeeds the current row.


Thanks for joining me. I hope this article helps you better understand the capabilities of window functions in SQL and makes you more confident and quick in routine tasks.

Discussion (20)

Not yet any reply