SQL Examples: Window Functions¶

This notebook explores advanced SQL window functions using the Chinook database.

Through seven progressively structured prompts, it focuses on mastering the use of ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, and cumulative aggregates for real-world analytical tasks.


Setup¶

In [1]:
from pathlib import Path
from IPython.display import Image, display
import pandas as pd

# project root
repo_root = Path().resolve().parent  

# database path relative to project root
DB_PATH = repo_root / "data" / "Chinook_Sqlite.sqlite"

assert DB_PATH.exists(), f"Database file not found: {DB_PATH}"

# SQLAlchemy engine
from sqlalchemy import create_engine
engine = create_engine(f"sqlite:///{DB_PATH}")

# image path relative to repo root
image_path = repo_root / "input" / "ChinookDBStructure.png"

# check it exists
assert image_path.exists(), f"Image file not found: {image_path}"

# display
display(Image(filename=str(image_path), width=600, height=400))
No description has been provided for this image

Query Examples¶

1) Assign a rank to customers based on total revenue.¶

In [22]:
# aggregating over CustomerId with sum to calculate each customer's total_revenue
# using DESNSE_RANK to rank customers by their total revenue from invoices

query = """

SELECT 
     c.CustomerId,
     TRIM(c.FirstName || ' ' || c.LastName) AS customer_name,
     COALESCE(SUM(i.Total),0) AS total_revenue,
     DENSE_RANK() OVER(ORDER BY COALESCE(SUM(i.Total), 0) DESC) AS total_revenue_rank
FROM customer AS c
LEFT JOIN invoice AS i
  ON c.CustomerId = i.CustomerId
GROUP BY c.CustomerId, c.FirstName, c.LastName
ORDER BY total_revenue DESC ;

"""
df = pd.read_sql_query(query, engine)
df.head(10) # showing top ten
Out[22]:
CustomerId customer_name total_revenue total_revenue_rank
0 6 Helena Holý 49.62 1
1 26 Richard Cunningham 47.62 2
2 57 Luis Rojas 46.62 3
3 45 Ladislav Kovács 45.62 4
4 46 Hugh O'Reilly 45.62 4
5 24 Frank Ralston 43.62 5
6 28 Julia Barnett 43.62 5
7 37 Fynn Zimmermann 43.62 5
8 7 Astrid Gruber 42.62 6
9 25 Victor Stevens 42.62 6
In [21]:
# CTE version for strict SQL environements

query = """

WITH customer_invoice_totals AS(
SELECT 
     c.CustomerId,
     TRIM(c.FirstName || ' ' || c.LastName) AS customer_name,
     COALESCE(SUM(i.Total),0) AS total_revenue   
FROM customer AS c
LEFT JOIN invoice AS i
  ON c.CustomerId = i.CustomerId
GROUP BY c.CustomerId, c.FirstName, c.LastName
)
SELECT 
     CustomerId,
     customer_name,
     total_revenue,
     DENSE_RANK() OVER(ORDER BY total_revenue DESC) AS total_revenue_rank
FROM customer_invoice_totals
ORDER BY total_revenue DESC ;

"""
df = pd.read_sql_query(query, engine)
df.head(10) # showing top ten
Out[21]:
CustomerId customer_name total_revenue total_revenue_rank
0 6 Helena Holý 49.62 1
1 26 Richard Cunningham 47.62 2
2 57 Luis Rojas 46.62 3
3 45 Ladislav Kovács 45.62 4
4 46 Hugh O'Reilly 45.62 4
5 24 Frank Ralston 43.62 5
6 28 Julia Barnett 43.62 5
7 37 Fynn Zimmermann 43.62 5
8 7 Astrid Gruber 42.62 6
9 25 Victor Stevens 42.62 6

2) Compute cumulative revenue over time per customer.¶

In [9]:
# cte with revenue by year followed with a cumulative sum windows function

query = """

WITH customer_revenue_by_year AS(
    SELECT
         TRIM(c.FirstName || ' ' || c.LastName) AS customer_name,
         CAST(strftime('%Y', i.InvoiceDate) AS INTEGER) AS year, 
         COALESCE(SUM(i.Total), 0) AS total_revenue
FROM customer AS c
JOIN invoice AS i
  ON c.CustomerId = i.CustomerId
GROUP BY strftime('%Y', i.InvoiceDate), c.FirstName, c.LastName
)
SELECT 
     customer_name,
     year,
     total_revenue,
     SUM(total_revenue) OVER(PARTITION BY customer_name ORDER BY year) AS cumulative_revenue
FROM customer_revenue_by_year
ORDER BY customer_name, year

"""
df = pd.read_sql_query(query, engine)
df.head(10) # showing top ten
Out[9]:
customer_name year total_revenue cumulative_revenue
0 Aaron Mitchell 2021 15.84 15.84
1 Aaron Mitchell 2022 8.91 24.75
2 Aaron Mitchell 2023 1.98 26.73
3 Aaron Mitchell 2024 9.90 36.63
4 Aaron Mitchell 2025 0.99 37.62
5 Alexandre Rocha 2021 15.84 15.84
6 Alexandre Rocha 2022 8.91 24.75
7 Alexandre Rocha 2024 11.88 36.63
8 Alexandre Rocha 2025 0.99 37.62
9 Astrid Gruber 2021 1.98 1.98

3) Find the previous invoice date for each customer (LAG).¶

In [10]:
query = """

SELECT 
     c.CustomerId,
     TRIM(c.FirstName || ' ' || c.LastName) AS customer_name,
     i.InvoiceDate,
     LAG(i.InvoiceDate) OVER(
       PARTITION BY c.CustomerId 
       ORDER BY i.InvoiceDate, i.InvoiceId
     ) AS previous_invoice_date
FROM customer AS c
    JOIN invoice AS i
      ON c.CustomerId = i.CustomerId
ORDER BY c.CustomerId, c.FirstName, c.LastName, i.InvoiceDate ;

"""
df = pd.read_sql_query(query, engine)
df.head(10) # showing top ten
Out[10]:
CustomerId customer_name InvoiceDate previous_invoice_date
0 1 Luís Gonçalves 2022-03-11 00:00:00 None
1 1 Luís Gonçalves 2022-06-13 00:00:00 2022-03-11 00:00:00
2 1 Luís Gonçalves 2022-09-15 00:00:00 2022-06-13 00:00:00
3 1 Luís Gonçalves 2023-05-06 00:00:00 2022-09-15 00:00:00
4 1 Luís Gonçalves 2024-10-27 00:00:00 2023-05-06 00:00:00
5 1 Luís Gonçalves 2024-12-07 00:00:00 2024-10-27 00:00:00
6 1 Luís Gonçalves 2025-08-07 00:00:00 2024-12-07 00:00:00
7 2 Leonie Köhler 2021-01-01 00:00:00 None
8 2 Leonie Köhler 2021-02-11 00:00:00 2021-01-01 00:00:00
9 2 Leonie Köhler 2021-10-12 00:00:00 2021-02-11 00:00:00

4) Show the next track each customer purchased (LEAD).¶

5) Compute running totals of revenue per month.¶

6) Rank tracks within each genre by price.¶

7) Show the percentile rank of customers by total purchase amount.¶

In [ ]: