When it comes to writing SQL, it’s a common predicament: You start with something clear and simple that grows and grows into a tangle of subqueries and hard to follow logic.
Beekeeper Studio Is A Free & Open Source Database GUI
Best SQL query & editor tool I have ever used. It provides everything I need to manage my database. - ⭐⭐⭐⭐⭐ Mit
Beekeeper Studio is fast, intuitive, and easy to use. Beekeeper supports loads of databases, and works great on Windows, Mac and Linux.
Common Table Expressions (CTEs) are a powerful SQL feature that allows you to organize and streamline large queries.
What Is A CTE (Common Table Expression)?
A CTE is a way of creating a sort of temporary table that only exists for the time it takes for your query to execute.
In Postgres you define a CTE using the WITH
keyword.
-- define a CTE
WITH people_who_like_cheese AS (SELECT first_name, last_name, job FROM people WHERE likes_cheese = true)
-- use the CTE like a normal table
select * from people_who_like_cheese where first_name like 'Matt%';
Not all databases support CTEs, but the most typically feature the WITH
keyword. We’ll be using Postgres for the rest of this walkthrough but much of the SQL below can be used in other databases too.
Note that a CTE isn’t a real table, and only exists for the duration of your executing query, but the output of a CTE can be re-used multiple times in your query, unlike with subqueries. This has organizational and execution-efficiency benefits.
Benefits of CTEs
CTEs shine as a way to…
- Break down your SQL problem into smaller parts.
- Keep your SQL tidy and legible
- Reuse resultsets multiple times (unlike subqueries)
A Practical CTE Example
Consider the following use case I encountered recently.
We have two tables:
- deliveries
- clients
Each client
can have many deliveries
.
Answering a simple question (no CTE required here)
We want to know: How many deliveries has each client had in the last week?
Here’s a simple SQL query to answer this question
-- delivery count grouped by date and client
SELECT d.created_at::date, c.name, count(1)
FROM deliveries d
JOIN clients c on c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
GROUP BY c.name, d.created_at::date
ORDER BY c.name DESC
The result will look something like this:
created_at | name | count |
---|---|---|
2016-06-23 | Beekeeper | 7 |
2016-06-24 | Beekeeper | 11 |
2016-06-27 | Beekeeper | 4 |
2016-06-28 | Beekeeper | 4 |
2016-06-29 | Beekeeper | 4 |
2016-06-27 | Client A | 448 |
2016-06-23 | Client B | 3 |
2016-06-24 | Client B | 3 |
2016-06-27 | Client B | 3 |
2016-06-28 | Client B | 3 |
2016-06-29 | Client B | 4 |
2016-06-28 | Client C | 21 |
2016-06-24 | Client D | 496 |
Pretty good for such a simple query.
Adapting the query to new requirements
A couple wild business requirements appear!
New Business Requirements
- We need to include dates with zero-deliveries.
- We only want the top 5 clients by volume.
- We want to exclude ourselves from the report (Beekeeper).
This shouldn’t be too hard, right?
Adding Dates With Zero Deliveries
Let’s try to tackle #1 first. We will need to start by considering every date in our daterange, instead of using only dates provided to us in the deliveries table.
It feels like there is no easy way to add this information dynamically, beyond creating a calendar
table. What a pain.
Let’s see if we can combine a CTE with a nifty Postgres function named generate_series to do exactly what we need instead of creating a new table.
-- creating a CTE for all weekdays using Postgres generate_series
WITH weekdaysToUse AS (
SELECT date
FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) date)
The nice thing about this approach is I can now use the weekdaysToUse
resultset for any subsequent CTE and the final query as well, and we’ll definitely need it.
We’re getting closer. Let’s add back the joins to deliveries and clients, then run the query and see what we get.
WITH weekdaysToUse AS (
SELECT date
FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) date
)
SELECT wtu.date, c.name, count(1) as cnt
FROM weekdaysToUse wtu
LEFT JOIN deliveries d ON wtu.date = d.created_at::date
JOIN clients c ON c.id = d.client_id
GROUP BY wtu.date, c.name
Result:
date | name | cnt |
---|---|---|
2016-06-28 | Client A | 4 |
2016-06-21 | Client B | 40 |
2016-06-21 | Client C | 11 |
2016-06-23 | Client B | 3 |
2016-06-22 | Client A | 4 |
2016-06-27 | Beekeeper | 448 |
2016-06-29 | Client A | 4 |
2016-06-21 | Client A | 5 |
2016-06-24 | Client B | 3 |
2016-06-24 | Client C | 496 |
2016-06-28 | Client A | 21 |
2016-06-27 | Client A | 4 |
2016-06-29 | Beekeeper | 4 |
2016-06-24 | Client A | 11 |
2016-06-21 | Beekeeper | 3 |
2016-06-27 | Beekeeper | 3 |
2016-06-28 | Beekeeper | 3 |
2016-06-23 | Client A | 7 |
2016-06-22 | Beekeeper | 4 |
Wait a sec, where are my zero delivery rows?
We started the query off with weekdaysToUse, but will only get a row returned where we had at least a delivery from a client for that day. Shoot.
We need a way to fill in the rows where a client did not have a delivery for that day. If we had that, we could combine it with the results from the previous, and we’d be set.
Maybe we can try to use what we’ve learned so far and build up to it with CTEs.
-- returns a row for each date/client pair. We can use this to fill the gaps
WITH gapFiller AS (
SELECT wtu.date, c.name, 0 as cnt
FROM weekdaysToUse wtu, clients c
WHERE NOT EXISTS(
SELECT 1
FROM deliveries d
WHERE d.created_at = wtu.date
and d.client_id = c.id
)
)
date | name | cnt |
---|---|---|
2016-06-21 | Client A | 0 |
2016-06-22 | Client A | 0 |
2016-06-23 | Client A | 0 |
2016-06-24 | Client A | 0 |
2016-06-25 | Client A | 0 |
2016-06-26 | Client A | 0 |
2016-06-27 | Client A | 0 |
2016-06-28 | Client A | 0 |
2016-06-29 | Client A | 0 |
2016-06-21 | Client B | 0 |
2016-06-22 | Client B | 0 |
2016-06-23 | Client B | 0 |
2016-06-24 | Client B | 0 |
2016-06-25 | Client B | 0 |
2016-06-26 | Client B | 0 |
2016-06-27 | Client B | 0 |
2016-06-28 | Client B | 0 |
2016-06-29 | Client B | 0 |
… |
If we take that resultset and UNION ALL
the rows that weren’t zero earlier, we’ll be all set.
2 & 3. Fetching Only The Top 5 Clients That Aren’t Us
Remember our business requirement about top clients that aren’t us?
Here is a subquery solution,
SELECT d.created_at::date, c.name, COUNT(1)
FROM deliveries d
JOIN clients c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.id IN (
SELECT c1.id FROM clients c1
JOIN deliveries d1 ON c1.id = d1.client_id
WHERE d1.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c1.name <> 'Beekeeper'
GROUP BY c1.id
ORDER BY count(d1.id) DESC
LIMIT 5
)
group by c.name, d.created_at::date
order by c.name desc
Contrasted with the CTE solution.
WITH topFiveClientsThisWeekThatArentUs AS (
SELECT c.id, c.name FROM clients c
JOIN deliveries d ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.name <> 'Beekeeper'
GROUP BY c.id
ORDER BY COUNT(d.id) DESC
LIMIT 5
)
SELECT d.created_at::date, c.name, count(1)
FROM deliveries d
JOIN topFiveClientsThisWeekThatArentUs c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
GROUP BY c.name, d.created_at::date
ORDER BY c.name desc
The CTE allows you to logically structure your thinking, and improves readability.
6 months from now when another developer has to edit this query they’ll thank you for using a CTE.
You also get bonus points for choosing a smart name for the CTE. Mine is long, but describes exactly what it is – top five clients this week that aren’t us.
Now for the big reveal – but first, I wrote a naive subquery solution to the problem to demonstrate the alternative:
Finaly Result: Subquery Vs CTE Solutions
Here is a subquery solution to our business problem
SELECT d AS date, name AS clientName, cnt
FROM (SELECT d FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) d) as weekdaysToUse --the days to use
JOIN ( --client/delivery combos
SELECT d.created_at::date, c.name, d.client_id, count(1) AS cnt
FROM clients c
LEFT JOIN deliveries d ON d.client_id = c.id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
GROUP BY d.created_at::date, c.name, d.client_id
UNION ALL
SELECT weekdaysToUse.d, c.name, c.id, 0 AS cnt
FROM clients c, (SELECT d FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) d) AS weekdaysToUse
WHERE NOT EXISTS(
SELECT 1 FROM deliveries del
WHERE del.created_at::date = weekdaysToUse.d
AND del.client_id = c.id
) --gap filler
) AS deliveries
ON weekdaysToUse.d = deliveries.created_at
WHERE deliveries.client_id IN (
SELECT c.id
FROM deliveries d
JOIN clients c on c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.name <> 'Beekeeper'
GROUP BY c.id
ORDER BY cnt DESC LIMIT 5
) --only top 5 that aren't us
ORDER BY date asc, clientName ASC
Yuck, all the requirements are just jumbled together and it’s hard to figure out what is important.
I also need to repeat date math and generate_series
in multiple locations, since I can’t reuse it without a CTE, and don’t want to scan too much data.
Now, at last, here is our complete solution featuring CTEs.
WITH weekdaysToUse AS (
SELECT date
FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) date
),
topClients AS (
SELECT c.id, c.name FROM clients c
JOIN deliveries d ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.name <> 'Beekeeper'
GROUP BY c.id
ORDER BY COUNT(d.id) DESC
LIMIT 5
),
dayClientCombosWithData AS (
SELECT wtu.date, c.name, count(1) AS cnt
FROM weekdaysToUse wtu
LEFT JOIN deliveries d ON wtu.date = d.created_at::date
JOIN topClients c ON c.id = d.client_id
GROUP BY wtu.date, c.name
),
gapFiller AS (
SELECT wtu.date, c.name, 0 as cnt
FROM weekdaysToUse wtu, topClients c
WHERE NOT EXISTS(
SELECT 1 FROM DELIVERIES d WHERE d.created_at = wtu.date
)
)
SELECT merged.date, merged.name, max(merged.cnt)
FROM (
SELECT date, name, cnt FROM dayClientCombosWithData
UNION ALL
SELECT date, name, cnt FROM gapFiller
) merged
GROUP BY merged.date, merged.name
ORDER BY merged.date, merged.name
This CTE-based SQL query is straightforward, easy to understand, and easy to edit in the future. This is why CTEs are amazing.