When it comes to writing SQL, it’s a common predicament: You start with something clear and simple that grows and grows into something completely illegible. One subselect becomes a subselect within a subselect, quickly turning into a ball of mud. Enter Common Table Expressions (CTEs).
Benefits of Common Table Expressions
CTEs shine as a way to…
- Break down your problem into smaller bits
- Keep things things tidy and legible
- Reuse resultsets (unlike subqueries)
Think of CTEs as temporary resultsets you can use in the rest of your query. Sometimes they’re called WITH clauses.
Not all databases support CTEs, but the ones that do typically feature a syntax like the following…
WITH commonTableExpression AS (SELECT a,b,c FROM stuff WHERE...)
A Practical Example
Consider the following use case I encountered recently. We have two tables: deliveries and clients. Clients have many deliveries. We need to figure out how many deliveries each client has had in the last week, broken down by day. So I whipped up a simple query…
SELECT d.created_at::date, c.name, count(1)
FROM deliveries d
JOIN clients c on c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
GROUP BY c.name, d.created_at::date
ORDER BY c.name DESC
In English, we’re asking for delivery counts, broken down by client, for the last week. We get something like this –
"created_at","name","count"
"2016-06-23","Beekeeper","7"
"2016-06-24","Beekeeper","11"
"2016-06-27","Beekeeper","4"
"2016-06-28","Beekeeper","4"
"2016-06-29","Beekeeper","4"
"2016-06-27","Client A","448"
"2016-06-23","Client B","3"
"2016-06-24","Client B","3"
"2016-06-27","Client B","3"
"2016-06-28","Client B","3"
"2016-06-29","Client B","4"
"2016-06-28","Client C","21"
"2016-06-24","Client D","496"
..etc
Not too shabby for such a simple query. But a couple wild business requirements appear!
Business Requirements
- We need rows for days where no clients delivered anything (one row per client) – filling the gaps.
- If we have more than 5 clients, let’s just grab the top five for the week, and don’t include us (Beekeeper) in the report.
This shouldn’t be too hard, right?
Requirement 1: Filling Gaps
Let’s try to tackle #1 first. We will need to start by considering every date in our daterange, instead of using only dates provided to us in the deliveries table. No easy way to add that dynamically springs to mind, and the best answer that comes up in a google search involves a calendar table. At some point I suppose it usually seems necessary to create one to join to for these sorts of things. But what a pain – let’s just write some SQL!
Let’s see if we can combine a CTE with a nifty Postgres function named generate_series to do exactly what we need.
WITH weekdaysToUse AS (
SELECT date
FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) date)
Great, CTEs and a Postgres function bailed us out. The nice thing about this approach is I can now use the weekdaysToUse
resultset for any subsequent CTE and the final query as well, and we’ll definitely need it.
We’re getting closer. Let’s add the joins to deliveries and clients, then run the query and see what we get.
WITH weekdaysToUse AS (
SELECT date
FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) date
)
SELECT wtu.date, c.name, count(1) as cnt
FROM weekdaysToUse wtu
LEFT JOIN deliveries d ON wtu.date = d.created_at::date
JOIN clients c ON c.id = d.client_id
GROUP BY wtu.date, c.name
Result:
"date","name","cnt"
"2016-06-28","Client A","4"
"2016-06-21","Client B","40"
"2016-06-21","Client C","11"
"2016-06-23","Client B","3"
"2016-06-22","Client A","4"
"2016-06-27","Beekeeper","448"
"2016-06-29","Client A","4"
"2016-06-21","Client A","5"
"2016-06-24","Client B","3"
"2016-06-24","Client C","496"
"2016-06-28","Client A","21"
"2016-06-27","Client A","4"
"2016-06-29","Beekeeper","4"
"2016-06-24","Client A","11"
"2016-06-21","Beekeeper","3"
"2016-06-27","Beekeeper","3"
"2016-06-28","Beekeeper","3"
"2016-06-23","Client A","7"
"2016-06-22","Beekeeper","4"
Wait a sec, where are my zero delivery rows? We started the query off with weekdaysToUse, but will only get a row returned where we had at least a delivery from a client for that day. Shoot. Let’s make a query to fill in rows where a client did not have a delivery for that day. If we had that resultset, we could combine it with the results from the previous, and we’d be set.
Maybe we can try to use what we’ve learned so far and build up to it with CTEs.
WITH gapFiller AS (
SELECT wtu.date, c.name, 0 as cnt
FROM weekdaysToUse wtu, clients c
WHERE NOT EXISTS(
SELECT 1
FROM deliveries d
WHERE d.created_at = wtu.date
and d.client_id = c.id
)
)
"date","name","cnt"
"2016-06-21","Client A","0"
"2016-06-22","Client A","0"
"2016-06-23","Client A","0"
"2016-06-24","Client A","0"
"2016-06-25","Client A","0"
"2016-06-26","Client A","0"
"2016-06-27","Client A","0"
"2016-06-28","Client A","0"
"2016-06-29","Client A","0"
"2016-06-21","Client B","0"
"2016-06-22","Client B","0"
"2016-06-23","Client B","0"
"2016-06-24","Client B","0"
"2016-06-25","Client B","0"
"2016-06-26","Client B","0"
"2016-06-27","Client B","0"
"2016-06-28","Client B","0"
"2016-06-29","Client B","0"
"2016-06-21","Client C","0"
"2016-06-22","Client C","0"
"2016-06-23","Client C","0"
"2016-06-24","Client C","0"
"2016-06-25","Client C","0"
"2016-06-26","Client C","0"
"2016-06-27","Client C","0"
"2016-06-28","Client C","0"
"2016-06-29","Client C","0"
"2016-06-21","Client D","0"
"2016-06-22","Client D","0"
"2016-06-23","Client D","0"
"2016-06-24","Client D","0"
"2016-06-25","Client D","0"
"2016-06-26","Client D","0"
"2016-06-27","Client D","0"
"2016-06-28","Client D","0"
"2016-06-29","Client D","0"
If we take that resultset and UNION ALL
the rows that weren’t zero earlier, we’ll be all set.
Requirement 2: Using Only Top Clients
Remember our business requirement about top clients that aren’t us?
Here is a subquery solution,
SELECT d.created_at::date, c.name, COUNT(1)
FROM deliveries d
JOIN clients c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.id IN (
SELECT c1.id FROM clients c1
JOIN deliveries d1 ON c1.id = d1.client_id
WHERE d1.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c1.name <> 'Beekeeper'
GROUP BY c1.id
ORDER BY count(d1.id) DESC
LIMIT 5
)
group by c.name, d.created_at::date
order by c.name desc
Contrasted with the CTE solution.
WITH topFiveClientsThisWeekThatArentUs AS (
SELECT c.id, c.name FROM clients c
JOIN deliveries d ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.name <> 'Beekeeper'
GROUP BY c.id
ORDER BY COUNT(d.id) DESC
LIMIT 5
)
SELECT d.created_at::date, c.name, count(1)
FROM deliveries d
JOIN topFiveClientsThisWeekThatArentUs c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
GROUP BY c.name, d.created_at::date
ORDER BY c.name desc
The CTE allows you to logically structure your thinking, and improves readability 6 months from now when another developer has to grasp what the query you wrote was trying to do. Bonus points for choosing a smart name for the intermediate resultset. Mine is long, but describes exactly what it is – top five clients this week that aren’t us.
Putting It All Together
Now for the big reveal – but first, I wrote a naive subquery solution to the problem to demonstrate the alternative:
Subquery Solution
SELECT d AS date, name AS clientName, cnt
FROM (SELECT d FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) d) as weekdaysToUse --the days to use
JOIN ( --client/delivery combos
SELECT d.created_at::date, c.name, d.client_id, count(1) AS cnt
FROM clients c
LEFT JOIN deliveries d ON d.client_id = c.id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
GROUP BY d.created_at::date, c.name, d.client_id
UNION ALL
SELECT weekdaysToUse.d, c.name, c.id, 0 AS cnt
FROM clients c, (SELECT d FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) d) AS weekdaysToUse
WHERE NOT EXISTS(
SELECT 1 FROM deliveries del
WHERE del.created_at::date = weekdaysToUse.d
AND del.client_id = c.id
) --gap filler
) AS deliveries
ON weekdaysToUse.d = deliveries.created_at
WHERE deliveries.client_id IN (
SELECT c.id
FROM deliveries d
JOIN clients c on c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.name <> 'Beekeeper'
GROUP BY c.id
ORDER BY cnt DESC LIMIT 5
) --only top 5 that aren't us
ORDER BY date asc, clientName ASC
Yuck, all the requirements are just jumbled together. I also need to specify the date math and generate series all over the place, since I can’t reuse it without a CTE, and don’t want to scan too much data.
Now, at last, here is our complete solution featuring CTEs.
CTE Solution
WITH weekdaysToUse AS (
SELECT date
FROM generate_series(
date_trunc('day', now()) - INTERVAL '7 days',
current_date,
'1 day'::interval
) date
),
topClients AS (
SELECT c.id, c.name FROM clients c
JOIN deliveries d ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
AND c.name <> 'Beekeeper'
GROUP BY c.id
ORDER BY COUNT(d.id) DESC
LIMIT 5
),
dayClientCombosWithData AS (
SELECT wtu.date, c.name, count(1) AS cnt
FROM weekdaysToUse wtu
LEFT JOIN deliveries d ON wtu.date = d.created_at::date
JOIN topClients c ON c.id = d.client_id
GROUP BY wtu.date, c.name
),
gapFiller AS (
SELECT wtu.date, c.name, 0 as cnt
FROM weekdaysToUse wtu, topClients c
WHERE NOT EXISTS(
SELECT 1 FROM DELIVERIES d WHERE d.created_at = wtu.date
)
)
SELECT merged.date, merged.name, max(merged.cnt)
FROM (
SELECT date, name, cnt FROM dayClientCombosWithData
UNION ALL
SELECT date, name, cnt FROM gapFiller
) merged
GROUP BY merged.date, merged.name
ORDER BY merged.date, merged.name
Straightforward, easy to understand at every step. Not too shabby!