November 18, 2022 By Lucas Gray

When it comes to writing SQL, it’s a common predicament: You start with something clear and simple that grows and grows into a tangle of subqueries and hard to follow logic.

Beekeeper Studio Is A Free & Open Source Database GUI

Best SQL query & editor tool I have ever used. It provides everything I need to manage my database. - ⭐⭐⭐⭐⭐ Mit

Beekeeper Studio is fast, intuitive, and easy to use. Beekeeper supports loads of databases, and works great on Windows, Mac and Linux.

Beekeeper's Linux version is 100% full-featured, no cut corners, no feature compromises.

Common Table Expressions (CTEs) are a powerful SQL feature that allows you to organize and streamline large queries.

What Is A CTE (Common Table Expression)?

A CTE is a way of creating a sort of temporary table that only exists for the time it takes for your query to execute.

In Postgres you define a CTE using the WITH keyword.

-- define a CTE
WITH people_who_like_cheese AS (SELECT first_name, last_name, job FROM people WHERE likes_cheese = true)


-- use the CTE like a normal table
select * from people_who_like_cheese where first_name like 'Matt%';


Not all databases support CTEs, but the most typically feature the WITH keyword. We’ll be using Postgres for the rest of this walkthrough but much of the SQL below can be used in other databases too.

Note that a CTE isn’t a real table, and only exists for the duration of your executing query, but the output of a CTE can be re-used multiple times in your query, unlike with subqueries. This has organizational and execution-efficiency benefits.

Benefits of CTEs

CTEs shine as a way to…

  1. Break down your SQL problem into smaller parts.
  2. Keep your SQL tidy and legible
  3. Reuse resultsets multiple times (unlike subqueries)

A Practical CTE Example

Consider the following use case I encountered recently.
We have two tables:

  1. deliveries
  2. clients

Each client can have many deliveries.

Answering a simple question (no CTE required here)

We want to know: How many deliveries has each client had in the last week?

Here’s a simple SQL query to answer this question

-- delivery count grouped by date and client
SELECT d.created_at::date, c.name, count(1) 
FROM deliveries d
JOIN clients c on c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
GROUP BY c.name, d.created_at::date
ORDER BY c.name DESC

The result will look something like this:

created_at name count
2016-06-23 Beekeeper 7
2016-06-24 Beekeeper 11
2016-06-27 Beekeeper 4
2016-06-28 Beekeeper 4
2016-06-29 Beekeeper 4
2016-06-27 Client A 448
2016-06-23 Client B 3
2016-06-24 Client B 3
2016-06-27 Client B 3
2016-06-28 Client B 3
2016-06-29 Client B 4
2016-06-28 Client C 21
2016-06-24 Client D 496

Pretty good for such a simple query.

Adapting the query to new requirements

A couple wild business requirements appear!

New Business Requirements

  1. We need to include dates with zero-deliveries.
  2. We only want the top 5 clients by volume.
  3. We want to exclude ourselves from the report (Beekeeper).

This shouldn’t be too hard, right?

Adding Dates With Zero Deliveries

Let’s try to tackle #1 first. We will need to start by considering every date in our daterange, instead of using only dates provided to us in the deliveries table.

It feels like there is no easy way to add this information dynamically, beyond creating a calendar table. What a pain.

Let’s see if we can combine a CTE with a nifty Postgres function named generate_series to do exactly what we need instead of creating a new table.

-- creating a CTE for all weekdays using Postgres generate_series
WITH weekdaysToUse AS (
  SELECT date 
  FROM generate_series(
    date_trunc('day', now()) - INTERVAL '7 days', 
    current_date,
    '1 day'::interval
  ) date)

The nice thing about this approach is I can now use the weekdaysToUse resultset for any subsequent CTE and the final query as well, and we’ll definitely need it.

We’re getting closer. Let’s add back the joins to deliveries and clients, then run the query and see what we get.

WITH weekdaysToUse AS (
  SELECT date 
  FROM generate_series(
    date_trunc('day', now()) - INTERVAL '7 days', 
    current_date,
    '1 day'::interval
  ) date
)

SELECT wtu.date, c.name, count(1) as cnt
FROM weekdaysToUse wtu
LEFT JOIN deliveries d ON wtu.date = d.created_at::date
JOIN clients c ON c.id = d.client_id
GROUP BY wtu.date, c.name

Result:

date name cnt
2016-06-28 Client A 4
2016-06-21 Client B 40
2016-06-21 Client C 11
2016-06-23 Client B 3
2016-06-22 Client A 4
2016-06-27 Beekeeper 448
2016-06-29 Client A 4
2016-06-21 Client A 5
2016-06-24 Client B 3
2016-06-24 Client C 496
2016-06-28 Client A 21
2016-06-27 Client A 4
2016-06-29 Beekeeper 4
2016-06-24 Client A 11
2016-06-21 Beekeeper 3
2016-06-27 Beekeeper 3
2016-06-28 Beekeeper 3
2016-06-23 Client A 7
2016-06-22 Beekeeper 4

Wait a sec, where are my zero delivery rows?

We started the query off with weekdaysToUse, but will only get a row returned where we had at least a delivery from a client for that day. Shoot.

We need a way to fill in the rows where a client did not have a delivery for that day. If we had that, we could combine it with the results from the previous, and we’d be set.

Maybe we can try to use what we’ve learned so far and build up to it with CTEs.

-- returns a row for each date/client pair. We can use this to fill the gaps
WITH gapFiller AS (
  SELECT wtu.date, c.name, 0 as cnt
  FROM weekdaysToUse wtu, clients c
  WHERE NOT EXISTS(
    SELECT 1 
    FROM deliveries d 
    WHERE d.created_at = wtu.date 
    and d.client_id = c.id
  )
)
date name cnt
2016-06-21 Client A 0
2016-06-22 Client A 0
2016-06-23 Client A 0
2016-06-24 Client A 0
2016-06-25 Client A 0
2016-06-26 Client A 0
2016-06-27 Client A 0
2016-06-28 Client A 0
2016-06-29 Client A 0
2016-06-21 Client B 0
2016-06-22 Client B 0
2016-06-23 Client B 0
2016-06-24 Client B 0
2016-06-25 Client B 0
2016-06-26 Client B 0
2016-06-27 Client B 0
2016-06-28 Client B 0
2016-06-29 Client B 0
   

If we take that resultset and UNION ALL the rows that weren’t zero earlier, we’ll be all set.

2 & 3. Fetching Only The Top 5 Clients That Aren’t Us

Remember our business requirement about top clients that aren’t us?

Here is a subquery solution,

SELECT d.created_at::date, c.name, COUNT(1) 
FROM deliveries d
JOIN clients c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
AND c.id IN (
  SELECT c1.id FROM clients c1 
  JOIN deliveries d1 ON c1.id = d1.client_id
  WHERE d1.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
    AND c1.name <> 'Beekeeper'
  GROUP BY c1.id
  ORDER BY count(d1.id) DESC
  LIMIT 5
)
group by c.name, d.created_at::date
order by c.name desc

Contrasted with the CTE solution.

WITH topFiveClientsThisWeekThatArentUs AS (
SELECT c.id, c.name FROM clients c 
  JOIN deliveries d ON c.id = d.client_id
  WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
    AND c.name <> 'Beekeeper'
  GROUP BY c.id
  ORDER BY COUNT(d.id) DESC
  LIMIT 5
)

SELECT d.created_at::date, c.name, count(1) 
FROM deliveries d
JOIN topFiveClientsThisWeekThatArentUs c ON c.id = d.client_id
WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
GROUP BY c.name, d.created_at::date
ORDER BY c.name desc

The CTE allows you to logically structure your thinking, and improves readability.

6 months from now when another developer has to edit this query they’ll thank you for using a CTE.

You also get bonus points for choosing a smart name for the CTE. Mine is long, but describes exactly what it is – top five clients this week that aren’t us.

Now for the big reveal – but first, I wrote a naive subquery solution to the problem to demonstrate the alternative:

Finaly Result: Subquery Vs CTE Solutions

Here is a subquery solution to our business problem

SELECT d AS date, name AS clientName, cnt

FROM (SELECT d FROM generate_series(
  date_trunc('day', now()) - INTERVAL '7 days',
  current_date,
  '1 day'::interval
) d) as weekdaysToUse --the days to use

JOIN ( --client/delivery combos
  SELECT d.created_at::date, c.name, d.client_id, count(1) AS cnt
  FROM clients c
  LEFT JOIN deliveries d ON d.client_id = c.id
  WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days'
  GROUP BY d.created_at::date, c.name, d.client_id 
  
  UNION ALL
  
  SELECT weekdaysToUse.d, c.name, c.id, 0 AS cnt
  FROM clients c, (SELECT d FROM generate_series(
      date_trunc('day', now()) - INTERVAL '7 days', 
      current_date,
      '1 day'::interval
    ) d) AS weekdaysToUse
  WHERE NOT EXISTS(
      SELECT 1 FROM deliveries del 
      WHERE del.created_at::date = weekdaysToUse.d
        AND del.client_id = c.id
    ) --gap filler
) AS deliveries

ON weekdaysToUse.d = deliveries.created_at

WHERE deliveries.client_id IN (
  SELECT c.id
  FROM deliveries d
  JOIN clients c on c.id = d.client_id
  WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
    AND c.name <> 'Beekeeper'
  GROUP BY c.id
  ORDER BY cnt DESC LIMIT 5
) --only top 5 that aren't us

ORDER BY date asc, clientName ASC

Yuck, all the requirements are just jumbled together and it’s hard to figure out what is important.

I also need to repeat date math and generate_series in multiple locations, since I can’t reuse it without a CTE, and don’t want to scan too much data.

Now, at last, here is our complete solution featuring CTEs.

WITH weekdaysToUse AS (
  SELECT date 
  FROM generate_series(
    date_trunc('day', now()) - INTERVAL '7 days', 
    current_date,
    '1 day'::interval
  ) date
),

topClients AS (
  SELECT c.id, c.name FROM clients c
    JOIN deliveries d ON c.id = d.client_id
    WHERE d.created_at >= date_trunc('day', now()) - INTERVAL '7 days' 
      AND c.name <> 'Beekeeper'
    GROUP BY c.id
    ORDER BY COUNT(d.id) DESC
    LIMIT 5
),

dayClientCombosWithData AS (
  SELECT wtu.date, c.name, count(1) AS cnt
  FROM weekdaysToUse wtu
  LEFT JOIN deliveries d ON wtu.date = d.created_at::date
  JOIN topClients c ON c.id = d.client_id
  GROUP BY wtu.date, c.name
),

gapFiller AS (
  SELECT wtu.date, c.name, 0 as cnt
  FROM weekdaysToUse wtu, topClients c
  WHERE NOT EXISTS(
    SELECT 1 FROM DELIVERIES d WHERE d.created_at = wtu.date
  )
)

SELECT merged.date, merged.name, max(merged.cnt)
FROM (
  SELECT date, name, cnt FROM dayClientCombosWithData
  UNION ALL
  SELECT date, name, cnt FROM gapFiller 
  ) merged
GROUP BY merged.date, merged.name
ORDER BY merged.date, merged.name

This CTE-based SQL query is straightforward, easy to understand, and easy to edit in the future. This is why CTEs are amazing.