The following YouTube video prompted me to write this note.
https://www.youtube.com/watch?v=1p7gt7iu6jo&list=PLJMaoEWvHwFKP7uF4l1pXIqMEJ_RatDSq&index=18
Sometimes databases store sparse data, keeping just the rows of interest, and sometimes we want to pad that data out to a dense series.
Here,
with some cosmetic changes to the test code introduced in the video, is
a simple script to generate a tiny data set that we will use to pursue
the thought process of meeting a requirement:
create table t1
(as_of_date timestamp not null,
f_type varchar(50),
f_value numeric(23,10)
);
insert into t1 values('2022-03-31','KBS',1.0);
insert into t1 values('2022-06-30','SBS',1.1);
insert into t1 values('2022-08-31','MBC',1.2);
I have run a select statement ordering by AS_OF_DATE.
analdb=# select * from t1 order by 1;
+---------------------+--------+--------------+
| as_of_date | f_type | f_value |
+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | KBS | 1.0000000000 |
| 2022-06-30 00:00:00 | SBS | 1.1000000000 |
| 2022-08-31 00:00:00 | MBC | 1.2000000000 |
+---------------------+--------+--------------+
Requirements
We
have three rows, each one being the end day of the month. What we want
to do, though, is pad out the missing months. We can see that we
haven't got '2022-04-30', '2022-05-31', and '2022-07-31'. What we want
to do is to be able to run some sort of magical SQL statement, which
outputs the result set like this one:
+---------------------+--------+--------------+
| mth | f_type | f_value |
+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | KBS | 1.0000000000 |
| 2022-04-30 00:00:00 | KBS | 1.0000000000 |
| 2022-05-31 00:00:00 | KBS | 1.0000000000 |
| 2022-06-30 00:00:00 | SBS | 1.1000000000 |
| 2022-07-31 00:00:00 | SBS | 1.1000000000 |
| 2022-08-31 00:00:00 | MBC | 1.2000000000 |
+---------------------+--------+--------------+
The
output padded out the data and let the F_TYPE and F_VALUE columns
simply roll down, as if those values continued throughout the
missing months.
Thought process
The question is how do we do it?
Before
reading the subsequent paragraphs, try to write the query on your
own. If you are unclear about what the requirement is, please watch the
YouTube video first.
Here is how the thought process offered by Connor McDonald meets the requirement.
First
we have to find the lowest and the highest values of AS_OF_DATE. So we
can start with a very simple query like the following:
SELECT MIN(AS_OF_DATE) LO,
MAX(AS_OF_DATE) HI
FROM T1;
+---------------------+---------------------+
| lo | hi |
+---------------------+---------------------+
| 2022-03-31 00:00:00 | 2022-08-31 00:00:00 |
+---------------------+---------------------+
This
output gives us the range we are going to work with. Because we are
building up our thought process, from now on we will simply refer to
this output as a table called BOUNDARIES. We can do that using a WITH
clause. So BOUNDARIES now holds the information we just had, and using
the familiar generate_series and DATE_TRUNC functions we can generate
the six rows we need to fill in all the date gaps.
WITH BOUNDARIES AS (
SELECT MIN(AS_OF_DATE) LO, MAX(AS_OF_DATE) HI
FROM T1)
SELECT DATE_TRUNC('MONTH', generate_series(LO, HI, interval '1 month')) + INTERVAL '1 month -1 day' AS MONTH
FROM BOUNDARIES;
+---------------------+
| month |
+---------------------+
| 2022-03-31 00:00:00 |
| 2022-04-30 00:00:00 |
| 2022-05-31 00:00:00 |
| 2022-06-30 00:00:00 |
| 2022-07-31 00:00:00 |
| 2022-08-31 00:00:00 |
+---------------------+
We have to add INTERVAL '1 month -1 day' to get the last day of each month, because generate_series(lo, hi, interval '1 month') by itself does not land on month ends. We can name the above output ALL_MONTHS. Because ALL_MONTHS is a superset of the dates in the T1 table, we can now
outer join it to the T1 table, looking for the AS_OF_DATE values
that are present. So we can write the following code:
WITH BOUNDARIES AS (
SELECT MIN(AS_OF_DATE) LO,
MAX(AS_OF_DATE) HI
FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
FROM BOUNDARIES
)
SELECT B.MTH, A.AS_OF_DATE, A.F_TYPE, A.F_VALUE
FROM ALL_MONTHS B LEFT JOIN T1 A
ON B.MTH = A.AS_OF_DATE;
+---------------------+---------------------+--------+--------------+
| mth | as_of_date | f_type | f_value |
+---------------------+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | 2022-03-31 00:00:00 | KBS | 1.0000000000 |
| 2022-04-30 00:00:00 | | | |
| 2022-05-31 00:00:00 | | | |
| 2022-06-30 00:00:00 | 2022-06-30 00:00:00 | SBS | 1.1000000000 |
| 2022-07-31 00:00:00 | | | |
| 2022-08-31 00:00:00 | 2022-08-31 00:00:00 | MBC | 1.2000000000 |
+---------------------+---------------------+--------+--------------+
The
problem with this output is that we have got nulls. What we want to do
is pick up each non-null value and roll it down until the next non-null
value appears. Let's call the above output SPARSE_DATA. SPARSE_DATA has
all the data we need except for those nulls, which need to be filled in.
To fill them in we could use the analytic function
LAST_VALUE(... IGNORE NULLS) if we were on Oracle.
--The following query produces an error in PostgreSQL.
WITH BOUNDARIES AS (
SELECT MIN(AS_OF_DATE) LO,
MAX(AS_OF_DATE) HI
FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
FROM BOUNDARIES
),
SPARSE_DATA AS (
SELECT B.MTH, AS_OF_DATE, F_TYPE, F_VALUE
FROM ALL_MONTHS B
LEFT JOIN T1 A
ON A.AS_OF_DATE = B.MTH
)
SELECT MTH
, LAST_VALUE(f_type ignore nulls) OVER (order by mth) as f_type
, LAST_VALUE(f_value ignore nulls) OVER (order by mth) as f_value
FROM SPARSE_DATA
ORDER BY 1;
Even though PostgreSQL does
not provide the IGNORE NULLS option in the LAST_VALUE() function, there
are several workarounds that emulate the Oracle feature. Here is one of
them:
WITH BOUNDARIES AS (
SELECT MIN(AS_OF_DATE) LO,
MAX(AS_OF_DATE) HI
FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
FROM BOUNDARIES
),
SPARSE_DATA AS (
SELECT B.MTH, AS_OF_DATE, F_TYPE, F_VALUE
FROM ALL_MONTHS B
LEFT JOIN T1 A
ON A.AS_OF_DATE = B.MTH
)
SELECT MTH
, (SELECT F_TYPE
FROM SPARSE_DATA B
WHERE B.MTH <= A.MTH
AND B.F_TYPE IS NOT NULL
ORDER BY MTH DESC
FETCH NEXT 1 ROWS ONLY) AS F_TYPE
, (SELECT F_VALUE
FROM SPARSE_DATA B
WHERE B.MTH <= A.MTH
AND B.F_VALUE IS NOT NULL
ORDER BY MTH DESC
FETCH NEXT 1 ROWS ONLY) AS F_VALUE
FROM SPARSE_DATA A
ORDER BY 1;
In
the query I picked up the month and used a scalar subquery that ignores
nulls and picks up the latest information ordered by month. We get the
following output by running the query:
+---------------------+--------+--------------+
| mth | f_type | f_value |
+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | KBS | 1.0000000000 |
| 2022-04-30 00:00:00 | KBS | 1.0000000000 |
| 2022-05-31 00:00:00 | KBS | 1.0000000000 |
| 2022-06-30 00:00:00 | SBS | 1.1000000000 |
| 2022-07-31 00:00:00 | SBS | 1.1000000000 |
| 2022-08-31 00:00:00 | MBC | 1.2000000000 |
+---------------------+--------+--------------+
We are done. We have filled in the blanks over a dynamic range of months, with no nulls left.
Conclusion
1. We can use the DATE_TRUNC and generate_series functions to fill in the date gaps.
2. We can use a scalar subquery or a LATERAL join to get the functionality of Oracle's LAST_VALUE(... IGNORE NULLS).
Footnote
The solution I presented in this note works and yields the expected results, but it has two major disadvantages:
1. It is much more convoluted than Oracle's version.
2. If the data set is large, you will inevitably be watching an hourglass.
You
could probably write this simpler and more compactly using some more
advanced SQL features. I leave it as an exercise for the reader to find
a better solution.
Addendum
When it comes to meeting requirements, it is important to think about performance up front.
But that does not mean that you have to over-engineer everything. If
your query outputs its result set in one or two seconds, you had better
stop putting in more effort. Not every single query in your application
has to be blazingly fast; you might be spending a lot of effort and
time for not much benefit. But this small test case is worth optimizing
to the extreme as an exercise in performance tuning.
Added on Nov 5, 2022
The following is the query using a LATERAL join:
WITH RECURSIVE boundaries AS (
SELECT MIN(AS_OF_DATE) LO, MAX(AS_OF_DATE) HI
FROM T1
),
ALL_MONTHS AS (
SELECT LO as MTH FROM boundaries
UNION ALL
SELECT date_trunc('MONTH', a.MTH + interval '1 month') + interval '1 month -1 day'
FROM all_months a, boundaries b
WHERE a.MTH < b.HI
)
SELECT x.MTH, y.f_type, y.f_value
FROM all_months x
LEFT JOIN LATERAL (SELECT f_type, f_value
FROM T1
WHERE T1.as_of_date <= x.MTH
ORDER BY T1.as_of_date DESC
LIMIT 1
) y ON true;
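Since the LATERAL subquery runs once per generated month, each probe scans T1 for the latest row at or before that month. An index on AS_OF_DATE would let each probe become a single index lookup. This is a suggestion of mine, not part of the original solution, and the index name is made up:

-- Hypothetical supporting index (my suggestion, not from the video):
-- a descending index on as_of_date matches the
-- ORDER BY as_of_date DESC LIMIT 1 probe inside the LATERAL subquery.
CREATE INDEX t1_as_of_date_idx ON T1 (as_of_date DESC);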
Added on Dec 27, 2022
These days I am writing up an "Oracle to PostgreSQL SQL Statement Conversion Guide". I have often made the point that you should avoid visiting a table twice in a query, so I re-engineered the query using a running COUNT() OVER ():
If the data set gets big, this query should outperform the previous ones.
WITH BOUNDARIES AS (
SELECT MIN(AS_OF_DATE) LO,
MAX(AS_OF_DATE) HI
FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
FROM BOUNDARIES
),
SPARSE_DATA AS (
SELECT B.MTH, A.AS_OF_DATE, A.F_TYPE, A.F_VALUE
FROM ALL_MONTHS B LEFT JOIN T1 A
ON B.MTH = A.AS_OF_DATE
)
SELECT MTH
, FIRST_VALUE(AS_OF_DATE) OVER (PARTITION BY FLAG ORDER BY MTH) AS AS_OF_DATE
, FIRST_VALUE(F_TYPE) OVER (PARTITION BY FLAG ORDER BY MTH) AS F_TYPE
, FIRST_VALUE(F_VALUE) OVER (PARTITION BY FLAG ORDER BY MTH) AS F_VALUE
FROM (
SELECT MTH, AS_OF_DATE, F_TYPE, F_VALUE
, COUNT(CASE WHEN F_TYPE IS NOT NULL THEN F_TYPE END) OVER (ORDER BY MTH) AS FLAG
FROM SPARSE_DATA
) AS FOO;
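To see why this works, it helps to run the inner query on its own. The running count increments only at rows where F_TYPE is not null, so every null row inherits the FLAG of the last non-null row above it, and FIRST_VALUE() within each FLAG partition returns exactly that non-null row's values. Here is a sketch of the inner query against the sample data; note that COUNT(F_TYPE) is equivalent to the CASE expression above, since COUNT of a column already skips nulls:

-- Assumes the BOUNDARIES / ALL_MONTHS / SPARSE_DATA CTEs from the query above.
-- The running count only increments at non-null rows, grouping each
-- non-null row with the null rows that follow it:
SELECT MTH, F_TYPE
     , COUNT(F_TYPE) OVER (ORDER BY MTH) AS FLAG
FROM SPARSE_DATA;
--  mth                 | f_type | flag
--  2022-03-31 00:00:00 | KBS    |    1
--  2022-04-30 00:00:00 |        |    1
--  2022-05-31 00:00:00 |        |    1
--  2022-06-30 00:00:00 | SBS    |    2
--  2022-07-31 00:00:00 |        |    2
--  2022-08-31 00:00:00 | MBC    |    3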