설문조사
PostgreSQL/PPAS 관련 듣고 싶은 교육은


총 게시물 162건, 최근 0 건
   

Ignore Nulls with LAST_VALUE

글쓴이 : 모델광 날짜 : 2022-10-09 (일) 10:54 조회 : 1410

The following YouTube video prompted me to write this note.


https://www.youtube.com/watch?v=1p7gt7iu6jo&list=PLJMaoEWvHwFKP7uF4l1pXIqMEJ_RatDSq&index=18


Sometimes databases store sparse data, just the data of interest and sometimes we want to pad that out with some data.


Here, with some cosmetic changes to the test code introduced in the video, is a simple script to generate a tiny data set that we will use to pursue the thought process of meeting a requirement:


create table t1
(as_of_date    timestamp not null,
f_type         varchar(50),
f_value        numeric(23,10)
);
insert into t1 values('2022-03-31','KBS',1.0);
insert into t1 values('2022-06-30','SBS',1.1);
insert into t1 values('2022-08-31','MBC',1.2);


I have run a select statement ordering by AS_OF_DATE.


analdb=# select * from t1 order by 1;
+---------------------+--------+--------------+
| as_of_date          | f_type | f_value      |
+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | KBS    | 1.0000000000 |
| 2022-06-30 00:00:00 | SBS    | 1.1000000000 |
| 2022-08-31 00:00:00 | MBC    | 1.2000000000 |
+---------------------+--------+--------------+


Requirements

We have three rows, each one being the end day of the month. What we want to do though is actually pad out the missing values. We can see that we haven't got '2000-04-30', '2022-05-31', and '2022-07-30'. What we want to do is to be able to run some sort of magical SQL statement, which outputs the result set like this one:


+---------------------+--------+--------------+
| mth                 | f_type   | f_value      |
+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | KBS      | 1.0000000000 |
| 2022-04-30 00:00:00 | KBS      | 1.0000000000 |
| 2022-05-31 00:00:00 | KBS      | 1.0000000000 |
| 2022-06-30 00:00:00 | SBS      | 1.1000000000 |
| 2022-07-31 00:00:00 | SBS      | 1.1000000000 |
| 2022-08-31 00:00:00 | MBC     | 1.2000000000 |
+---------------------+--------+--------------+


The output padded out the data and let the F_TYPE and F_VALUE columns simply roll down as if those values were continuing throughout the missing months.


Thought process

The question is how do we do it?

Before reading on the subsequent paragraphs, try to write the query on your own. If you are unclear about what the requirement is, please watch the YouTube video first.


Here is how the thought process offered by Connor McDonald works to meet the requirements.


First we have to find out what is the lowest value for the AS_OF_DATE and what is the highst value for the AS_OF_DATE. So we can start with a very simple query like the following:


  SELECT MIN(AS_OF_DATE) LO,
           MAX(AS_OF_DATE) HI
     FROM T1;

+---------------------+---------------------+
| lo                  | hi                  |
+---------------------+---------------------+
| 2022-03-31 00:00:00 | 2022-08-31 00:00:00 |
+---------------------+---------------------+


This output gives us the range as to what we are going to use. Because we are building up our thought process we will simply refer to the output from now on as a table called boundaries. We can do that just using a WITH clause. So BOUNDARIES now becomes that information we just had and using the familiar generate_series and DATE_TRUNC functions we can now generate the six rows that we need to do filling in all the date gaps.


WITH BOUNDARIES AS (
   SELECT MIN(AS_OF_DATE) LO,  MAX(AS_OF_DATE) HI
      FROM T1)
SELECT DATE_TRUNC('MONTH', generate_series(LO, HI, interval '1 month'))+ INTERVAL '1 month -1 day' AS MONTH
  FROM BOUNDARIES;

+---------------------+
| month               |
+---------------------+
| 2022-03-31 00:00:00 |
| 2022-04-30 00:00:00 |
| 2022-05-31 00:00:00 |
| 2022-06-30 00:00:00 |
| 2022-07-31 00:00:00 |
| 2022-08-31 00:00:00 |
+---------------------+


We have to use INTERVAL '1 month -1 day' to extract the last day of the month because generate_series(lo, hi, interval '1 month') does not retrieve the last day of the month. We can name the above output as ALL_MONTHS as well. Because ALL_MONTHS is the superset of the data in the T1 table, we can now outer join it to the T1 table looking for all those AS_OF_DATE values where they are present. So we can write the following code:


WITH BOUNDARIES AS (
  SELECT MIN(AS_OF_DATE) LO,
         MAX(AS_OF_DATE) HI
     FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
  FROM BOUNDARIES
  )
SELECT B.MTH, A.AS_OF_DATE, A.F_TYPE, A.F_VALUE
  FROM ALL_MONTHS B LEFT JOIN T1 A
    ON B.MTH = A.AS_OF_DATE;


+---------------------+---------------------+--------+--------------+
| mth                             | as_of_date                 | f_type | f_value      |
+---------------------+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | 2022-03-31 00:00:00 | KBS    | 1.0000000000 |
| 2022-04-30 00:00:00 |                                    |              |                      |
| 2022-05-31 00:00:00 |                                    |              |                      |
| 2022-06-30 00:00:00 | 2022-06-30 00:00:00 | SBS    | 1.1000000000 |
| 2022-07-31 00:00:00 |                                    |              |                      |
| 2022-08-31 00:00:00 | 2022-08-31 00:00:00 | MBC    | 1.2000000000 |
+---------------------+---------------------+--------+--------------+


The problem with the output is that we have got nulls. What we want to do is to pick up the first values and roll them down to the next NOT NULL values. Let's call the above output SPARSE_DATA. SPARSE_DATA has got all the data we need except for those nulls which need to be filled in. In order to fill in those nulls we could use the analytic function LAST_VALUE(.. IGNORE NULLS) if we were on Oracle.


--The following query produces an error in PostgreSQL.

WITH BOUNDARIES AS (
  SELECT MIN(AS_OF_DATE) LO,
         MAX(AS_OF_DATE) HI
     FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
  FROM BOUNDARIES
  ),
SPARSE_DATA AS (
SELECT B.MTH, AS_OF_DATE, F_TYPE, F_VALUE
  FROM ALL_MONTHS B
  LEFT JOIN T1 A
    ON A.AS_OF_DATE = B.MTH
)

SELECT MTH

          , LAST_VALUE(f_type ignore nulls) OVER (order by mth) as f_type,

          , LAST_VALUE(f_value ignore nulls) OVER (order by mth) as f_value

   FROM SPARSE_DATA

ORDER BY 1;


Even though PostgreSQL does not provide the IGNORE NULLS option in the LAST_VALUE() function, there are several workarounds to implement the Oracle feature. Here is one of those workarounds:


WITH BOUNDARIES AS (
  SELECT MIN(AS_OF_DATE) LO,
         MAX(AS_OF_DATE) HI
     FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
  FROM BOUNDARIES
  ),
SPARSE_DATA AS (
SELECT B.MTH, AS_OF_DATE, F_TYPE, F_VALUE
  FROM ALL_MONTHS B
  LEFT JOIN T1 A
    ON A.AS_OF_DATE = B.MTH
)
SELECT MTH
        , (SELECT F_TYPE
            FROM SPARSE_DATA B
           WHERE B.MTH <= A.MTH
               AND B.F_TYPE IS NOT NULL 
          ORDER BY MTH DESC
         FETCH NEXT 1 ROWS ONLY) AS F_TYPE
        , (SELECT F_VALUE
           FROM SPARSE_DATA B
          WHERE B.MTH <= A.MTH
              AND B.F_VALUE IS NOT NULL
          ORDER BY MTH DESC
        FETCH NEXT 1 ROWS ONLY) AS F_VALUE
  FROM SPARSE_DATA A
ORDER BY 1;


In the query I picked up the month and used a scalr subquery ignoring any nulls and picking up the last information ordered by month. We get the following output by running the query:


+---------------------+--------+--------------+
| mth                 | f_type | f_value      |
+---------------------+--------+--------------+
| 2022-03-31 00:00:00 | KBS    | 1.0000000000 |
| 2022-04-30 00:00:00 | KBS    | 1.0000000000 |
| 2022-05-31 00:00:00 | KBS    | 1.0000000000 |
| 2022-06-30 00:00:00 | SBS    | 1.1000000000 |
| 2022-07-31 00:00:00 | SBS    | 1.1000000000 |
| 2022-08-31 00:00:00 | MBC    | 1.2000000000 |
+---------------------+--------+--------------+


We are done. We have filled in the blanks with a dynamic range of months without having any nulls as well.


Conclusion

1. We can use the DATE_TRUNC and generate_series functions to fill in the date gaps.

2. We can use the scalar subquery or the lateral clause to get the functionality of LAST_VALUE(... IGNORE NULLS) of Oracle.


Footnote

The solution I presented in this note works and yields the expected resuts, but has two major disadvantages:

1. It is much more convoluted than the Oracle's version.

2. If the dataset is large, it is ineviable to see an hourglass.

You could probably write this simpler and more compact using some more advanced SQL features. I leave it as an exercise for you readers to find out a better solution.


Addendum

When it comes to satisfying the requirements and building performance into your systems, it is important to think about performance up front. But that does not mean that you have to over engineer everything. If your query outputs the result set in one or two seconds, you'd better stop putting in more effort. You don't have to have every single query in your application be blazingly fast. You might be spending a lot of effort and time for not much benefit. But in this small test code, it is worth optimizing to the extreme for exercise in performance tuning.


Added on Nov 5, 2022


The following is the query using the lateral clause:


WITH RECURSIVE boundaries AS (
  SELECT MIN(AS_OF_DATE) LO,  MAX(AS_OF_DATE) HI
      FROM T1
  ),
ALL_MONTHS AS (
  SELECT LO as MTH FROM boundaries
  UNION ALL
  SELECT date_trunc('MONTH',a.MTH + interval '1 month') + '1 month - 1day'
    FROM all_months a, boundaries b
   WHERE a.MTH < b.HI
  )
SELECT x.MTH, y.f_type, y.f_value
  FROM all_months x
  LEFT JOIN LATERAL (SELECT f_type, f_value
                                  FROM T1
                                WHERE T1.as_of_date <= x.MTH
                              ORDER BY T1.as_of_date DESC
                                LIMIT 1
                              ) y ON true;


Added on Dec 27, 2022

These days I am writing up "Oracle to PostgreSQL SQL Statement Conversion Guide". I've often made the point that you should avoid visiting a table twice in a query. So I re-enginnered the query using the COUNT() OVER () function:
Surely if the data set gets big, the following query outperforms the previous queries.

WITH BOUNDARIES AS (
  SELECT MIN(AS_OF_DATE) LO,
         MAX(AS_OF_DATE) HI
     FROM T1),
ALL_MONTHS AS (
SELECT DATE_TRUNC('MONTH',generate_series(lo, hi, interval '1 month'))+INTERVAL '1 MONTH -1 DAY' AS MTH
  FROM BOUNDARIES
  ),
SPARSE_DATA AS (
SELECT B.MTH, A.AS_OF_DATE, A.F_TYPE, A.F_VALUE
  FROM ALL_MONTHS B LEFT JOIN T1 A
    ON B.MTH = A.AS_OF_DATE
)
SELECT MTH
     , FIRST_VALUE(AS_OF_DATE) OVER (PARTITION BY FLAG ORDER BY MTH) AS AS_OF_DATE
     , FIRST_VALUE(F_TYPE)     OVER (PARTITION BY FLAG ORDER BY MTH) AS F_TYPE
     , FIRST_VALUE(F_VALUE)    OVER (PARTITION BY FLAG ORDER BY MTH) AS F_VALUE
  FROM (
     SELECT MTH, AS_OF_DATE, F_TYPE, F_VALUE
          , COUNT(CASE WHEN F_TYPE IS NOT NULL THEN F_TYPE END) OVER (ORDER BY MTH) AS FLAG
       FROM SPARSE_DATA
       ) AS FOO;


   

postgresdba.com