-1
SELECT *
FROM dim_date



ds
2018-01-01
2018-01-02
2018-01-03
...
...

SELECT *
FROM "fact_metrics$partitions"



ds
2018-01-04
2018-01-05
2018-01-06
...
...
2018-08-23
...
...

Suppose *fact_metrics *has a data pipeline that is scheduled daily and normally creates a ds partition every day on successful completion. However, if the pipeline fails on certain days, there is no partition created for that day. On any given day, the pipeline runs independent of its status from the prior day i.e. each daily run is independent of all other daily runs. This means failed runs will leave gaps in the date partitions reported by the query above

Write a query that will generate a report similar to the following for the dataset fact_metrics for the period 2018-01-01 to 2018-08-31

----------------------------------------------
partition_flag   |  start_date | end_date
----------------------------------------------
missing          |  2018-01-01 | 2018-01-03
present          |  2018-01-04 | 2018-08-23
missing          |  2018-08-24 | 2018-08-31

Supposedly I need to use the lag and lead functions but I am not sure how to make the intervals non-overlapping though

select  
      "Missing" as partition_flag,
      lag(dt.ds) over ( order by dt.ds) as Start_date,
      lead(dt.ds) over( order by dt.ds) as end_date
from   dim_date dt
    left join fact_metrics_partitions ft on dt.ds = ft.ds
where ft.ds is null
union
select  "present" as partition_flag,
        lag(dt.ds) over ( order by dt.ds) as Start_date,
      lead(dt.ds) over( order by dt.ds) as end_date
from   dim_date dt
    inner join fact_metrics_partitions ft on dt.ds = ft.ds
Gordon Linoff
  • 1,122,135
  • 50
  • 484
  • 624
SelectStarFrom
  • 101
  • 1
  • 1
  • 6

1 Answers1

0

First I create an auxiliar table naming the flags as Missing or Present, using Left join. I named it table_aux.

CREATE TABLE table_aux AS
SELECT d.ds as dds
     , f.ds as fds
     , CASE WHEN f.ds is NULL THEN 'missing'
       ELSE 'present' END AS partition_flag
FROM dim_date d
LEFT JOIN fact_metrics$partitions f ON d.ds=f.ds

So we would have:

dds         |  fds        | partition_flag
2018-01-01  | null        | missing
2018-01-02  | null        | missing
2018-01-03  | null        | missing
2018-01-04  | 2018-01-04  | present
2018-01-05  | 2018-01-05  | present
2018-01-06  | 2018-01-06  | present
2018-01-07  | null        | missing
2018-01-08  | null        | missing
2018-01-09  | null        | missing
2018-01-10  | null        | missing

Afterwars, using table_aux, I would create a new auxiliar table, just to get te lag of partition_flag. I named it lags.

CREATE TABLE lags AS
SELECT dds as date
     , partition_flag
     , lag(partition_flag,1) over (order by dds) as lag_flag

FROM table_aux

Having the output:

dds         | partition_flag | lag_flag
2018-01-01  | missing        | null
2018-01-02  | missing        | missing
2018-01-03  | missing        | missing
2018-01-04  | present        | missing
2018-01-05  | present        | present
2018-01-06  | present        | present
2018-01-07  | missing        | present
2018-01-08  | missing        | missing
2018-01-09  | missing        | missing
2018-01-10  | missing        | missing

Now we filter this table getting only the lines where the previous flag differs from the actual. And get what is the lead date from the remaining lines:

CREATE TABLE change_of_flags AS
SELECT partition_flag
     , date
     , lead(date) over (order by date) as lead_date
FROM lags
WHERE lag_flag != partition_flag

The result would be:

dds         | partition_flag | lead_date
2018-01-01  | missing        | 2018-01-04
2018-01-04  | present        | 2018-01-07
2018-01-07  | missing        | null

Ok, now we have the start date of the flag, but we have the end day ahead for onde day so we just take this one day off:

SELECT date as start_date
     , dateadd(day, -1,lead_date) as end_date
     , partition_flag
FROM change_of_flags

In this case I am assuming you have the dateadd function where you are queryng, but any date function that would subtract one day from your query will work just fine. The result is as following:

start_date  | end_date   | partition_flag
2018-01-01  | 2018-01-03 | missing        
2018-01-04  | 2018-01-06 | present        
2018-01-07  | null       | missing      

The whole query is:

with table_aux as (
SELECT d.ds as dds
     , f.ds as fds
     , CASE WHEN f.ds is NULL THEN 'missing'
       ELSE 'present' END AS partition_flag
FROM dim_date d
LEFT JOIN fact_metrics$partitions f ON d.ds=f.ds
),

lags as (
SELECT dds as date
     , partition_flag
     , lag(partition_flag,1) over (order by dds) as lag_flag

FROM table_aux
),

change_of_flags(
SELECT partition_flag
     , date
     , lead(date) over (order by date) as lead_date
FROM lags
WHERE lag_flag != partition_flag
)

SELECT date as start_date
     , dateadd(day, -1,lead_date) as end_date
     , partition_flag
FROM change_of_flags
Mariane Reis
  • 371
  • 1
  • 3
  • 13