SELECT *
FROM dim_date
ds
2018-01-01
2018-01-02
2018-01-03
...
...
SELECT *
FROM "fact_metrics$partitions"
ds
2018-01-04
2018-01-05
2018-01-06
...
...
2018-08-23
...
...
Suppose *fact_metrics *has a data pipeline that is scheduled daily and normally creates a ds partition every day on successful completion. However, if the pipeline fails on certain days, there is no partition created for that day. On any given day, the pipeline runs independent of its status from the prior day i.e. each daily run is independent of all other daily runs. This means failed runs will leave gaps in the date partitions reported by the query above
Write a query that will generate a report similar to the following for the dataset fact_metrics for the period 2018-01-01 to 2018-08-31
----------------------------------------------
partition_flag | start_date | end_date
----------------------------------------------
missing | 2018-01-01 | 2018-01-03
present | 2018-01-04 | 2018-08-23
missing | 2018-08-24 | 2018-08-31
Supposedly I need to use the lag and lead functions but I am not sure how to make the intervals non-overlapping though
select
"Missing" as partition_flag,
lag(dt.ds) over ( order by dt.ds) as Start_date,
lead(dt.ds) over( order by dt.ds) as end_date
from dim_date dt
left join fact_metrics_partitions ft on dt.ds = ft.ds
where ft.ds is null
union
select "present" as partition_flag,
lag(dt.ds) over ( order by dt.ds) as Start_date,
lead(dt.ds) over( order by dt.ds) as end_date
from dim_date dt
inner join fact_metrics_partitions ft on dt.ds = ft.ds