Premise
I have an ETL pipeline where I'm upserting certain partitions on large fact tables (300-400 million rows). For simplicity's sake, this is my DataFrame:
display(delta_df)
id  name   age  salary  last_modified_date  Year  Month
1   John   30   2000    2019-06-01          2019  6      # this should stay.
2   Peter  35   1500    2018-08-02          2018  9      # duplicate record, will be removed after union.
Year and Month are my Hive partition columns.

This is my complete fact table:
display(fact_df)
id  name   age  salary  last_modified_date  Year  Month
1   John   30   1000    2019-05-01          2019  6      # this should stay.
2   Peter  35   1500    2018-08-02          2018  9      # duplicate record.
3   Gabe   21   800     2015-02-03          2015  2      # this row should be filtered out.
4   Oscar  29   2000    2020-05-04          2020  6      # this row should be filtered out.
5   Anna   20   1200    2010-11-05          2018  9      # this should stay.
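
For anyone who wants to reproduce this, something like the following should rebuild both frames (a minimal sketch; last_modified_date is kept as a string for brevity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ['id', 'name', 'age', 'salary', 'last_modified_date', 'Year', 'Month']

# Incoming delta: the partitions we actually want to read from the fact table.
delta_df = spark.createDataFrame(
    [(1, 'John', 30, 2000, '2019-06-01', 2019, 6),
     (2, 'Peter', 35, 1500, '2018-08-02', 2018, 9)], cols)

# Full fact table: only the (2019, 6) and (2018, 9) partitions should survive the filter.
fact_df = spark.createDataFrame(
    [(1, 'John', 30, 1000, '2019-05-01', 2019, 6),
     (2, 'Peter', 35, 1500, '2018-08-02', 2018, 9),
     (3, 'Gabe', 21, 800, '2015-02-03', 2015, 2),
     (4, 'Oscar', 29, 2000, '2020-05-04', 2020, 6),
     (5, 'Anna', 20, 1200, '2010-11-05', 2018, 9)], cols)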
Problem
Now, before doing a union and a row_number to de-dupe the data and apply any business logic, I want to read in only the partitions that exist in my first DataFrame.

I know I can do this quite manually using isin calls:
# collect() returns Row objects, so extract the plain values first
years = [r['Year'] for r in delta_df.select('Year').distinct().collect()]
months = [r['Month'] for r in delta_df.select('Month').distinct().collect()]
fact_df.filter(col('Year').isin(years) & col('Month').isin(months))
However, as this is part of an ETL pipeline, I need to make this dynamic. I've attempted to unpack a dictionary into a list of conditions, but I can't figure out how to chain them together with &:
from pyspark.sql.functions import col
[col(k).isin(v) for k,v in {'Year' : [2019,2020], 'Month' : [4,5]}.items()]
out:
[Column<b'(Year IN (2019, 2020))'>, Column<b'(Month IN (4, 5))'>]
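
One idea I've been toying with is folding that list together with functools.reduce, which should be equivalent to writing the &s by hand (a sketch, using the same hard-coded dict as above), but I'm not sure whether this is the idiomatic way to do it:

from functools import reduce
from operator import and_

from pyspark.sql.functions import col

filters = {'Year': [2019, 2020], 'Month': [4, 5]}

# Build one Column predicate per key, then fold them together with & (logical AND)
conditions = [col(k).isin(v) for k, v in filters.items()]
fact_df.filter(reduce(and_, conditions))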
I guess I could use eval and build up the code as a string, but that feels wrong, and it would be both dangerous and tedious when dealing with datetime objects:
fact_df.filter(
    eval(' & '.join(
        [f"col('{k}').isin({v})" for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()]
    ))
)
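
Another angle I've considered is skipping the predicate-building entirely with a left-semi join on the partition columns (a sketch; partition_cols and filtered_df are just illustrative names, and note this keeps exact (Year, Month) pairs from the delta, whereas per-column isin filters keep the cross product of the value lists):

partition_cols = ['Year', 'Month']

# Keep only fact rows whose (Year, Month) combination exists in the delta
filtered_df = fact_df.join(
    delta_df.select(*partition_cols).distinct(),
    on=partition_cols,
    how='leftsemi',
)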
Question
How can I safely filter my DataFrame based on dynamic variables that will only be available at execution time? This dataset has 'Year' and 'Month', but another may have 'Year', 'Month', 'DayofYear' and 'PostalDistrict'.
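
For what it's worth, I imagine the filters dict itself would be built at run time from whatever partition columns the dataset declares, something like this sketch (partition_cols here is hypothetical and would come from pipeline config):

partition_cols = ['Year', 'Month']  # another dataset might declare ['Year', 'Month', 'DayofYear', 'PostalDistrict']

# One distinct-value list per partition column, pulled from the incoming delta
filters = {
    c: [r[c] for r in delta_df.select(c).distinct().collect()]
    for c in partition_cols
}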