
Premise

I have an ETL pipeline where I'm upserting certain partitions on large fact tables (300-400 million rows). For simplicity's sake, this is my dataframe:

display(delta_df)
id  name   age  salary  last_modified_date  Year  Month
 1  John    30    2000          2019-06-01  2019      6  # this should stay
 2  Peter   35    1500          2018-08-02  2018      9  # duplicate record, will be removed after the union

Year and Month are my Hive partition columns.

This is my complete fact table:

display(fact_df)
id  name   age  salary  last_modified_date  Year  Month
 1  John    30    1000          2019-05-01  2019      6  # this should stay
 2  Peter   35    1500          2018-08-02  2018      9  # duplicate record
 3  Gabe    21     800          2015-02-03  2015      2  # this row should be filtered out
 4  Oscar   29    2000          2020-05-04  2020      6  # this row should be filtered out
 5  Anna    20    1200          2010-11-05  2018      9  # this should stay

Problem

Now, before doing a union and a row_number to de-dupe the data and apply any business logic, I want to read in only the partitions of the fact table that exist in my first dataframe.
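For context, a minimal sketch of that union + row_number dedupe step (assuming, as the sample data suggests, that the latest last_modified_date wins):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# union the (partition-filtered) fact rows with the delta rows,
# then keep only the most recently modified row per id
w = Window.partitionBy('id').orderBy(col('last_modified_date').desc())
deduped = (fact_df.unionByName(delta_df)
           .withColumn('rn', row_number().over(w))
           .filter(col('rn') == 1)
           .drop('rn'))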

I know I can do this quite manually using `isin` method calls.

However, as this is part of an ETL pipeline, I need to make it dynamic:

years = [r['Year'] for r in delta_df.select('Year').distinct().collect()]
months = [r['Month'] for r in delta_df.select('Month').distinct().collect()]

fact_df.filter(col('Year').isin(years) & col('Month').isin(months))

I've attempted to unpack a dictionary and pass it in, but I can't figure out how to chain the `&`:

from pyspark.sql.functions import col

[col(k).isin(v) for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()]
out:
[Column<b'(Year IN (2019, 2020))'>, Column<b'(Month IN (4, 5))'>]

I guess I could use eval and build up the code as a string, but that feels wrong, possibly dangerous, and tedious when dealing with datetime objects:

fact_df.filter(
    eval(
        '&'.join([f"col('{k}').isin({v})" for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()])
    )
)

Question

How can I safely filter my dataframe based on dynamic variables that will only be available at execution time?

This dataset has 'Year' and 'Month', but another may have 'Year', 'Month', 'DayofYear', and 'PostalDistrict'.


1 Answer


You can use reduce:

from functools import reduce
from pyspark.sql.functions import col

conditions = [col(k).isin(v) for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()]

# fold the list of Column conditions into a single expression with &
reduce(lambda a, b: a & b, conditions)

# or if you want to do it with style...
from operator import and_
reduce(and_, conditions)
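In the pipeline itself, the dict of partition values can be built from delta_df at run time instead of being hardcoded. A sketch, assuming only that the list of partition columns is known (the names partition_cols and partition_values are illustrative):

# partition columns for this dataset; another dataset may supply a different list
partition_cols = ['Year', 'Month']

# collect the distinct values of each partition column from the delta frame
partition_values = {
    c: [row[c] for row in delta_df.select(c).distinct().collect()]
    for c in partition_cols
}

filtered_fact_df = fact_df.filter(
    reduce(and_, [col(k).isin(v) for k, v in partition_values.items()])
)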

PS: another angle on this: could a semi-join work?

fact_df.join(delta_df, ['Year', 'Month'], 'semi')
– mck
  • thank you, the reduce looks promising and the `semi` join looks really clean. Could you explain how semi works? – Umar.H Dec 30 '20 at 14:27
  • semi join keeps the rows in the left table where the join condition is met. There is also anti-join which keeps the rows where the join condition is not met. – mck Dec 30 '20 at 14:27
  • assuming that the `on` clause is missing the `primary_key` for this table, would this not cause a product? – Umar.H Dec 30 '20 at 14:28
  • No, it's slightly different from an inner join, it's not really a join for what it's worth – mck Dec 30 '20 at 14:29
  • 1
    See the bottom of this link for details: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html – mck Dec 30 '20 at 14:29
  • no rows will be duplicated in a semi/anti join – mck Dec 30 '20 at 14:30
  • realised my error in what I was asking, and that my dynamic method would of course return more rows than intended; `left_semi` + `union` is what I needed – Umar.H Dec 30 '20 at 16:13
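A sketch of the `left_semi` + `union` pattern the last comment settles on, using the column names from the question; the row_number dedupe from the question then runs on the result:

# keep only the fact-table rows whose (Year, Month) partition appears in the
# delta; a semi-join never duplicates left-side rows
touched = fact_df.join(delta_df, ['Year', 'Month'], 'left_semi')

# bring in the incoming delta rows; the row_number dedupe then picks one row per id
candidates = touched.unionByName(delta_df)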