
Premise

I have an ETL pipeline where I'm upserting certain partitions on large fact tables (300-400 million rows). For simplicity's sake, this is my dataframe:

display(delta_df)
id  name   age  salary  last_modified_date  Year  Month
 1  John    30    2000          2019-06-01  2019      6  # this should stay
 2  Peter   35    1500          2018-08-02  2018      9  # duplicate record, will be removed after the union

Year and Month are my Hive partition columns.

This is my complete fact table:

display(fact_df)
id  name   age  salary  last_modified_date  Year  Month
 1  John    30    1000          2019-05-01  2019      6  # this should stay
 2  Peter   35    1500          2018-08-02  2018      9  # duplicate record
 3  Gabe    21     800          2015-02-03  2015      2  # this row should be filtered out
 4  Oscar   29    2000          2020-05-04  2020      6  # this row should be filtered out
 5  Anna    20    1200          2010-11-05  2018      9  # this should stay

Problem

Now, before doing a union and a row_number to de-dupe the data and apply any business logic, I want to read in only the partitions of the fact table that exist in my first dataframe.
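For context, a minimal sketch of that union + row_number dedupe step (assuming, as the sample data suggests, that the latest last_modified_date wins):

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# union the (partition-filtered) fact rows with the delta rows,
# then keep only the most recently modified row per id
w = Window.partitionBy('id').orderBy(col('last_modified_date').desc())
deduped = (fact_df.unionByName(delta_df)
           .withColumn('rn', row_number().over(w))
           .filter(col('rn') == 1)
           .drop('rn'))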

I know I can do this quite manually using `isin` method calls.

However, as this is part of an ETL pipeline, I need to make it dynamic:

years = [r['Year'] for r in delta_df.select('Year').distinct().collect()]
months = [r['Month'] for r in delta_df.select('Month').distinct().collect()]

fact_df.filter(col('Year').isin(years) & col('Month').isin(months))

I've attempted to unpack a dictionary and pass it in, but I can't figure out how to chain the `&`:

from pyspark.sql.functions import col

[col(k).isin(v) for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()]
out:
[Column<b'(Year IN (2019, 2020))'>, Column<b'(Month IN (4, 5))'>]

I guess I could use eval and build up the code as a string, but that feels wrong, possibly dangerous, and tedious when dealing with datetime objects:

fact_df.filter(
    eval(
        '&'.join([f"col('{k}').isin({v})" for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()])
    )
)

Question

How can I safely filter my dataframe based on dynamic variables that will only be available at execution time?

This dataset has 'Year' and 'Month', but another may have 'Year', 'Month', 'DayofYear', and 'PostalDistrict'.


1 Answer


You can use reduce:

from functools import reduce
from pyspark.sql.functions import col

conditions = [col(k).isin(v) for k, v in {'Year': [2019, 2020], 'Month': [4, 5]}.items()]

# fold the list of Column conditions into a single expression with &
reduce(lambda a, b: a & b, conditions)

# or if you want to do it with style...
from operator import and_
reduce(and_, conditions)
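In the pipeline itself, the dict of partition values can be built from delta_df at run time instead of being hardcoded. A sketch, assuming only that the list of partition columns is known (the names partition_cols and partition_values are illustrative):

# partition columns for this dataset; another dataset may supply a different list
partition_cols = ['Year', 'Month']

# collect the distinct values of each partition column from the delta frame
partition_values = {
    c: [row[c] for row in delta_df.select(c).distinct().collect()]
    for c in partition_cols
}

filtered_fact_df = fact_df.filter(
    reduce(and_, [col(k).isin(v) for k, v in partition_values.items()])
)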

PS: another angle on this: could a semi-join work?

fact_df.join(delta_df, ['Year', 'Month'], 'semi')
– mck
  • thank you, the reduce looks promising and the `semi` join looks really clean. Could you explain how semi works? – Umar.H Dec 30 '20 at 14:27
  • semi join keeps the rows in the left table where the join condition is met. There is also anti-join which keeps the rows where the join condition is not met. – mck Dec 30 '20 at 14:27
  • assuming that the `on` clause is missing the `primary_key` for this table, would this not cause a product? – Umar.H Dec 30 '20 at 14:28
  • No, it's slightly different from an inner join, it's not really a join for what it's worth – mck Dec 30 '20 at 14:29
  • 1
    See the bottom of this link for details: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html – mck Dec 30 '20 at 14:29
  • no rows will be duplicated in a semi/anti join – mck Dec 30 '20 at 14:30
  • realised my error in what I was asking, and that my dynamic method would of course return more rows than intended; `left_semi` + `union` is what I needed – Umar.H Dec 30 '20 at 16:13
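A sketch of the `left_semi` + `union` pattern the last comment settles on, using the column names from the question; the row_number dedupe from the question then runs on the result:

# keep only the fact-table rows whose (Year, Month) partition appears in the
# delta; a semi-join never duplicates left-side rows
touched = fact_df.join(delta_df, ['Year', 'Month'], 'left_semi')

# bring in the incoming delta rows; the row_number dedupe then picks one row per id
candidates = touched.unionByName(delta_df)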