
I have a piece of Java code that uses Apache Spark to join two DataFrames. The join type depends on a VM argument: -DearlyData=TRUE selects an inner join, and -DearlyData=FALSE (technically, any value other than TRUE) selects a leftanti join.

This is a simplified version of my code:

```java
String earlyData = System.getProperty(Constants.EARLY_DATA);
// Constant-first comparison avoids a NullPointerException when the property is unset.
if ("TRUE".equalsIgnoreCase(earlyData)) {
    log.trace("Running Early Data");
    DataBo.processData(earlyDF.join(cassandraDF,
            earlyDF.col(AA).equalTo(example.col(BB))
                .and(earlyDF.col(CC).equalTo(example.col(DD))), "inner")
            .drop(Constants.AA, Constants.CC));
} else {
    log.trace("Running Late Data");
    DataBo.processData(earlyDF.join(cassandraDF,
            earlyDF.col(AA).equalTo(example.col(BB))
                .and(earlyDF.col(CC).equalTo(example.col(DD))), "leftanti")
            .drop(Constants.AA, Constants.CC));
}
```

My code works, but my question is this:

  • Should I use an Environment Variable or a VM Argument for the String earlyData?
  • Are there drawbacks or unforeseen complications of using one versus the other in a conditional like this?
Jeremy

1 Answer


Based on the information provided here by user Jose Martinez, a VM argument is correct for this use case.

To elaborate, I have one cron job that runs the inner join in the morning, passing -DearlyData=TRUE to retrieve early data, and a second cron job that runs the leftanti join in the evening, passing -DearlyData=FALSE in its script for late data.
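
For illustration, here is a minimal sketch of how the flag could be turned into a join type in one place, so that the two branches collapse into a single `join` call. The class name `JoinTypeResolver` and the literal property key `"earlyData"` are assumptions made for this example; my real code reads the key from `Constants.EARLY_DATA`. The sketch leaves Spark out so that it runs on its own:

```java
// Hypothetical helper, not part of the original code base.
class JoinTypeResolver {

    // Assumed property key, matching the -DearlyData=... flag used by the cron scripts.
    static final String EARLY_DATA_KEY = "earlyData";

    // Maps the flag value to a Spark join type string.
    // "TRUE" (case-insensitive) selects "inner"; anything else,
    // including null when the flag is absent, selects "leftanti".
    static String resolveJoinType(String earlyData) {
        return "TRUE".equalsIgnoreCase(earlyData) ? "inner" : "leftanti";
    }

    public static void main(String[] args) {
        // System.getProperty returns null if -DearlyData was not passed to the JVM.
        String earlyData = System.getProperty(EARLY_DATA_KEY);
        System.out.println("Join type: " + resolveJoinType(earlyData));
    }
}
```

With something like this, the resolved string could be passed as the third argument of `earlyDF.join(...)`, and each cron script only needs to differ in the `-DearlyData=TRUE` or `-DearlyData=FALSE` flag it passes to `java`.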

Jeremy