I have a piece of Java code that uses Apache Spark to join two DataFrames. The join type depends on a VM argument: -DearlyData=TRUE selects an inner join, and -DearlyData=FALSE selects a leftanti join. (Technically, TRUE selects the inner join and any other value selects the leftanti join.)

This is a simplified version of my code:


String earlyData = System.getProperty(Constants.EARLY_DATA);
// ...
    log.trace("Running Early Data");
    // ...
            drop(Constants.AA, Constants.CC));
// ...
    log.trace("Running Late Data");
    // ...
            .and(earlyDF.col(CC).equalTo(example.col(DD))), "leftanti")
            .drop(Constants.AA, Constants.CC));


My code works, but my question is this:

  • Should I use an environment variable or a VM argument for the String earlyData?
  • Are there drawbacks or unforeseen complications of using one versus the other in a conditional like this?

1 Answer


Based on the information provided here by user Jose Martinez, a VM argument is correct for this use case.

To elaborate, I have one cron job that runs the inner join in the morning by passing -DearlyData=TRUE to retrieve early data, and another cron job that runs the leftanti join in the evening by passing -DearlyData=FALSE in the script to retrieve late data.
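
For anyone comparing the two options, here is a minimal sketch of how the flag could be read either way. It is not my production code: the class name JoinTypeSelector, the environment variable name EARLY_DATA, the fallback order, and the case-insensitive comparison are all assumptions made for illustration. Only the earlyData property name and the two join type strings come from the question. Spark is left out so the snippet runs on its own.

```java
final class JoinTypeSelector {

    // "earlyData" matches the -DearlyData flag from the question.
    static final String PROPERTY_NAME = "earlyData";
    // Hypothetical environment variable name, used only for this sketch.
    static final String ENV_NAME = "EARLY_DATA";

    private JoinTypeSelector() {}

    /** Maps the raw flag value to the join type string passed to Dataset.join. */
    static String joinTypeFor(String rawValue) {
        // An unset flag (null) or any value other than "true" selects the late-data join.
        // Ignoring case and surrounding whitespace is an assumption, not part of the question.
        boolean early = rawValue != null && rawValue.trim().equalsIgnoreCase("true");
        return early ? "inner" : "leftanti";
    }

    /** Reads the VM argument first, then falls back to the environment variable. */
    static String resolveFlag() {
        String value = System.getProperty(PROPERTY_NAME); // set with -DearlyData=...
        if (value == null) {
            value = System.getenv(ENV_NAME); // set in the shell or crontab
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println("Join type: " + joinTypeFor(resolveFlag()));
    }
}
```

One practical difference: a system property is scoped to the single JVM launched with that -D flag, while an environment variable is inherited by every process started from the same shell or crontab entry. With two cron jobs that differ only in this flag, keeping it on the java command line makes each script self-describing.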
