
I am using the 'preview' Google Dataproc Image 1.1 with Spark 2.0.0. One of my operations requires a Cartesian product. Since version 2.0.0 there is a configuration parameter (spark.sql.crossJoin.enabled) that, when false, prohibits Cartesian products and causes an exception to be thrown. How can I set spark.sql.crossJoin.enabled=true, preferably by using an initialization action?

zero323
Stijn

4 Answers


Spark >= 3.0

spark.sql.crossJoin.enabled is true by default (SPARK-28621).

Spark >= 2.1

You can use crossJoin:

df1.crossJoin(df2)

It makes your intention explicit and keeps the more conservative configuration in place, protecting you from unintended cross joins.

Spark 2.0

SQL properties can be set dynamically at runtime with the RuntimeConfig.set method, so you should be able to call

spark.conf.set("spark.sql.crossJoin.enabled", true)

whenever you want to explicitly allow a Cartesian product.

10465355
zero323
  • It looks like `crossJoin()` isn't available on `DataFrame`/`Dataset` prior to spark 2.1. – Rick Haffey Jan 11 '17 at 20:41
  • @RickHaffey for versions prior to Spark 2.1, use the `dataset.join(rightDataset)` API with the `spark.conf.set("spark.sql.crossJoin.enabled", true)` configuration option. This style also works with Spark 2.1, but the .crossJoin API is ideal as it's more explicit. – Garren S Mar 02 '17 at 18:23
  • if you are working in iPython `true` should be `True` – foxyblue May 18 '17 at 13:34

To change default configuration settings in Dataproc, you don't even need an init action; you can use the --properties flag when creating your cluster from the command line:

gcloud dataproc clusters create --properties spark:spark.sql.crossJoin.enabled=true my-cluster ...
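If you do specifically want an initialization action, a minimal sketch would be a shell script that appends the property to Spark's defaults file on each node (the conf path is the standard Dataproc location; treat this as an untested outline):

```shell
#!/bin/bash
# Hypothetical init action: persist the property in Spark's defaults
# so every job on the cluster picks it up.
echo "spark.sql.crossJoin.enabled true" >> /etc/spark/conf/spark-defaults.conf
```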
Dennis Huo

The TPCDS benchmark query set contains queries with CROSS JOINs, and unless you either write CROSS JOIN explicitly or dynamically set Spark's property to true with spark.conf.set("spark.sql.crossJoin.enabled", true), you will run into an exception.

The error appears on TPCDS queries 28, 61, 88, and 90 because the original query syntax from the Transaction Processing Performance Council (TPC) joins tables with commas, and Spark's default join operation is an inner join. My team decided to use CROSS JOIN rather than change Spark's default properties.

Pat

In PySpark, I think it should be

spark.conf.set("spark.sql.crossJoin.enabled", True)

Otherwise it'll give

NameError: name 'true' is not defined

David Buck
lokesh