I have a spark data frame like below:
+---+------------+
| id| vectors|
+---+------------+
| 0| [1, 2, 3]|
| 1|[2, 4, null]|
| 2|[3, 2, null]|
+---+------------+
I want to do a cartesian join. But the problem with that is it adds the same column name of the right table. I want to use the cartesian joined table to calculate the cosine similarity of the two vector columns of joined table. But if the column names are same I won't be able to reference them in the cosine similarity function.
So I tried this.
joined_df=data_with_keys_df.join(data_with_keys_df.withColumnRenamed('vectors','vectors_2'))
And got the right table vector column name changed.
+---+------------+---+------------+
| id| vectors| id| vectors_2|
+---+------------+---+------------+
| 0| [1, 2, 3]| 0| [1, 2, 3]|
| 0| [1, 2, 3]| 1|[2, 4, null]|
| 0| [1, 2, 3]| 2|[3, 2, null]|
| 1|[2, 4, null]| 0| [1, 2, 3]|
| 1|[2, 4, null]| 1|[2, 4, null]|
| 1|[2, 4, null]| 2|[3, 2, null]|
| 2|[3, 2, null]| 0| [1, 2, 3]|
| 2|[3, 2, null]| 1|[2, 4, null]|
| 2|[3, 2, null]| 2|[3, 2, null]|
+---+------------+---+------------+
Now I can use a Python Udf to create a new column which would have cosine similarity of the two vector column above.
However, I still have the same id column. I don't want to use withColumnRenamed multiple times for each table before joining. So is there a way I can rename it while joining?.
I know doing it is sql query would be one such way, but that would require two additional steps of registering the data frame as temptable.
Can this be done in DataFrame syntax itself or sql is the easiest way (below)?
select a.id as id1, a.vector as vector1, b.id as id2, b.vector as vector2
from df, df