Renaming the Columns in Self-Join of data frame in Spark

Question

I have a spark data frame like below:

+---+------------+
| id|     vectors|
+---+------------+
|  0|   [1, 2, 3]|
|  1|[2, 4, null]|
|  2|[3, 2, null]|
+---+------------+

I want to do a cartesian join. But the problem with that is it adds the same column name of the right table. I want to use the cartesian joined table to calculate the cosine similarity of the two vector columns of joined table. But if the column names are same I won't be able to reference them in the cosine similarity function.

So I tried this.

joined_df=data_with_keys_df.join(data_with_keys_df.withColumnRenamed('vectors','vectors_2'))

And got the right table vector column name changed.

+---+------------+---+------------+
| id|     vectors| id|   vectors_2|
+---+------------+---+------------+
|  0|   [1, 2, 3]|  0|   [1, 2, 3]|
|  0|   [1, 2, 3]|  1|[2, 4, null]|
|  0|   [1, 2, 3]|  2|[3, 2, null]|
|  1|[2, 4, null]|  0|   [1, 2, 3]|
|  1|[2, 4, null]|  1|[2, 4, null]|
|  1|[2, 4, null]|  2|[3, 2, null]|
|  2|[3, 2, null]|  0|   [1, 2, 3]|
|  2|[3, 2, null]|  1|[2, 4, null]|
|  2|[3, 2, null]|  2|[3, 2, null]|
+---+------------+---+------------+

Now I can use a Python Udf to create a new column which would have cosine similarity of the two vector column above.

However, I still have the same id column. I don't want to use withColumnRenamed multiple times for each table before joining. So is there a way I can rename it while joining?.

I know doing it is sql query would be one such way, but that would require two additional steps of registering the data frame as temptable.

Can this be done in DataFrame syntax itself or sql is the easiest way (below)?

select a.id as id1, a.vector as vector1, b.id as id2, b.vector as vector2
from df, df

I'm not sure that I understand your question quite well. Would you care reformulating the question ? — eliasah, Jan 06 '17 at 09:32
withColumnRenamed is the way which does what your final query achieves by the alias ("as") statement. If you want to compute a similarity-join using cosine simimlarity on bigger data, consider using DIMSUM, if your data meets the requirements. — Elmar Macek, Jan 06 '17 at 15:31
Thanks to @alberto-bonsanto http://stackoverflow.com/a/34077809/6626530 check this answer for multiple options for renaming data frame columns — Shijo, Jan 06 '17 at 17:04

Renaming the Columns in Self-Join of data frame in Spark

0 Answers0