Drop duplicate column with same values from spark dataframe

Question

Code:

import sparkSession.sqlContext.implicits._
val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")
table_df.show(false)

Input:

+---+------+---+
|ID |Weight|ID |
+---+------+---+
|1  |20    |1  |
|2  |200   |2  |
|3  |222   |3  |
|4  |2123  |4  |
|5  |2321  |5  |
+---+------+---+

Expected Output:

+---+------+
|ID |Weight|
+---+------+
|1  |20    |
|2  |200   |
|3  |222   |
|4  |2123  |
|5  |2321  |
+---+------+

I am using drop in the following way:

table_df.drop("ID").show(false)

This drops both "ID" columns. How can I drop only the second, duplicated "ID" column?

you could rename the column first and then drop the newly named column. — mike, Jul 16 '20 at 12:20

score 2 · Answer 1 · answered Jul 16 '20 at 13:39

You can use the DataFrame map method to keep only the first two fields of each row, which discards the duplicate ID column:

table_df.map(row => (row.getInt(0), row.getInt(1))).toDF("ID", "Weight").show()

+---+------+
| ID|Weight|
+---+------+
|  1|    20|
|  2|   200|
|  3|   222|
|  4|  2123|
|  5|  2321|
+---+------+

The resulting schema is:

table_df.map(row => (row.getInt(0), row.getInt(1))).toDF("ID", "Weight").printSchema()

root
 |-- ID: integer (nullable = false)
 |-- Weight: integer (nullable = false)

score 0 · Answer 2 · answered Jul 16 '20 at 21:44

You can rename the instance of the column that you intend to drop, and then drop it by its new name.

Sample code:

val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")

val newColNames = Seq("ID","Weight","X1")

table_df.toDF(newColNames:_*).show(false)
+---+------+---+
|ID |Weight|X1 |
+---+------+---+
|1  |20    |1  |
|2  |200   |2  |
|3  |222   |3  |
|4  |2123  |4  |
|5  |2321  |5  |
+---+------+---+


table_df.toDF(newColNames:_*).drop("X1").show(false)
+---+------+
|ID |Weight|
+---+------+
|1  |20    |
|2  |200   |
|3  |222   |
|4  |2123  |
|5  |2321  |
+---+------+
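
When several names are duplicated, the same rename-then-drop idea can be applied by position instead of writing the new names by hand. Here is a minimal sketch, assuming a hypothetical helper dedupeColumnNames (it is not part of Spark) and assuming that the generated aliases, such as ID_dup_2, do not clash with any existing column name:

```scala
// Hypothetical helper, not a Spark API: keeps the first occurrence of each
// column name and gives every later occurrence a unique positional alias.
// Returns (all column names after renaming, aliases that should be dropped).
def dedupeColumnNames(names: Seq[String]): (Seq[String], Seq[String]) = {
  val seen = scala.collection.mutable.Set.empty[String]
  val renamed = names.zipWithIndex.map { case (name, idx) =>
    if (seen.add(name)) name      // add returns true the first time a name is seen
    else s"${name}_dup_$idx"      // assumed alias pattern for repeated names
  }
  // Only the columns whose name changed are duplicates to drop.
  val toDrop = renamed.zip(names).collect { case (r, n) if r != n => r }
  (renamed, toDrop)
}
```

With the table_df defined above, this could be used as val (renamed, toDrop) = dedupeColumnNames(table_df.columns.toSeq) followed by table_df.toDF(renamed: _*).drop(toDrop: _*).show(false), which should leave only the first ID column and Weight. Like the manual version, it assumes the duplicated columns really do hold the same values, since only the name is checked.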
