Code:

import sparkSession.sqlContext.implicits._
val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")
table_df.show(false)

Input:

+---+------+---+
|ID |Weight|ID |
+---+------+---+
|1  |20    |1  |
|2  |200   |2  |
|3  |222   |3  |
|4  |2123  |4  |
|5  |2321  |5  |
+---+------+---+

Expected Output:

+---+------+
|ID |Weight|
+---+------+
|1  |20    |
|2  |200   |
|3  |222   |
|4  |2123  |
|5  |2321  |
+---+------+

I am using drop in the following way:

table_df.drop("ID").show(false) 

This drops both of the "ID" columns. How can I drop only the second, duplicated "ID" column here?

mike
vishalraj

2 Answers

You can use the DataFrame map method to keep only the first two fields of each row by position, which leaves out the duplicate ID column:

table_df.map(row => (row.getInt(0),row.getInt(1))).toDF("ID","Weight").show() 


+---+------+
| ID|Weight|
+---+------+
|  1|    20|
|  2|   200|
|  3|   222|
|  4|  2123|
|  5|  2321|
+---+------+

The new schema is as follows:

table_df.map(row => (row.getInt(0),row.getInt(1))).toDF("ID","Weight").schema.treeString

root
 |-- ID: integer (nullable = false)
 |-- Weight: integer (nullable = false)
suresiva

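You can rename the particular instance of the column that you intend to drop so that it has a unique name, and then drop it by that name. toDF assigns names by position, so it can rename only the second "ID".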

Sample code:

val table_df = Seq((1, 20, 1), (2, 200, 2), (3, 222, 3), (4, 2123, 4), (5, 2321, 5)).toDF("ID", "Weight", "ID")

val newColNames = Seq("ID","Weight","X1")

table_df.toDF(newColNames:_*).show(false)
+---+------+---+
|ID |Weight|X1 |
+---+------+---+
|1  |20    |1  |
|2  |200   |2  |
|3  |222   |3  |
|4  |2123  |4  |
|5  |2321  |5  |
+---+------+---+


table_df.toDF(newColNames:_*).drop("X1").show(false)
+---+------+
|ID |Weight|
+---+------+
|1  |20    |
|2  |200   |
|3  |222   |
|4  |2123  |
|5  |2321  |
+---+------+
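
The rename-then-drop steps above hardcode the new names. As a rough sketch, the unique names could also be computed from the existing column list. The object DedupeColumnNames and its methods uniqueNames and duplicateNames are hypothetical helpers, not part of Spark, and the sketch assumes that no existing column is already named like a generated one (for example "ID_2").

```scala
object DedupeColumnNames {
  // Hypothetical helper: keeps the first occurrence of each name unchanged and
  // suffixes every later repeat with its zero-based position, e.g. "ID" -> "ID_2".
  def uniqueNames(names: Seq[String]): Seq[String] =
    names.zipWithIndex.map { case (name, idx) =>
      if (names.indexOf(name) == idx) name else s"${name}_$idx"
    }

  // Hypothetical helper: the generated names, i.e. the columns to drop afterwards.
  def duplicateNames(names: Seq[String]): Seq[String] =
    uniqueNames(names).zip(names).collect {
      case (renamed, original) if renamed != original => renamed
    }

  def main(args: Array[String]): Unit = {
    val original = Seq("ID", "Weight", "ID")
    println(uniqueNames(original))    // List(ID, Weight, ID_2)
    println(duplicateNames(original)) // List(ID_2)
  }
}
```

With the DataFrame from the question, this would presumably be used as table_df.toDF(DedupeColumnNames.uniqueNames(table_df.columns.toSeq): _*).drop(DedupeColumnNames.duplicateNames(table_df.columns.toSeq): _*), relying only on the toDF and drop calls already shown above.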
Shantanu Kher