-1

I am using PySpark 2.4.0 and I have a dataframe with the columns below:

a,b,b
0,1,1.0
1,2,2.0

Without doing any join, I have to keep only one of the b columns and remove the other.

How can I achieve this?

Naveen Srikanth
  • You have to avoid this, because a column selection by name is simply not possible when you have duplicates. If this is the result of a join, you can define prefixes or suffixes for the column names. That way you have a unique selector for 'b' – b0lle Jul 16 '20 at 05:36
  • https://stackoverflow.com/a/33779190/8386455 – b0lle Jul 16 '20 at 05:38

4 Answers

1

Perhaps this is helpful -


val df = Seq((0, 1, 1.0), (1, 2, 2.0)).toDF("a", "b", "b")
df.show(false)
df.printSchema()

/**
  * +---+---+---+
  * |a  |b  |b  |
  * +---+---+---+
  * |0  |1  |1.0|
  * |1  |2  |2.0|
  * +---+---+---+
  *
  * root
  * |-- a: integer (nullable = false)
  * |-- b: integer (nullable = false)
  * |-- b: double (nullable = false)
  */

df.toDF("a", "b", "b2").drop("b2").show(false)

/**
  * +---+---+
  * |a  |b  |
  * +---+---+
  * |0  |1  |
  * |1  |2  |
  * +---+---+
  */
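
Since the question itself is PySpark, here is a minimal sketch of the same rename-and-drop idea in PySpark (the name "b2" is only an illustrative placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DataFrame with a duplicated column name, as in the question
df = spark.createDataFrame([(0, 1, 1.0), (1, 2, 2.0)], ["a", "b", "b"])

# Rename every column positionally so each name is unique, then drop the duplicate
df.toDF("a", "b", "b2").drop("b2").show()
# +---+---+
# |  a|  b|
# +---+---+
# |  0|  1|
# |  1|  2|
# +---+---+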
Som
  • I have around 200 columns, with a few duplicated pairs like this. Yes, one option is to rename and drop, but some manual effort is still required: I have to identify and rename the list of column names accordingly – Naveen Srikanth Jul 17 '20 at 11:02
  • I think it's easy to do for multiple columns. Give it a try – Som Jul 17 '20 at 12:53
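
For the many-columns case discussed in the comments above, a sketch (assuming the dataframe is already loaded as df) that renames duplicated names programmatically and then drops the renamed copies; the "_dup" suffix is just an illustrative choice:

# Walk the columns in order, keep the first occurrence of each name,
# and give later occurrences a unique temporary name.
seen = {}
new_names = []
dupes = []
for name in df.columns:
    if name in seen:
        seen[name] += 1
        renamed = name + "_dup" + str(seen[name])
        new_names.append(renamed)
        dupes.append(renamed)
    else:
        seen[name] = 0
        new_names.append(name)

# Apply the unique names positionally, then drop all renamed duplicates in one go
deduped = df.toDF(*new_names).drop(*dupes)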
1

I have been in the same situation when I did a join. Good practice is to rename the columns before joining the tables; you can refer to this link:

Spark Dataframe distinguish columns with duplicated name

Selecting one column out of two columns with the same name is confusing, so the good way to do it is to not have columns with the same name in one dataframe.
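
A small PySpark sketch of that advice, assuming two hypothetical dataframes df1 and df2 that both contain a column named "b" and share a join key "a": rename the clashing column on one side before joining, so the result has unique names.

# Rename the overlapping column before the join,
# so the joined result never holds two columns named "b".
df2_renamed = df2.withColumnRenamed("b", "b_right")

joined = df1.join(df2_renamed, on="a", how="inner")
joined.select("a", "b", "b_right").show()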

Young
0

Try this:

col_select = list(set(df.columns))
df_fin = df.select(col_select)
Raghu
0

This may help you,

Convert your DataFrame into an RDD, extract the fields you want, and convert it back into a DataFrame:

from pyspark.sql import Row

rdd = df.rdd.map(lambda l: Row(a=l[0], b=l[1]))

required_df = spark.createDataFrame(rdd)
required_df.show()

+---+---+
|  a|  b|
+---+---+
|  0|  1|
|  1|  2|
+---+---+
Sathiyan S