-3

I have a spark dataframe and want to drop only the last column.

I tried

df.drop(df.columns.last)` 

but got error AttributeError: 'list' object has no attribute 'last'.

I also tried:

df = df.drop(df.columns[-1])

but this dropped all columns with that has same name as last.

Using Spark 2.4

pault
  • 32,557
  • 9
  • 66
  • 110
Joe
  • 8,787
  • 14
  • 63
  • 129
  • It is better to drop by name. `withColumn` can alter the order of the columns – Salim Jan 23 '20 at 16:46
  • @Joe I would recommend the following: 1) Save the column names to a list: `colnames = df.columns` 2) rename the columns so the names are unique: `df = df.toDF(*range(colnames))` 3) drop the last column `df = df.drop(df.columns[-1])` 4) rename the columns back to the original: `df = df.toDF(*cols[:-1])`. Ping me if the question is reopened and I will post an answer. – pault Jan 23 '20 at 18:26
  • 1
    @pault I reopened the question – Ram Ghadiyaram Jan 23 '20 at 19:32

2 Answers2

2

Here is an approach you can take to drop any column by index.

Suppose you had the following DataFrame:

np.random.seed(1)
data = np.random.randint(0, 10, size=(3,3))

df = spark.createDataFrame(data.astype(int).tolist(), ["a", "b", "a"])
df.show()
#+---+---+---+
#|  a|  b|  a|
#+---+---+---+
#|  5|  8|  9|
#|  5|  0|  0|
#|  1|  7|  6|
#+---+---+---+

First save the original column names.

colnames = df.columns
print(colnames)
#['a', 'b', 'a']

Then rename all of the columns in the DataFrame using range so the new column names are unique (they will simply be the column index).

df = df.toDF(*map(str, range(len(colnames))))
print(df.columns)
#['0', '1', '2']

Now drop the last column and rename the columns using the saved column names from the first step (excluding the last column).

df = df.drop(df.columns[-1]).toDF(*colnames[:-1])
df.show()
#+---+---+
#|  a|  b|
#+---+---+
#|  5|  8|
#|  5|  0|
#|  1|  7|
#+---+---+

You can easily expand this to any index, since we renamed using range.


I broke it up into steps for explaination purposes, but you can also do this more compactly as follows:

colnames = df.columns
df = df.toDF(*map(str, range(len(colnames))))\
    .drop(str(len(colnames)-1))\
    .toDF(*colnames[:-1])
pault
  • 32,557
  • 9
  • 66
  • 110
0

It is better to drop a column by name. Some operation like withColumn can alter the order of the columns. If a dataframe has duplicate names coming out from a join then refer the column by dataframe.column_name instead of referring it by "columnName" which causes ambiguity.

df3 = df1.join(df2, df1.c1 == df2.c1).drop(df2.c1)

In general df.drop(df.columnName)

Salim
  • 1,810
  • 9
  • 13