Here is an approach you can take to drop any column by index.
Suppose you had the following DataFrame:
np.random.seed(1)
data = np.random.randint(0, 10, size=(3,3))
df = spark.createDataFrame(data.astype(int).tolist(), ["a", "b", "a"])
df.show()
#+---+---+---+
#| a| b| a|
#+---+---+---+
#| 5| 8| 9|
#| 5| 0| 0|
#| 1| 7| 6|
#+---+---+---+
First save the original column names.
colnames = df.columns
print(colnames)
#['a', 'b', 'a']
Then rename all of the columns in the DataFrame using range
so the new column names are unique (they will simply be the column index).
df = df.toDF(*map(str, range(len(colnames))))
print(df.columns)
#['0', '1', '2']
Now drop the last column and rename the columns using the saved column names from the first step (excluding the last column).
df = df.drop(df.columns[-1]).toDF(*colnames[:-1])
df.show()
#+---+---+
#| a| b|
#+---+---+
#| 5| 8|
#| 5| 0|
#| 1| 7|
#+---+---+
You can easily expand this to any index, since we renamed using range.
I broke it up into steps for explanation purposes, but you can also do this more compactly as follows:
colnames = df.columns
df = df.toDF(*map(str, range(len(colnames))))\
.drop(str(len(colnames)-1))\
.toDF(*colnames[:-1])