Drop data frame columns by name

Question

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:

df$x <- NULL

But I was hoping to do this with fewer commands.

Also, I know that I could drop columns using integer indexing like this:

df <- df[ -c(1, 3:6, 12) ]

But I am concerned that the relative position of my variables may change.

Given how powerful R is, I figured there might be a better way than dropping each column one by one.

Can someone explain to me why R doesn't have something simple like `df#drop(var_name)`, and instead, we need to do these complicated work-arounds? — ifly6, Apr 20 '18 at 17:16
@ifly6 The 'subset()' function in R is about as parsimonious as the 'drop()' function in Python, except you don't need to specify the axis argument... I agree that it's annoying that there can't be just one, ultimate, easy keyword/syntax implemented across the board for something so basic as dropping a column. — Paul Sochacki, Sep 03 '18 at 08:42

score 969 · Accepted Answer · edited Feb 26 '16 at 20:57

You can use a simple character vector of names:

DF <- data.frame(
  x=1:10,
  y=10:1,
  z=rep(5,10),
  a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]

Or, alternatively, you can make a vector of those to keep and refer to them by name:

keeps <- c("y", "a")
DF[keeps]

EDIT: For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:

keeps <- "y"
DF[ , keeps, drop = FALSE]

drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.
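
A quick way to see the difference between the three indexing forms, as a minimal sketch on a toy data frame (the name `DF2` is illustrative and not part of the answer):

```r
# Toy data frame; DF2 is an illustrative name.
DF2 <- data.frame(x = 1:3, y = 3:1)

v  <- DF2[, "y"]                 # matrix-style, drop defaults to TRUE
d1 <- DF2[, "y", drop = FALSE]   # matrix-style, dimensions kept
d2 <- DF2["y"]                   # list-style, always a data frame

is.data.frame(v)    # FALSE: a plain integer vector
is.data.frame(d1)   # TRUE
is.data.frame(d2)   # TRUE
```

The list-style form needs no extra argument, which is why it is often the safer default when the number of remaining columns is not known in advance.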

edited Feb 26 '16 at 20:57

Henrik

answered Jan 05 '11 at 14:40

Joris Meys

the subset function works better as it won't convert a data frame with one column into a vector – mut1na Jun 28 '13 at 09:06
@mut1na check the argument drop=FALSE of the indexing function. – Joris Meys Jun 28 '13 at 10:10
Shouldn't that be `DF[,keeps]` instead of `DF[keeps]` ? – lindelof Oct 28 '14 at 13:53
@lindelof No. It can, but then you have to add drop=FALSE to keep R from converting your data frame to a vector if you only select a single column. Don't forget that data frames are lists, so list selection (one-dimensional like I did) works perfectly well and always returns a list. Or a data frame in this case, which is why I prefer to use it. – Joris Meys Oct 28 '14 at 19:05
@AjayOhri Yes, it would. Without a comma, you use the "list" way of selecting, which means that even when you extract a single column, you still get a data frame returned. If you use the "matrix" way, as you do, you should be aware that if you only select a single column, you get a vector instead of a data frame. To avoid that, you need to add drop=FALSE. As explained in my answer, and in the comment right above yours... – Joris Meys Jul 07 '15 at 13:55
Why not use `DF[!(names(DF) %in% drops)]`? By what has been said in these comments, it should be equivalent to using `drop=FALSE`, which is probably what we want anyway. – J. Mini Mar 19 '21 at 21:56

score 486 · Answer 2 · edited Apr 08 '18 at 01:02

There's also the subset command, useful if you know which columns you want:

df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))

UPDATED after comment by @hadley: To drop columns a,c you could do:

df <- subset(df, select = -c(a, c))
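
`subset()` takes unquoted column names, which is convenient at the console but awkward when the names to drop are held in a variable or when the call sits inside another function. A minimal sketch of a programmatic alternative follows; the helper name `drop_cols` and the variable `to_drop` are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: drop columns whose names are stored in a character vector.
drop_cols <- function(df, cols) {
  # setdiff() ignores names that are not present in df;
  # drop = FALSE keeps the result a data frame even with one column left.
  df[, setdiff(names(df), cols), drop = FALSE]
}

df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
to_drop <- c("a", "c")
res <- drop_cols(df, to_drop)
names(res)   # "b"
```

This uses only standard evaluation, so it behaves the same interactively and inside other functions.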

edited Apr 08 '18 at 01:02

Max Ghenis

answered Jan 05 '11 at 14:52

Prasad Chalasani

I really wish the R `subset` function had an option like "allbut = FALSE", which "inverts" the selection when set to TRUE, i.e. retains all columns *except* those in the `select` list. – Prasad Chalasani Jan 05 '11 at 14:56
@prasad, see @joris answer below. A subset without any subset criteria is a bit of overkill. Try simply: `df[c("a", "c")]` – JD Long Jan 05 '11 at 15:16
@JD I knew that, but I like the syntactic convenience of the `subset` command where you don't need to put quotes around the column names -- I guess I don't mind typing a few extra characters just to avoid quoting names :) – Prasad Chalasani Jan 05 '11 at 15:18
oh that's a good point. I hadn't thought about the quote issue. – JD Long Jan 05 '11 at 15:24
Note that you shouldn't use `subset` inside other functions. – Ari B. Friedman Oct 03 '12 at 14:42
@mac http://stackoverflow.com/questions/12850141/programming-safe-version-of-subset-to-evaluate-its-condition-while-called-from/12852005#12852005 – Ari B. Friedman Sep 30 '13 at 18:33
subset(df, select = -c(b)) @spacetyper – Puriney Apr 26 '19 at 19:24
I was using this, but I am getting an error now: `Error: unexpected '=' in "DT – Tom Jul 30 '20 at 08:04

Max Ghenis · Answer 3 · 2018-12-14T06:54:18.493

228

within(df, rm(x))

is probably easiest, or for multiple variables:

within(df, rm(x, y))
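
Note that `within()` returns a modified copy, so the result has to be assigned. A minimal sketch on toy data (the column values and the name `res` are made up for illustration):

```r
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

res <- within(df, rm(x, y))   # returns a new data frame without x and y

names(res)   # "z"
names(df)    # "x" "y" "z": the original is unchanged until reassigned
```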

Or if you're dealing with data.tables (per How do you delete a column by name in data.table?):

dt[, x := NULL]   # Deletes column x by reference instantly.

dt[, !"x"]   # Selects all but x into a new data.table.

or for multiple variables

dt[, c("x","y") := NULL]

dt[, !c("x", "y")]

edited Dec 14 '18 at 06:54

answered Sep 28 '13 at 22:28

Max Ghenis

`within(df, rm(x))` is *by far* the cleanest solution. Given that this is a possibility, every other answer seems unnecessarily complicated by an order of magnitude. – Miles Erickson Oct 02 '15 at 01:00
Note that `within(df, rm(x))` will _not_ work if there are duplicate columns named `x` in `df`. – MichaelChirico Jul 15 '16 at 19:51
@MichaelChirico to clarify, it removes neither but seems to change the data's values. One has bigger problems if this is the case, but here's an example: `df – Max Ghenis Mar 10 '17 at 22:23
@MilesErickson Problem is that you rely on a function `within()` which is powerful but also uses NSE. The note on the help page states clearly that for programming sufficient care should be used. – Joris Meys Dec 13 '18 at 13:45
@MilesErickson How often would one encounter a dataframe with duplicate names in it? – HSchmale Jan 03 '19 at 19:26
@HSchmale `df – J. Mini Mar 18 '21 at 01:43
Two other major benefits of `within`: You do not need to pass a `drop=FALSE` argument and `rm` will warn you if the column that you are attempting to delete is missing. `[` is not so kind. – J. Mini Apr 04 '21 at 20:09

score 125 · Answer 4 · answered Jan 05 '11 at 14:40

You could use %in% like this:

df[, !(colnames(df) %in% c("x","bar","foo"))]

answered Jan 05 '11 at 14:40

Joshua Ulrich

Am I missing something, or is this effectively the same solution as the first part of Joris' answer? `DF[ , !(names(DF) %in% drops)]` – Daniel Fletcher Apr 28 '16 at 05:46
@DanielFletcher: it's the same. Look at the timestamps on the answers. We answered at the same time... 5 years ago. :) – Joshua Ulrich Apr 28 '16 at 13:01
Nutty. `identical(post_time_1, post_time_2) [1] TRUE` =D – Daniel Fletcher Apr 30 '16 at 02:47
Why not drop the comma? I see no reason why `df[!(colnames(df) %in% c("x","bar","foo"))]` would not be equivalent. – J. Mini Mar 19 '21 at 22:19

score 59 · Answer 5 · edited Apr 13 '16 at 20:24

list(NULL) also works:

dat <- mtcars
colnames(dat)
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
# [11] "carb"
dat[,c("mpg","cyl","wt")] <- list(NULL)
colnames(dat)
# [1] "disp" "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
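
The same idea works with the names held in a variable and with list-style indexing (no comma). This is a minimal sketch; the variable name `drops` is borrowed from the accepted answer, and plain `NULL` is used on the right-hand side, as one of the comments suggests is sufficient.

```r
dat <- mtcars
drops <- c("mpg", "cyl", "wt")

dat[drops] <- NULL   # list-style assignment removes all named columns at once

colnames(dat)
# [1] "disp" "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
```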

edited Apr 13 '16 at 20:24

MichaelChirico

answered Feb 12 '14 at 05:34

Vincent

Brilliant! This extends the NULL assignment to a single column in a natural way, and (seemingly) avoids copying (although I don't know what happens under the hood so it may be no more efficient in memory usage ... but seems to me clearly more efficient syntactically.) – c-urchin May 20 '14 at 16:15
You do not need list(NULL), NULL is sufficient. e.g: dat[,4]=NULL – CousinCocaine Jul 07 '14 at 08:29
OP's question was how to delete multiple columns. dat[,4:5] – Vincent Sep 16 '14 at 00:01
This also doesn't work when trying to remove a duplicated column name. – MichaelChirico Jul 15 '16 at 19:58
@MichaelChirico Works fine for me. Either give a label if you want to remove the first of the columns with the same name or give indices for each column you want to remove. If you have an example where it doesn't work I'd be interested to see it. Perhaps post it as a new question? – Vincent Jul 15 '16 at 22:47
This syntax will also work for `data.table` as well. `dat[,c("mpg","cyl","wt")] – Bear Nov 20 '18 at 20:03

mnel · Answer 6 · 2014-05-21T01:17:06.173

If you want to remove the columns by reference and avoid the internal copying associated with data.frames, you can use the data.table package and the := operator.

You can pass a character vector of names to the left-hand side of the := operator, and NULL as the RHS.

library(data.table)

df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# or more simply  DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10) #

DT[, c('a','b') := NULL]

If you want to predefine the names as a character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope rather than as a column name within the scope of DT.

del <- c('a','b')
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, (del) := NULL]
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, {del} := NULL]
# force or `c` would also work.

You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!

df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)

# drop `a` from df (no copying involved)

set(df, j = 'a', value = NULL)
# drop `b` from DT (no copying involved)
set(DT, j = 'b', value = NULL)

IRTFM · Answer 7 · 2014-07-24T16:45:59.397

42

There is a potentially more powerful strategy based on the fact that grep() will return a numeric vector. If you have a long list of variables, as I do in one of my datasets, with some variables that end in ".A" and others that end in ".B", and you only want the ones that end in ".A" (along with all the variables that don't match either pattern), do this:

dfrm2 <- dfrm[ , -grep("\\.B$", names(dfrm)) ]

For the case at hand, using Joris Meys' example, it might not be as compact, but it would be:

DF <- DF[, -grep( paste("^",drops,"$", sep="", collapse="|"), names(DF) )]
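
One caveat with negating the result of `grep()`: when the pattern matches nothing, `grep()` returns `integer(0)`, and negating that still selects zero columns rather than all of them. A minimal sketch of the pitfall and of a logical-mask alternative using `grepl()` follows; the toy data frame and the names `idx`, `none`, `keep` and `safe` are illustrative only.

```r
dfrm <- data.frame(a.A = 1:2, b.A = 3:4)   # toy data: no column ends in ".B"

idx  <- grep("\\.B$", names(dfrm))         # integer(0): nothing matched
none <- dfrm[, -idx, drop = FALSE]         # -integer(0) selects zero columns
ncol(none)                                 # 0

keep <- !grepl("\\.B$", names(dfrm))       # logical mask, all TRUE here
safe <- dfrm[, keep, drop = FALSE]
ncol(safe)                                 # 2
```

The logical mask gives the intended result whether or not any column matches.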

edited Jul 24 '14 at 16:45

answered Jan 05 '11 at 21:50

IRTFM

If we define `drops` in the first place as `paste0("^", drop_cols, "$")`, this becomes much nicer (read: more compact) with `sapply`: `DF[ , -sapply(drops, grep, names(DF))]` – MichaelChirico Apr 13 '16 at 20:31

score 34 · Answer 8 · answered Nov 22 '14 at 20:37

Another dplyr answer. If your variables have some common naming structure, you might try starts_with(). For example:

library(dplyr)
# values are random, so your numbers will differ
df <- data.frame(var2 = rnorm(3), char1 = rnorm(3), var4 = rnorm(3),
                 var3 = rnorm(3), char2 = rnorm(3), var1 = rnorm(3))
df
#        var2      char1        var4       var3       char2       var1
#1 -0.4629512 -0.3595079 -0.04763169  0.6398194  0.70996579 0.75879754
#2  0.5489027  0.1572841 -1.65313658 -1.3228020 -1.42785427 0.31168919
#3 -0.1707694 -0.9036500  0.47583030 -0.6636173  0.02116066 0.03983268
df1 <- df %>% select(-starts_with("char"))
df1
#        var2        var4       var3       var1
#1 -0.4629512 -0.04763169  0.6398194 0.75879754
#2  0.5489027 -1.65313658 -1.3228020 0.31168919
#3 -0.1707694  0.47583030 -0.6636173 0.03983268

If you want to drop a sequence of variables in the data frame, you can use :. For example, if you wanted to drop var2, var3, and all variables in between (here var4, going by the column order printed above), you'd just be left with var1:

df2 <- df1 %>% select(-c(var2:var3) )  
df2
#        var1
#1 0.75879754
#2 0.31168919
#3 0.03983268

Not to forget about all the other opportunities that come with `select()`, such as `contains()` or `matches()`, which also accepts regex. — ha-pu, Mar 01 '19 at 17:31

Preston · Answer 9 · 2019-04-15T21:26:41.577

Dplyr Solution

I doubt this will get much attention down here, but if you have a list of columns that you want to remove and you want to do it in a dplyr chain, you can use one_of() in the select clause.

Here is a simple, reproducible example:

library(dplyr)

undesired <- c('mpg', 'cyl', 'hp')

mtcars <- mtcars %>%
  select(-one_of(undesired))

Documentation can be found by running ?one_of or here:

http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
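
In dplyr 1.0.0 and later, `one_of()` is superseded by `any_of()` and `all_of()`. The sketch below assumes such a version is installed; the name `trimmed` and the deliberately missing column `not_a_column` are illustrative only.

```r
library(dplyr)

undesired <- c("mpg", "cyl", "hp", "not_a_column")

# any_of() ignores names that are absent; all_of() would raise an error instead.
trimmed <- mtcars %>%
  select(-any_of(undesired))

names(trimmed)
# [1] "disp" "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
```

Assigning to a new name rather than overwriting `mtcars` keeps the built-in dataset intact for later examples.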

scentoni · Answer 10 · 2012-01-11T19:07:36.830

26

Another possibility:

df <- df[, setdiff(names(df), c("a", "c"))]

or

df <- df[, grep('^(a|c)$', names(df), invert=TRUE)]

edited Jan 11 '12 at 19:07

answered Jan 10 '12 at 23:17

scentoni

Too bad that this is not upvoted more because use of `setdiff` is the optimal especially in the case of a very large number of columns. – ctbrown Mar 25 '14 at 21:42
Another angle on this: `df – Joe Apr 21 '16 at 09:44

score 23 · Answer 11 · edited Feb 12 '17 at 07:37

DF <- data.frame(
  x=1:10,
  y=10:1,
  z=rep(5,10),
  a=11:20
)
DF

Output:

    x  y z  a
1   1 10 5 11
2   2  9 5 12
3   3  8 5 13
4   4  7 5 14
5   5  6 5 15
6   6  5 5 16
7   7  4 5 17
8   8  3 5 18
9   9  2 5 19
10 10  1 5 20

DF[c("a","x")] <- list(NULL)
DF

Output:

    y z
1  10 5
2   9 5
3   8 5
4   7 5
5   6 5
6   5 5
7   4 5
8   3 5
9   2 5
10  1 5

score 22 · Answer 12 · answered May 02 '13 at 18:42

Out of interest, this flags up one of R's many odd syntax inconsistencies. For example, given a two-column data frame:

df <- data.frame(x=1, y=2)

This gives a data frame

subset(df, select=-y)

but this gives a vector

df[,-2]

This is all explained in ?"[" but it's not exactly expected behaviour. Well, at least not to me...

score 19 · Answer 13 · edited Aug 23 '17 at 19:16

Here is a dplyr way to go about it:

#df[ -c(1,3:6, 12) ]  # original
df.cut <- df %>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6)  # with dplyr::select()

I like this because it's intuitive to read & understand without annotation and robust to columns changing position within the data frame. It also follows the vectorized idiom using - to remove elements.

edited Aug 23 '17 at 19:16

Jaap

answered Aug 27 '14 at 17:01

c.gutierrez

Adding to this: since (1) the user wants to replace the original df and (2) magrittr has the `%<>%` operator to replace the input object, it could be simplified to `df %<>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6)` – Marek Nov 23 '16 at 11:39
If you have a long list of columns to drop, with `dplyr`, it might be easier to group them and put only one minus: `df.cut <- df %>% select(-c(col.to.drop.1, col.to.drop.2, ..., col.to.drop.n))` – iNyar May 04 '17 at 06:32

score 14 · Answer 14 · answered Jan 05 '11 at 17:21

I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:

df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)

# return everything except a and c
df <- df[,-match(c("a","c"),names(df))]
df
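
One thing to keep in mind with this idiom: `match()` returns `NA` for any name that is not a column, and a negated `NA` is still `NA`, so the index no longer means "drop these columns". A minimal sketch of the issue and of a logical-mask alternative follows; the name `kept` and the missing column `"e"` are illustrative only.

```r
df <- data.frame(a = 1:10, b = 1:10, c = 1:10, d = 1:10)

pos <- match(c("a", "e"), names(df))
pos                                      # 1 NA: "e" is not a column

# A logical mask tolerates names that are absent.
kept <- df[, !(names(df) %in% c("a", "e")), drop = FALSE]
names(kept)                              # "b" "c" "d"
```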

answered Jan 05 '11 at 17:21

JD Long

Not a good idea to negate match - `df[,-match(c("e","f"),names(df))]` – hadley Jan 05 '11 at 18:33
@JDLong - What if I want to drop column where the column name starts with `-`? – Chetan Arvind Patil Jan 22 '19 at 18:04

score 12 · Answer 15 · answered Dec 04 '14 at 14:06

There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this.

BBmisc::dropNamed(df, "x")

The advantage is that it avoids repeating the data frame argument and thus is suitable for piping in magrittr (just like the dplyr approaches):

df %>% BBmisc::dropNamed("x")

score 9 · Answer 16 · answered Oct 25 '16 at 22:57

Another solution if you don't want to use @hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:

df[,-which(names(df) == "COLUMN_NAME")]

answered Oct 25 '16 at 22:57

Nick Keramaris

(1) Problem is to drop multiple columns at once. (2) It won't work if `COLUMN_NAME` is not in `df` (check yourself: `df – Marek Nov 23 '16 at 11:34
Can you give some more information about this answer? – Akash Nayak Jan 17 '18 at 13:04

sbha · Answer 17 · 2020-03-18T01:41:44.450

Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names):

library(dplyr)
starwars %>% 
  select(-(name:mass)) %>%        # the range of columns from 'name' to 'mass'
  select(-contains('color')) %>%  # any column name that contains 'color'
  select(-starts_with('bi')) %>%  # any column name that starts with 'bi'
  select(-ends_with('er')) %>%    # any column name that ends with 'er'
  select(-matches('^f.+s$')) %>%  # any column name matching the regex pattern
  select_if(~!is.list(.)) %>%     # not by column name but by data type
  head(2)

# A tibble: 2 x 2
  homeworld species
  <chr>     <chr>  
1 Tatooine  Human  
2 Tatooine  Droid

If you need to drop a column that may or may not exist in the data frame, here's a slight twist using select_if() that unlike using one_of() will not throw an Unknown columns: warning if the column name does not exist. In this example 'bad_column' is not a column in the data frame:

starwars %>% 
  select_if(!names(.) %in% c('height', 'mass', 'bad_column'))

score 4 · Answer 18 · answered Jun 15 '18 at 16:51

Provide the data frame and a string of comma separated names to remove:

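remove_features <- function(df, features) {
  # split on commas, tolerating optional whitespace around each name
  rem_vec <- trimws(unlist(strsplit(features, ',')))
  # drop = FALSE keeps a data frame even if only one column remains
  res <- df[, !(names(df) %in% rem_vec), drop = FALSE]
  return(res)
}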

Usage:

remove_features(iris, "Sepal.Length, Petal.Width")

milan · Answer 19 · 2018-08-17T11:42:03.720

Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the dataframe. This is an example.

DF <- data.frame(one=c('a','b'), two=c('c', 'd'), three=c('e', 'f'), four=c('g', 'h'))
DF
#  one two three four
#1   a   c     e    g
#2   b   d     f    h

DF[which(names(DF) %in% c('two','three')) *-1]
#  one four
#1   a    g
#2   b    h

score 1 · Answer 20 · answered Dec 16 '19 at 11:56

If you have a large data.frame and are low on memory, use [ or rm and within to remove columns of a data.frame, as subset currently (R 3.6.2) uses more memory, quite apart from the manual's hint that subset is intended for interactive use.

getData <- function() {
  n <- 1e7
  set.seed(7)
  data.frame(a = runif(n), b = runif(n), c = runif(n), d = runif(n))
}

DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- DF[setdiff(names(DF), c("a", "c"))] ##
#DF <- DF[!(names(DF) %in% c("a", "c"))] #Alternative
#DF <- DF[-match(c("a","c"),names(DF))]  #Alternative
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used

DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- subset(DF, select = -c(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#357 MB are used

DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF <- within(DF, rm(a, c)) ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used

DF <- getData()
tt <- sum(.Internal(gc(FALSE, TRUE, TRUE))[13:14])
DF[c("a", "c")]  <- NULL ##
sum(.Internal(gc(FALSE, FALSE, TRUE))[13:14]) - tt
#0.1 MB are used
