327

I have a large data set and I would like to read specific columns or drop all the others.

data <- read.dta("file.dta")

I select the columns that I'm not interested in:

var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]

and than I'd like to do something like:

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

to drop all the unwanted columns. Is this the optimal solution?

csgillespie
  • 54,386
  • 13
  • 138
  • 175
leroux
  • 3,358
  • 3
  • 14
  • 8
  • 1
    sleeping over the problem, i was thinking that `subset(data, select=c(...))` helps in my case for dropping vars. the question though was mainly about the `paste("data$",var.out[i],sep="")` part to access columns of interest inside the loop. how can i paste or somehow compose a column name? Thanks to everyone for your attention and your help – leroux Mar 09 '11 at 08:24
  • 9
    Possible duplicate of [Drop columns in R data frame](http://stackoverflow.com/questions/4605206/drop-columns-in-r-data-frame) – jangorecki Dec 08 '15 at 18:47

11 Answers11

414

You should use either indexing or the subset function. For example :

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8

Then you can use the which function and the - operator in column indexation :

R> df[ , -which(names(df) %in% c("z","u"))]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Or, much simpler, use the select argument of the subset function : you can then use the - operator directly on a vector of column names, and you can even omit the quotes around the names !

R> subset(df, select=-c(z,u))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

Note that you can also select the columns you want instead of dropping the others :

R> df[ , c("x","y")]
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6

R> subset(df, select=c(x,y))
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
juba
  • 43,082
  • 11
  • 100
  • 113
  • 2
    the `select` argument of the `subset` function did the job perfectly! Thank you juba! – leroux Mar 08 '11 at 15:13
  • 2
    `which` is not neccessary, see Ista's answer. But the subset with `-` is nice! Didn't know that! – Tomas Jul 25 '13 at 19:03
  • 6
    `subset` looks good, but the way it silently drops missing values seems pretty dangerous to me. – static_rtti Jun 05 '14 at 15:33
  • 4
    `subset` is indeed very convenient, but remember to avoid using it unless you're using R interactively. See [the Warning in the function's documentation](http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html) and [this SO question](http://stackoverflow.com/q/9860090/1169233) for more. – Waldir Leoncio Aug 13 '14 at 18:45
  • 1
    If there are several duplicate colnames in your data, the `which` method works fine, while others would only output the only first col. – Jill Clover Feb 05 '16 at 20:57
  • 2
    What could cause this `> subset(testData5, select= -c("myCol")) Error in -c("myCol") : invalid argument to unary operator`? – hhh Jun 08 '17 at 14:46
  • 7
    "you can even omit the quotes around the names !", you actually have to omit the quotes, otherwise you'll get invalid argument to unary operator. If you have certain characters (for example "-") in your names you cannot use this method at all since dropping quotes will cause R to be unable to properly parse your code. – oh54 Jul 19 '17 at 18:10
  • 1
    .@juba - What if I want to drop column where the column name starts with `-`? – Chetan Arvind Patil Jan 22 '19 at 18:06
132

Do not use -which() for this, it is extremely dangerous. Consider:

dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
dat[ , -which(names(dat) %in% c("z","u"))] ## works as expected
dat[ , -which(names(dat) %in% c("foo","bar"))] ## deletes all columns! Probably not what you wanted...

Instead use subset or the ! function:

dat[ , !names(dat) %in% c("z","u")] ## works as expected
dat[ , !names(dat) %in% c("foo","bar")] ## returns the un-altered data.frame. Probably what you want

I have learned this from painful experience. Do not overuse which()!

Ista
  • 9,106
  • 2
  • 32
  • 30
50

First, you can use direct indexing (with booleans vectors) instead of re-accessing column names if you are working with the same data frame; it will be safer as pointed out by Ista, and quicker to write and to execute. So what you will only need is:

var.out.bool <- !names(data) %in% c("iden", "name", "x_serv", "m_serv")

and then, simply reassign data:

data <- data[,var.out.bool] # or...
data <- data[,var.out.bool, drop = FALSE] # You will need this option to avoid the conversion to an atomic vector if there is only one column left

Second, quicker to write, you can directly assign NULL to the columns you want to remove:

data[c("iden", "name", "x_serv", "m_serv")] <- list(NULL) # You need list() to respect the target structure.

Finally, you can use subset(), but it cannot really be used in the code (even the help file warns about it). Specifically, a problem to me is that if you want to directly use the drop feature of susbset() you need to write without quotes the expression corresponding to the column names:

subset( data, select = -c("iden", "name", "x_serv", "m_serv") ) # WILL NOT WORK
subset( data, select = -c(iden, name, x_serv, m_serv) ) # WILL

As a bonus, here is small benchmark of the different options, that clearly shows that subset is the slower, and that the first, reassigning method is the faster:

                                        re_assign(dtest, drop_vec)  46.719  52.5655  54.6460  59.0400  1347.331
                                      null_assign(dtest, drop_vec)  74.593  83.0585  86.2025  94.0035  1476.150
               subset(dtest, select = !names(dtest) %in% drop_vec) 106.280 115.4810 120.3435 131.4665 65133.780
 subset(dtest, select = names(dtest)[!names(dtest) %in% drop_vec]) 108.611 119.4830 124.0865 135.4270  1599.577
                                  subset(dtest, select = -c(x, y)) 102.026 111.2680 115.7035 126.2320  1484.174

Microbench graph

Code is below :

dtest <- data.frame(x=1:5, y=2:6, z = 3:7)
drop_vec <- c("x", "y")

null_assign <- function(df, names) {
  df[names] <- list(NULL)
  df
}

re_assign <- function(df, drop) {
  df <- df [, ! names(df) %in% drop, drop = FALSE]
  df
}

res <- microbenchmark(
  re_assign(dtest,drop_vec),
  null_assign(dtest,drop_vec),
  subset(dtest, select = ! names(dtest) %in% drop_vec),
  subset(dtest, select = names(dtest)[! names(dtest) %in% drop_vec]),
  subset(dtest, select = -c(x, y) ),
times=5000)

plt <- ggplot2::qplot(y=time, data=res[res$time < 1000000,], colour=expr)
plt <- plt + ggplot2::scale_y_log10() + 
  ggplot2::labs(colour = "expression") + 
  ggplot2::scale_color_discrete(labels = c("re_assign", "null_assign", "subset_bool", "subset_names", "subset_drop")) +
  ggplot2::theme_bw(base_size=16)
print(plt)
Antoine Lizée
  • 3,433
  • 1
  • 24
  • 32
  • 2
    I like your second alternative using `NULL`, but why when you put more than two names is necessary to assign it with `list(NULL)` ? I'm only curious to know how it works, because I tried with only one name and I don't need `list()` – Darwin PC Jul 28 '15 at 10:07
  • 3
    @DarwinPC Yes. If you access directly one vector element (with `$` or `[[`), using ` – Antoine Lizée Aug 03 '15 at 18:08
  • Has this behaviour changed? I get identical results with both `NULL` and `list(NULL)`. – J. Mini Mar 19 '21 at 21:46
29

You can also try the dplyr package:

R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
R> df
  x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
R> library(dplyr)
R> dplyr::select(df2, -c(x, y))  # remove columns x and y
  z u
1 3 4
2 4 5
3 5 6
4 6 7
5 7 8
Megatron
  • 12,223
  • 9
  • 75
  • 86
  • 5
    Using ```dplyr::select(df2, -one_of(c('x','y')))``` will still work (with a warning) even if some of the named columns don't exist – divibisan Mar 23 '18 at 16:58
  • This was exactly the solution I was searching for @divibisan, thanks! – JJGabe Sep 23 '20 at 17:31
15

Here's a quick solution for this. Say, you have a data frame X with three columns A, B and C:

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6))
> X
  A B C
1 1 3 5
2 2 4 6

If I want to remove a column, say B, just use grep on colnames to get the column index, which you can then use to omit the column.

> X<-X[,-grep("B",colnames(X))]

Your new X data frame would look like the following (this time without the B column):

> X
  A C
1 1 5
2 2 6

The beauty of grep is that you can specify multiple columns that match the regular expression. If I had X with five columns (A,B,C,D,E):

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
  A B C D  E
1 1 3 5 7  9
2 2 4 6 8 10

Take out columns B and D:

> X<-X[,-grep("B|D",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10

EDIT: Considering the grepl suggestion of Matthew Lundberg in the comments below:

> X<-data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,8),E=c(9,10))
> X
  A B C D  E
1 1 3 5 7  9
2 2 4 6 8 10
> X<-X[,!grepl("B|D",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10

If I try to drop a column that's non-existent,nothing should happen:

> X<-X[,!grepl("G",colnames(X))]
> X
  A C  E
1 1 5  9
2 2 6 10
Joben R. Ilagan
  • 1,772
  • 1
  • 14
  • 9
  • 3
    `X[,-grep("B",colnames(X))]` will return no columns in the case where no column name contains `B`, rather than returning all the columns as would be desired. Consider with `X – Matthew Lundberg Dec 25 '15 at 02:33
6

I tried to delete a column while using the package data.table and got an unexpected result. I kind of think the following might be worth posting. Just a little cautionary note.

[ Edited by Matthew ... ]

DF = read.table(text = "
     fruit state grade y1980 y1990 y2000
     apples Ohio   aa    500   100   55
     apples Ohio   bb      0     0   44
     apples Ohio   cc    700     0   33
     apples Ohio   dd    300    50   66
", sep = "", header = TRUE, stringsAsFactors = FALSE)

DF[ , !names(DF) %in% c("grade")]   # all columns other than 'grade'
   fruit state y1980 y1990 y2000
1 apples  Ohio   500   100    55
2 apples  Ohio     0     0    44
3 apples  Ohio   700     0    33
4 apples  Ohio   300    50    66

library('data.table')
DT = as.data.table(DF)

DT[ , !names(dat4) %in% c("grade")]    # not expected !! not the same as DF !!
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

DT[ , !names(DT) %in% c("grade"), with=FALSE]    # that's better
    fruit state y1980 y1990 y2000
1: apples  Ohio   500   100    55
2: apples  Ohio     0     0    44
3: apples  Ohio   700     0    33
4: apples  Ohio   300    50    66

Basically, the syntax for data.table is NOT exactly the same as data.frame. There are in fact lots of differences, see FAQ 1.1 and FAQ 2.17. You have been warned!

Matt Dowle
  • 56,107
  • 20
  • 160
  • 217
Mark Miller
  • 11,294
  • 21
  • 69
  • 119
4
df2 <- df[!names(df) %in% c("c1", "c2")]
Marvin W
  • 2,985
  • 22
  • 15
3

I changed the code to:

# read data
dat<-read.dta("file.dta")

# vars to delete
var.in<-c("iden", "name", "x_serv", "m_serv")

# what I'm keeping
var.out<-setdiff(names(dat),var.in)

# keep only the ones I want       
dat <- dat[var.out]

Anyway, juba's answer is the best solution to my problem!

Cleb
  • 20,118
  • 16
  • 91
  • 131
leroux
  • 3,358
  • 3
  • 14
  • 8
  • Why do you want to do this in a loop? The answers juba's answer shows you how to do it in one step. Why make it more complicated? – Ista Mar 09 '11 at 15:38
  • of course i use the `select` argument of the `subset` function in my code. i just wanted to see how i could access arbitrary columns in a loop in case i wanted to do something else than just dropping the column. the original data set has about 1200 vars and i'm only interested in using 4 of them without knowing where exactly they are. – leroux Mar 09 '11 at 16:02
2

Here is another solution that may be helpful to others. The code below selects a small number of rows and columns from a large data set. The columns are selected as in one of juba's answers except that I use a paste function to select a set of columns with names that are numbered sequentially:

df = read.table(text = "

state county city  region  mmatrix  X1 X2 X3    A1     A2     A3      B1     B2     B3      C1      C2      C3

  1      1     1      1     111010   1  0  0     2     20    200       4      8     12      NA      NA      NA
  1      2     1      1     111010   1  0  0     4     NA    400       5      9     NA      NA      NA      NA
  1      1     2      1     111010   1  0  0     6     60     NA      NA     10     14      NA      NA      NA
  1      2     2      1     111010   1  0  0    NA     80    800       7     11     15      NA      NA      NA

  1      1     3      2     111010   0  1  0     1      2      1       2      2      2      10      20      30
  1      2     3      2     111010   0  1  0     2     NA      1       2      2     NA      40      50      NA
  1      1     4      2     111010   0  1  0     1      1     NA      NA      2      2      70      80      90
  1      2     4      2     111010   0  1  0    NA      2      1       2      2     10     100     110     120

  1      1     1      3     010010   0  0  1    10     20     10     200    200    200       1       2       3
  1      2     1      3     001000   0  0  1    20     NA     10     200    200    200       4       5       9
  1      1     2      3     101000   0  0  1    10     10     NA     200    200    200       7       8      NA
  1      2     2      3     011010   0  0  1    NA     20     10     200    200    200      10      11      12

", sep = "", header = TRUE, stringsAsFactors = FALSE)
df

df2 <- df[df$region == 2, names(df) %in% c(paste("C", seq_along(1:3), sep=''))]
df2

#    C1  C2  C3
# 5  10  20  30
# 6  40  50  NA
# 7  70  80  90
# 8 100 110 120
Mark Miller
  • 11,294
  • 21
  • 69
  • 119
-1

I can´t answer your question in the comments due to low reputation score.

The next code will give you an error because the paste function return a character string

for(i in 1:length(var.out)) {
   paste("data$", var.out[i], sep="") <- NULL
}

Here is a possible solution:

for(i in 1:length(var.out)) {

  text_to_source <- paste0 ("data$", var.out[i], "<- NULL") # Write a line of your
                                                  # code like a character string
  eval (parse (text=text_to_source)) # Source a text that contains a code
}

or just do:

for(i in 1:length(var.out)) {
  data[var.out[i]] <- NULL
}
demongolem
  • 8,796
  • 36
  • 82
  • 101
Andriy T.
  • 1,930
  • 10
  • 23
-3
df = mtcars 
remove vs and am because they are categorical. In the dataset vs is in column number 8, am is in column number 9

dfnum = df[,-c(8,9)]