Complex subsetting data set to data frame

Question

1) I would like to do a subset operation in Gnu R with the data set here to have a resulting data frame only with Brazil, Time and all the Series Name about Income share (like "Income share held by lowest 10%", ""Income share held by lowest 20%" and so on). In total there would be 7 Series Names about Income Share.

I tried the following command but cannot subset more than one "Series.Name":

test <- melt(subset(WDI, subset = Series.Name == "Income share held by lowest 10%", select = -c(Time.Code, Series.Code, Argentina, Canada, Chile, Colombia, Mexico, USA, Venezuela)), id.vars = c("Series.Name", "Time"))

2) In an additional second step I would like to delete all rows with NA values.

The complete code I use is the following:

WDI <- read.csv(https://dl.dropboxusercontent.com/u/109495328/WDI_Data_final.csv, na.strings = "..")
library(reshape)
library(reshape2)
WDI <- rename(WDI, (c(Argentina..ARG.="Argentina", Brazil..BRA.="Brazil", Canada..CAN.="Canada", Chile..CHL.="Chile", Colombia..COL.="Colombia", Mexico..MEX.="Mexico", United.States..USA.="USA", Venezuela..RB..VEN.="Venezuela")))
income_brazil_long <- melt(subset(WDI, subset = Series.Name == "Income share held by lowest 10%", select = -c(Time.Code, Series.Code, Argentina, Canada, Chile, Colombia, Mexico, USA, Venezuela)), id.vars = c("Series.Name", "Time"))

The data set in question measure 128 kb in size. I thought it might be better to provide the original data rather than to make up some random data in order to prevent misunderstandings. — Til Hund, Jan 14 '15 at 17:06
http://stackoverflow.com/questions/4862178/remove-rows-with-nas-in-data-frame in case also. — Colonel Beauvel, Jan 14 '15 at 17:09

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2015-01-14T17:23:55.533

Looking at your data, this is actually probably easiest using grepl to help with the subsetting.

We use grepl to search through the "Series.Name" column for any rows that include the string "Income share held". That creates a logical vector indicating the rows we want. The columns we want are the first, the third, and the sixth.

Wrap it all in na.omit to get rid of any rows with NA values.

WDI_Brazil <- na.omit(WDI[grepl("Income share held", WDI$Series.Name), 
                          c(1, 3, 6)])

The data are already "long", so there's no need to melt. What does the data.frame look like?

summary(WDI_Brazil)
#                            Series.Name      Time       Brazil..BRA.   
#  Income share held by fourth 20% :28   Min.   :1981   Min.   : 0.600  
#  Income share held by highest 10%:28   1st Qu.:1988   1st Qu.: 2.895  
#  Income share held by highest 20%:28   Median :1996   Median :10.320  
#  Income share held by lowest 10% :28   Mean   :1996   Mean   :20.948  
#  Income share held by lowest 20% :28   3rd Qu.:2004   3rd Qu.:43.797  
#  Income share held by second 20% :28   Max.   :2012   Max.   :67.310  
#  (Other)                         :28                                  
table(droplevels(WDI_Brazil$Series.Name))
# 
#  Income share held by fourth 20% Income share held by highest 10% Income share held by highest 20% 
#                               28                               28                               28 
#  Income share held by lowest 10%  Income share held by lowest 20%  Income share held by second 20% 
#                               28                               28                               28 
#   Income share held by third 20% 
#                               28

Notice that there are seven factor levels in "Series.Name", as expected.

Awesome! Like it for the clarity. – Til Hund Jan 14 '15 at 17:19 — Til Hund, Jan 14 '15 at 17:19

cdeterman · Answer 2 · 2015-01-14T17:18:51.567

Well, you can do what you are looking for with base functions.

WDI <- read.csv("WDI_Data_final.csv", header=T, na.strings="..")

# The colnames are strange from the file so reset for clarity
colnames(WDI) <- c("Series.Name", "Series.Code", "Time","Time.Code","Argentina",
                   "Brazil", "Canada", "Chile", "Colombia","Mexico",
                   "USA", "Venezuela")

# do the subsetting
test <- with(WDI, 
             WDI[Series.Name=="Income share held by lowest 10%",
                 c("Brazil","Time", "Series.Name")])

# if you want more, use %in% and specify the Series.Names you care about
test <- with(WDI, 
             WDI[Series.Name %in% c("Income share held by lowest 10%", 
                                    "Income share held by lowest 20%"),
                 c("Brazil","Time", "Series.Name")])

# if you want all the 'income shares', the grepl solution above  by
# Ananda is the most concise.

# you can then use reshape2::melt
melted_test <- melt(test, id.vars=c("Series.Name", "Time"))

To remove the NA just use complete.cases

test[complete.cases(test),]

Thank you very much for your answer, cdeterman. I learned with your answer that my code is too complicated. Did not know that `colnames` exist. Thumbs up, man! — Til Hund, Jan 14 '15 at 17:21

Complex subsetting data set to data frame

2 Answers2