I'm just learning to use R, so this may seem like a noob question, but I'm having some problems with the function "subset". I tried to find an answer in previous questions, but failed.

For example, I have a data frame q with 3 variables x, y, z:

q = read.csv("test.csv", encoding = "UTF-8",
             header = TRUE, sep = ",", na.strings = c("", NA))

Variable x has 4 values: a, b, c, d.

I'm trying to make a data frame q1 that keeps only 2 values of variable x, a and c:

q1 = subset(q, q$x == 'a' | q$x == 'c')

As a result I have a new data frame with only 2 values of variable x (I checked by opening the new data frame).

But when I call table() on variable x from the new data frame q1, I again see all 4 values, with counts of 0 for b and d.

What am I doing wrong? Why do I see b and d when I tabulate x in the new data frame?

Thanks for your help!

khaydarova

2 Answers

The column in your data frame is a factor, which is another name for a categorical variable, a thing that can take one of a number of possible character values, or "levels", such as "Male" or "Female".

When you subset a factor, its levels do not change. table() counts every level, including the ones that no longer have any rows, so b and d show up with counts of zero.
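A minimal sketch that reproduces this behaviour. The data frame below is a made-up stand-in for test.csv (its values are an assumption, not the asker's data):

```r
# Hypothetical stand-in for the contents of test.csv
q <- data.frame(x = factor(c("a", "b", "c", "d", "a", "c")),
                y = 1:6)

# Same filter as in the question
q1 <- subset(q, x == "a" | x == "c")

# The rows for b and d are gone, but the factor still knows about them
levels(q1$x)

# table() reports every level, so b and d are listed with a count of 0
table(q1$x)
```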

If you want to avoid this, either convert your factors to character values with the as.character function, or read them in as character in the first place with the stringsAsFactors=FALSE option to read.csv.
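Both options could look roughly like this. The inline CSV text is an invented substitute for test.csv so the example runs on its own; note that from R 4.0.0 onwards read.csv already defaults to stringsAsFactors = FALSE, so the problem mainly appears on older versions or when factors are requested explicitly:

```r
# Hypothetical CSV content standing in for test.csv
csv_text <- "x,y\na,1\nb,2\nc,3\nd,4\na,5\n"

# Option 1: read x as character from the start
q <- read.csv(text = csv_text, stringsAsFactors = FALSE)
q1 <- subset(q, x == "a" | x == "c")
table(q1$x)   # only a and c are tabulated

# Option 2: x is already a factor, convert it after subsetting
qf <- read.csv(text = csv_text, stringsAsFactors = TRUE)
qf1 <- subset(qf, x == "a" | x == "c")
qf1$x <- as.character(qf1$x)
table(qf1$x)  # again only a and c
```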

Spacedman

Factor variables (R's version of categorical variables) remember all possible categories unless you tell them not to. You can make q1 "forget" the unused ones with q1 = droplevels(q1), or by converting the factor to an ordinary string: q1$x = as.character(q1$x)
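A short sketch of the droplevels route, using an invented data frame in place of the asker's file. It keeps x as a factor but discards the levels that no longer occur:

```r
# Hypothetical data in place of test.csv
q <- data.frame(x = factor(c("a", "b", "c", "d", "a")),
                y = 1:5)

q1 <- subset(q, x == "a" | x == "c")

# Drop unused levels from every factor column in q1
q1 <- droplevels(q1)

levels(q1$x)  # only the levels still present remain
table(q1$x)   # no zero-count entries for b and d
```

droplevels(q1) acts on all factor columns of the data frame; to clean up a single column, q1$x <- droplevels(q1$x) should do the same for x alone.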

Gregor Thomas