I have a data frame with about 250 variables. Unfortunately, all of them were imported as character from a SQL database using sqldf. The problem: not all of them should be character. There are numeric variables, integers, and dates. I'd like to build a model that runs over all the variables, and to do this I need to make sure the variables have the right classes. Fixing them one by one is probably the most reliable option, but it is very manual.

How could I automatically correct all classes? Perhaps a way to detect whether there are alphabet characters in the column or only number characters?

I don't expect an automatic approach to get every class right. But if it corrects most of them, I can fix the remaining ones manually.

I am adding a sqldf tag in case anybody knows of any way to correct this when importing the data, but I assume it's not sqldf's fault but rather the database's.

Amir
jgozal

1 Answer

The closest thing to "automatic" type conversion on a data frame would probably be

df[] <- lapply(df, type.convert)

where df is your data set. The function type.convert()

Converts a character vector to logical, integer, numeric, complex or factor as appropriate.

Have a read of help(type.convert), it might be just what you want.

In my experience, type.convert() is very reliable. You can use as.is = TRUE if you don't want characters coerced to factors. Plus it's used internally in many important R functions (like read.table), so it's definitely safe.
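
As a rough illustration of what `as.is` changes, here is a minimal sketch on a small made-up data frame (`df_demo` and its columns are hypothetical, not from the question). `as.is` is passed explicitly in both calls because, as far as I know, R 4.0.0 and later warn when it is left unspecified.

```r
# Hypothetical data frame where every column arrived as character
df_demo <- data.frame(
  id    = c("1", "2", "3"),
  score = c("1.5", "2.25", "3"),
  group = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# as.is = TRUE: strings that don't look like numbers or logicals stay character
keep_chr <- df_demo
keep_chr[] <- lapply(keep_chr, type.convert, as.is = TRUE)
sapply(keep_chr, class)
#          id       score       group
#   "integer"   "numeric" "character"

# as.is = FALSE: those same columns become factors instead
to_fct <- df_demo
to_fct[] <- lapply(to_fct, type.convert, as.is = FALSE)
sapply(to_fct, class)
#        id     score     group
# "integer" "numeric"  "factor"
```

For a modelling workflow, which of the two is preferable probably depends on whether the remaining text columns are categorical predictors or free text.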

Here's a quick example of it working on iris. First we'll change all columns to character, then run type.convert() on it.

## Original column classes in iris
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

## Change all columns to character
iris[] <- lapply(iris, as.character)
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#  "character"  "character"  "character"  "character"  "character" 

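## Run type.convert(); as.is = FALSE turns non-numeric strings into factors
## (newer R versions warn and keep them as character if as.is is omitted)
iris[] <- lapply(iris, type.convert, as.is = FALSE)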
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

We can see that the columns were returned to their original classes. This is because type.convert() coerces columns to the "most appropriate" type.
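
Note that type.convert() does not produce Date columns, so the dates mentioned in the question would stay character (or become factors). One possible follow-up step is sketched below, under the assumption that the dates are stored as `YYYY-MM-DD` strings. `convert_dates` and `df_dates` are hypothetical names for illustration, not part of base R or of the question's data.

```r
# Hypothetical helper: turn a character column into Date only if every
# non-missing value parses with the given format; otherwise return it unchanged.
convert_dates <- function(x, format = "%Y-%m-%d") {
  if (!is.character(x)) return(x)
  parsed <- as.Date(x, format = format)
  ok <- !is.na(x)
  if (any(ok) && !anyNA(parsed[ok])) parsed else x
}

# Made-up data frame with everything imported as character
df_dates <- data.frame(
  when = c("2016-03-13", "2016-03-14"),
  name = c("a", "b"),
  n    = c("1", "2"),
  stringsAsFactors = FALSE
)

# First let type.convert() handle numbers, then try dates on what is left
df_dates[] <- lapply(df_dates, type.convert, as.is = TRUE)
df_dates[] <- lapply(df_dates, convert_dates)
sapply(df_dates, class)
```

Columns in other date formats would need a different `format` string, and anything the helper leaves as character is a candidate for the manual pass the question already anticipates.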

Rich Scriven
  • hello Richard, I recently used this on a different data frame and it gave this error `Error in FUN(X[[i]], ...) : the first argument must be of mode character` I was wondering if you knew why this was happening – jgozal Mar 13 '16 at 08:32
  • it looks like type.convert() expects a character vector as its first argument. I have tried converting my df to as.character(df) but then it just converted everything into factor type – jgozal Mar 13 '16 at 09:28
  • @jgozal If you want characters to remain characters and not be coerced to factors, set `as.is=TRUE` in `type.convert` – Rich Scriven Mar 13 '16 at 14:42
  • won't that still convert the other columns to characters though? – jgozal Mar 13 '16 at 17:15
  • @jgozal - It will coerce them to their appropriate type. So if R decides they should be numeric, they will be numeric. Try it out. `type.convert(as.character(1:5))` goes back to numeric, `type.convert(letters[1:5])` goes to factor, and `type.convert(letters[1:5], as.is = TRUE)` remains character – Rich Scriven Mar 13 '16 at 17:19
  • so if I understand correctly. If I want to solve this issue and still convert my df to what R thinks each column should be converted to, I should do `df[] – jgozal Mar 13 '16 at 17:32
  • @jgozal - No, you would have to do `df[] – Rich Scriven Mar 13 '16 at 17:36
  • I see. So just to clarify the process there. Would the function above be converting each vector in the dataframe to a character vector? then type.convert() converts each vector to the appropriate class/type. And I would only leave `as.is =TRUE` if I didn't want type.convert() to convert any character vectors to factors. Is this correct? – jgozal Mar 13 '16 at 17:46
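
For reference, the process described in this last comment could look like the sketch below. It illustrates the described steps (coerce each column to character, then call type.convert()) and is not necessarily the exact code from the truncated comments above; `df_mixed` is a made-up data frame.

```r
# Hypothetical data frame with a mix of character and non-character columns
df_mixed <- data.frame(
  a = c("1", "2", "3"),   # character that should be integer
  b = c(0.5, 1.5, 2.5),   # already numeric
  c = c("x", "y", "z"),   # genuinely text
  stringsAsFactors = FALSE
)

# Coerce each column to character first, then let type.convert() pick a type.
# as.is = TRUE keeps text columns as character instead of factor.
df_mixed[] <- lapply(df_mixed, function(col) {
  type.convert(as.character(col), as.is = TRUE)
})
sapply(df_mixed, class)
#           a           b           c
#   "integer"   "numeric" "character"
```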