I have a dataset derived from Pokemon statistics that contains a lot of numerical and categorical data. My end goal is to create a model or recommendation system where a user can input a list of Pokemon and the model finds similar Pokemon they may like. Currently the dataset looks something like this:

ID   Name    Type1    Type2   HP 
001  Bulba.. Grass    Poison  45
etc.

I understand the Type1/Type2 columns might be problematic. Is there a function that would let me create new columns (or modify existing ones) so that, if a Pokemon has a particular type, the corresponding column holds a logical value (0 for false, 1 for true)?

I apologize for the lackluster explanation, but what I want is for my dataset to look like this:

ID   Name    Grass  Poison Water  HP 
001  Bulba..    1      1     0    45
etc.
  • Does this answer your question? [How to one hot encode several categorical variables in R](https://stackoverflow.com/questions/48649443/how-to-one-hot-encode-several-categorical-variables-in-r) – LocoGris Nov 09 '19 at 17:59
  • Does this answer your question? [Reshape multiple categorical variables to binary response variables](https://stackoverflow.com/questions/18474896/reshape-multiple-categorical-variables-to-binary-response-variables) – camille Nov 09 '19 at 19:14

1 Answer

tidyr is a package for data reshaping. Here, we use pivot_longer() to put the data into long format: the original column names (Type1, Type2) go into a column called "name", and the type values (Grass, Poison, etc.) go into a column called "value". We filter out rows where is.na(value), because those mean the pokemon has no second type. We then create an indicator variable that is always 1, so each remaining row marks a type the pokemon actually has. We drop the now extraneous "name" column and use pivot_wider() to turn each unique entry in "value" into its own column, filled with the indicator for each row. Finally, we mutate all numeric columns to replace missing values with 0, since a missing value there means the pokemon is not of that type.

A better solution than mutate_if(is.numeric, ...) would be to compute the unique type values and use mutate_at(vars(pokemon_types), ...). That would not unintentionally affect other numeric columns, such as ID or HP.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
pokemon <- tibble(ID = c(1,2), Name = c("Bulbasaur", "Squirtle"),
                  Type1 = c("Grass", "Water"), 
                  Type2 = c("Poison", NA),
                  HP = c(40, 50))

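pokemon %>%
  # Long format: one row per (pokemon, type slot)
  pivot_longer(starts_with("Type")) %>%
  # Drop rows for a missing second type
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  # One column per type; types a pokemon lacks come out as NA
  pivot_wider(names_from = value, values_from = indicator) %>%
  # Replace NA with 0 (note: this is applied to every numeric column, including ID and HP)
  mutate_if(is.numeric, .funs = function(x) if_else(is.na(x), 0, x))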
#> # A tibble: 2 x 6
#>      ID Name         HP Grass Poison Water
#>   <dbl> <chr>     <dbl> <dbl>  <dbl> <dbl>
#> 1     1 Bulbasaur    40     1      1     0
#> 2     2 Squirtle     50     0      0     1
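
The `mutate_at()` variant mentioned in the explanation could look roughly like the sketch below. It is not taken from the original code: the `pokemon_types` vector and the `encoded` name are introduced here for illustration, and `coalesce()` is swapped in for the `if_else()` call as a shorter way to replace `NA` with 0. It assumes a dplyr version recent enough to support `all_of()` inside `vars()`.

```r
library(tidyr)
library(dplyr)

pokemon <- tibble(ID = c(1, 2), Name = c("Bulbasaur", "Squirtle"),
                  Type1 = c("Grass", "Water"),
                  Type2 = c("Poison", NA),
                  HP = c(40, 50))

# Hypothetical helper vector: every non-missing type found in either type column
all_types <- c(pokemon$Type1, pokemon$Type2)
pokemon_types <- unique(all_types[!is.na(all_types)])

encoded <- pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  # Only the type columns are modified; ID and HP are left untouched
  mutate_at(vars(all_of(pokemon_types)), ~ coalesce(.x, 0))

encoded
```

Because the replacement is limited to the columns named in `pokemon_types`, a genuinely missing HP value would stay `NA` instead of silently becoming 0. If your tidyr version supports it, passing `values_fill = 0` to `pivot_wider()` should achieve the same thing without the extra mutate step, since it only fills the newly created columns.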
smingerson
  • In your last step, you use `mutate_if` in a way that also catches HP and ID—that could become a problem since you're working on columns that you don't actually intend to change. It would probably be safer to use `mutate_at` and either select the columns you want explicitly, or select e.g. `-ID:-HP`. At that point, since you're making a numeric 0/1 from a logical condition, you can simplify the `ifelse` to something like `as.numeric(!is.na(x))` – camille Nov 09 '19 at 19:13
  • Fair points. I covered `mutate_if()` vs `mutate_at()` in the preceding paragraph. – smingerson Nov 09 '19 at 20:02