I have a dataset derived from Pokemon statistics that contains both numerical and categorical data. My end goal is to create a model or recommendation system where a user inputs a list of Pokemon and the model finds similar Pokemon they may like. Currently the dataset looks something like this:

ID   Name    Type1    Type2   HP  ATK   DEF
001  Bulba.. Grass    Poison  45  49    49
etc...

I want to convert this data into the "long format", because that format works better with many functions in R, but I am having trouble dealing with the Type1/Type2 columns. Is there a way I can merge those two columns into a single column (like "Type") and then convert the data into the new format? Something like this:

ID   Name    Type    Stat  Value
001  Bulba.. Grass   HP    45
001  Bulba.. Poison  HP    45
001  Bulba.. Grass   ATK   49
001  Bulba.. Poison  ATK   49

I understand that for dual-type Pokemon this would create pseudo (duplicate) entries, but I don't see any cleaner way to accomplish this. I also know about tidyr's gather function, but with it I can only produce the Stat column; it does not solve the Type issue.

Can anyone help me figure out how to accomplish this, or suggest a more efficient method?

  • What do you mean by "dual entry"? Is every row an "observation" per https://vita.had.co.nz/papers/tidy-data.pdf? If so, go ahead and use `tidyr::gather`. – markhogue Nov 10 '19 at 03:28

1 Answer

1) pivot_longer. Reshape the data frame twice, first on the Type columns and then on the stat columns, like this:

library(dplyr)
library(tidyr)

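DF %>%
  # stack Type1 and Type2 into one Type column (adds a default "name" column)
  pivot_longer(starts_with("Type"), values_to = "Type") %>%
  # drop the default "name" column, which only records Type1 vs Type2
  select(-name) %>%
  # stack the stat columns into Stat/Value pairs
  pivot_longer(c("HP", "ATK", "DEF"), names_to = "Stat", values_to = "Value")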

giving:

# A tibble: 6 x 5
  ID    Name    Type   Stat  Value
  <chr> <chr>   <chr>  <chr> <int>
1 001   Bulba.. Grass  HP       45
2 001   Bulba.. Grass  ATK      49
3 001   Bulba.. Grass  DEF      49
4 001   Bulba.. Poison HP       45
5 001   Bulba.. Poison ATK      49
6 001   Bulba.. Poison DEF      49

2) melt. Alternatively, use melt from data.table twice.

library(data.table)

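# first melt: stack Type1/Type2 into a single Type column
m1 <- melt(DF, measure.vars = grep("Type", names(DF)), value.name = "Type")
# second melt: stack the stat columns; [-3] drops the third column,
# the leftover "variable" column from the first melt (DF is a data frame here)
melt(m1, measure.vars = c("HP", "ATK", "DEF"),
  variable.name = "Stat", value.name = "Value")[-3]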

giving:

   ID    Name   Type Stat Value
1 001 Bulba..  Grass   HP    45
2 001 Bulba.. Poison   HP    45
3 001 Bulba..  Grass  ATK    49
4 001 Bulba.. Poison  ATK    49
5 001 Bulba..  Grass  DEF    49
6 001 Bulba.. Poison  DEF    49
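
The melt calls above are given a plain data frame, and the trailing [-3] relies on data-frame column indexing. On a data.table, [-3] would drop the third row, not the third column. As a hedged sketch only, here is a variant that converts the input to a data.table first and drops the leftover column by name. The object names DT, m1 and long are illustrative, and the one-row input copies the values from the Note.

```r
library(data.table)

# Illustrative one-row input; values copied from the Note in this answer
DT <- data.table(ID = "001", Name = "Bulba..",
                 Type1 = "Grass", Type2 = "Poison",
                 HP = 45L, ATK = 49L, DEF = 49L)

# First melt: stack Type1/Type2 into a single Type column
type_cols <- grep("^Type", names(DT), value = TRUE)
m1 <- melt(DT, measure.vars = type_cols, value.name = "Type")

# Drop the helper column by name (it only records Type1 vs Type2)
m1[, variable := NULL]

# Second melt: stack the stat columns into Stat/Value pairs
long <- melt(m1, measure.vars = c("HP", "ATK", "DEF"),
             variable.name = "Stat", value.name = "Value")
print(long)
```

Dropping the column by name avoids depending on column position, which can change if the input has additional columns.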

Note

DF in reproducible form was assumed to be:

Lines <- "
ID   Name    Type1    Type2   HP  ATK   DEF
001  Bulba.. Grass    Poison  45  49    49"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, 
  colClasses = list(ID = "character"))
G. Grothendieck
  • whenever I use pivot_longer() I get this "Error: Failed to create output due to bad names. * Choose another strategy with `names_repair` Call `rlang::last_error()` to see a backtrace." – Keith Sanders Nov 10 '19 at 16:28
  • Copy and paste the code in the note into a fresh R session and then copy and paste the code in the answer. You should get the output shown in the answer. – G. Grothendieck Nov 10 '19 at 17:31
  • I still can't get it to work on my dataset. It looks like this: `pokedex_number name type1 type2 defense hp attack sp_attack sp_defense speed base_total 1 Bulbasaur grass poison 49 45 49 65 65 45 318` And my code is this: `DF %>% pivot_longer(starts_with("type"), values_to = "type") %>% select(-name) %>% pivot_longer(c("defense","hp","attack","sp_attack","sp_defense","speed","base_total"), names_to = "Stat", values_to = "Value")` – Keith Sanders Nov 13 '19 at 03:56
  • In the question the column was called `Name` but in the example in your comment it was called `name` and that makes a big difference because `pivot_longer` without a `names_to=` argument by default produces a `name` column and there can't be two columns with the same column name. Rename the input `name` column to be `Name` (or use the `names_to` argument in `pivot_longer` to specify a column name different from the `name` default and change the `select` statement accordingly). – G. Grothendieck Nov 13 '19 at 15:32
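
The second option in the last comment (using names_to so the helper column does not collide with an existing name column) might look like the sketch below. It assumes the column layout quoted in the earlier comment; DF2 is an illustrative one-row input built from that quoted row, and type_slot is a hypothetical helper name, not something from the original data.

```r
library(dplyr)
library(tidyr)

# Illustrative one-row input using the column names and values quoted in the comment
DF2 <- tibble(pokedex_number = 1L, name = "Bulbasaur",
              type1 = "grass", type2 = "poison",
              defense = 49L, hp = 45L, attack = 49L,
              sp_attack = 65L, sp_defense = 65L, speed = 45L,
              base_total = 318L)

long2 <- DF2 %>%
  # "type_slot" (hypothetical name) replaces the default "name" helper column,
  # so it no longer clashes with the existing name column
  pivot_longer(starts_with("type"), names_to = "type_slot", values_to = "type") %>%
  select(-type_slot) %>%
  # stack the stat columns into Stat/Value pairs
  pivot_longer(c(defense, hp, attack, sp_attack, sp_defense, speed, base_total),
               names_to = "Stat", values_to = "Value")

print(long2)
```

With this one-row input the result has 14 rows: 2 types times 7 stat columns.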