38

The documentation for tidyr suggests that gather and spread are transitive, but the following example with the "iris" data shows they are not, but it is not clear why. Any clarification would be greatly appreciated

iris.df = as.data.frame(iris)
long.iris.df = iris.df %>% gather(key = feature.measure, value = size, -Species)
w.iris.df = long.iris.df %>% spread(key = feature.measure, value = size, -Species)

I expected the data frame "w.iris.df" to be the same as "iris.df" but received the following error instead:

"Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9..."

My general question is how to reverse an application of "gather" on this sort of dataset.

David Arenburg
  • 87,271
  • 15
  • 123
  • 181
John D Lee
  • 381
  • 1
  • 3
  • 3
  • 32
    It's not transitive because there aren't enough variables to uniquely identify observations. Try adding something like `iris.df$row – hadley Sep 23 '14 at 03:06

2 Answers2

29

Hadley's intervention was unsurprisingly perfect... but I ended up mucking with the syntax a bit after that... so for what it's worth, I post the fully operational code (sorry my syntax is a bit different than above):

library(tidyr)
library(dplyr)

wide <- 
  iris %>%
  mutate(row = row_number()) %>%
  gather(vars, val, -Species, -row) %>%
  spread(vars, val)

head(wide)
#   Species row Petal.Length Petal.Width Sepal.Length Sepal.Width
# 1  setosa   1          1.4         0.2          5.1         3.5
# 2  setosa   2          1.4         0.2          4.9         3.0
# 3  setosa   3          1.3         0.2          4.7         3.2
# 4  setosa   4          1.5         0.2          4.6         3.1
# 5  setosa   5          1.4         0.2          5.0         3.6
# 6  setosa   6          1.7         0.4          5.4         3.9

head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

They are the same.... just need to reorder if u feel like it...

wide <- wide[,c(3, 4, 5, 6, 1)]  ## Reorder and then remove "row" column

and done.

David Arenburg
  • 87,271
  • 15
  • 123
  • 181
Amit Kohli
  • 2,522
  • 2
  • 19
  • 37
0

As the previous answer may not have been sufficiently clear, there's a problem with how you are performing gather that shows itself when you are trying to spread.

The issue is that in the process of gathering you are losing track of which feature.measure belong to which row in the original dataframe, so spread has no idea how to combine individual values into the the "wide" table again.

iris.df = as.data.frame(iris)
long.iris.df = iris.df %>% 
  tibble::rowid_to_column() %>% 
  gather(key = feature.measure, value = size, -Species, -rowid)

#>   rowid Species feature.measure size
#> 1     1  setosa    Sepal.Length  5.1
#> 2     2  setosa    Sepal.Length  4.9
#> 3     3  setosa    Sepal.Length  4.7
#> 4     4  setosa    Sepal.Length  4.6
#> 5     5  setosa    Sepal.Length  5.0
#> 6     6  setosa    Sepal.Length  5.4

Now each value in size preserves its rowid so you will always be able to recombine it back to the wide dataset (removing unnecessary rowid):

w.iris.df = long.iris.df %>% 
  spread(key = feature.measure, value = size) %>% 
  select(-rowid)
head(w.iris.df)
#>   Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> 1  setosa          1.4         0.2          5.1         3.5
#> 2  setosa          1.4         0.2          4.9         3.0
#> 3  setosa          1.3         0.2          4.7         3.2
#> 4  setosa          1.5         0.2          4.6         3.1
#> 5  setosa          1.4         0.2          5.0         3.6
#> 6  setosa          1.7         0.4          5.4         3.9
dmi3kno
  • 2,505
  • 13
  • 27