0

For the sake of simplicity, let's say I have a dataset at the country-year level, that lists organizations that received aid from a government, how much money was that, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes so many organizations each year, so there are a lot a blank spaces. Moreover, they do not follow any order: one organization can be in the first spot one year, and the next year be coded in the second spot. The data looks like this:

> State  Year  Org1 Aid1 Proj1   Org2 Aid2 Proj2   Org3 Aid3 Proj3   Org4 Aid4 Proj4 ...
  Italy  2000  A    1000 Arts    B    500  Arts    C    300  Social
  Italy  2001  B    700  Social  A    1000 Envir
  Italy  2002  A    1000 Arts    C    300  Envir
  UK     2000
  UK     2001  Z    2000 Social
  UK     2002  Z    2000 Social
  ...

I'm trying to transform this into dyadic data, which would look like this:

> State   Org   Year   Aid   Proj
  Italy   A     2000   1000  Arts
  Italy   A     2001   1000  Envir
  Italy   A     2002   1000  Arts
  Italy   B     2000   500   Arts
  Italy   B     2001   700   Social
  Italy   C     2000   300   Social
  Italy   C     2002   300   Envir
  UK      Z     2001   2000  Social
  ... 

I'm using R, and the best way I could find was building a pre-defined possible set of dyads —using something like expand.grid(unique(State), unique(Org))— and then looping through the data, finding the corresponding column and filling the data frame. But I don't thing this is the most effective method, so I was wondering whether there would be a better way. I thought about dplyror reshape but can't find a solution.

I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.

Thanks a lot in advance.

Community
  • 1
  • 1
Fran Villamil
  • 429
  • 4
  • 12

1 Answers1

1

Since you did not use dput, I will try and make some data that resemble yours:

dat = data.frame(State = rep(c("Italy", "UK"), 3),
                 Year = rep(c(2014, 2015, 2016), 2),
                 Org1 = letters[1:6],
                 Aid1 = sample(800:1000, 6),
                 Proj1 = rep(c("A", "B"), 3),
                 Org2 = letters[7:12],
                 Aid2 = sample(600:700, 6),
                 Proj2 = rep(c("C", "D"), 3),
                 stringsAsFactors = FALSE)

dat

#   State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014    a  910     A    g  658     C
# 2    UK 2015    b  926     B    h  681     D
# 3 Italy 2016    c  834     A    i  625     C
# 4    UK 2014    d  858     B    j  620     D
# 5 Italy 2015    e  831     A    k  650     C
# 6    UK 2016    f  821     B    l  687     D

Next I gather the data and then use extract to make 2 new columns and then spread it all again:

library(tidyr)
library(dplyr)

dat %>%
  gather(key, value, -c(State, Year)) %>%
  extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>% 
  spread(key, value) %>% 
  select(-num)

#    State Year Aid Org Proj
# 1  Italy 2014 910   a    A
# 2  Italy 2014 658   g    C
# 3  Italy 2015 831   e    A
# 4  Italy 2015 650   k    C
# 5  Italy 2016 834   c    A
# 6  Italy 2016 625   i    C
# 7     UK 2014 858   d    B
# 8     UK 2014 620   j    D
# 9     UK 2015 926   b    B
# 10    UK 2015 681   h    D
# 11    UK 2016 821   f    B
# 12    UK 2016 687   l    D

Is this the desired output?

jakub
  • 3,941
  • 3
  • 23
  • 39