3

I have a dataset as follows:

`      GN      Datum                      Land AG..GKV. MTCR..GKV. NSGI..GKV. NSGII..GKV. WA..GKV. ML..GKV.   Wert
8  11693 2012-01-05                     Kenia                                              5A001f          159700
9  11710 2012-01-05                    Indien 1C350.43                                                      18752
10 11749 2012-01-05 Taiwan, Provinz von China  2B350g2                                                      24000
11 11750 2012-01-05                    Indien                                    2B201a1  2B001b2          370296
12 11751 2012-01-05                   Serbien                                     2B201a                     7200
13 11752 2012-01-05                    Indien                                                           4c   6037
14 11753 2012-01-05                  Rumänien                                              2B001d            7830
15 11755 2012-01-06                     China                                    2B201a1                   215987
16 11756 2012-01-06                     China                                    2B201a1                   234736
17 11757 2012-01-06                 Brasilien                                      2B231                    69900
18 11758 2012-01-06                 Brasilien                                      2B231                   139800
19 11788 2012-01-07                 Kongo                                                                  139800

GN i a unique ID. As you can see, for each row, the variables AG..GKV. until ML..GKV hold one value at most, expect in the case where GN == 11750.

I would like to combine these variables into one, but still know what the original variable name was, like so:

`      GN      Datum                      Land variable value    Wert
8  11693 2012-01-05                     Kenia  WA..GKV. 5A001f   159700    
9  11710 2012-01-05                    Indien  AG..GKV. 1C350.43 18752
...

In the case where there are two values per row as in GN == 11750, it should look like this:

`      GN      Datum       Land   variable                value           Wert
11  11750 2012-01-05     Indien  NSGII..GKV.\nWA..GKV.    2B201a1\n2B201a 370296   

i.e., the "double" variable and value should be separated by a new line delimiter.

UPDATE: There is also a special case: When there is no value at all for one row. I want to preserve these (see GN == 19 above and below). variable and value should just be empty in this case.

`      GN      Datum       Land   variable                value           Wert
19  11788 2012-01-07      Kongo                                           370296   

I have no clue how to achieve this and have already tried the reshape package, but probably am too much of a R noob.

Here's a dput (not the same cases as above, but with all three kinds of outcomes):

    structure(list(GN = c(11789L, 11790L, 11791L, 11793L, 11794L, 
11795L, 11796L), Datum = structure(c(15355, 15355, 15355, 15355, 
15355, 15356, 15356), class = "Date"), Land = c("China", "Israel", 
"Pakistan", "Iran, Islamische Republik", "Hong Kong", "Südafrika", 
"Lettland"), AG..GKV. = c("2B350d4", "", "", "", "", "", ""), 
    MTCR..GKV. = c("", "", "", "", "", "", ""), NSGI..GKV. = c("", 
    "", "", "", "", "", ""), NSGII..GKV. = c("", "", "", "", 
    "", "", "2B201a1"), WA..GKV. = c("", "7A002a+b", "", "", 
    "3A001a2c", "6A003b4", "2B001b2"), ML..GKV. = c("", "", "", 
    "", "", "", ""), Wert = c(63720, 25672, 1608000, 10738, 202500, 
    13500, 374873)), .Names = c("GN", "Datum", "Land", "AG..GKV.", 
"MTCR..GKV.", "NSGI..GKV.", "NSGII..GKV.", "WA..GKV.", "ML..GKV.", 
"Wert"), row.names = 49:55, class = "data.frame")
grssnbchr
  • 2,635
  • 6
  • 29
  • 66

2 Answers2

2

You can try (df being your data.frame) :

variable name :

df$variable<-apply(df[,4:9],1,function(x){paste(colnames(df[,4:9])[which(x!="")],collapse="\n")})

value :

df$value<-apply(df[,4:9],1,function(x){paste(x[which(x!="")],collapse="\n")})

Finally, delete the original columns and reorganize :

df<-df[,c(1:3,11:12,10)]

Output :

> df
      GN      Datum                      Land              variable            value   Wert
8  11693 2012-01-05                     Kenia              WA..GKV.           5A001f 159700
9  11710 2012-01-05                    Indien              AG..GKV.         1C350.43  18752
10 11749 2012-01-05 Taiwan, Provinz von China              AG..GKV.          2B350g2  24000
11 11750 2012-01-05                    Indien NSGII..GKV.\nWA..GKV. 2B201a1\n2B001b2 370296
12 11751 2012-01-05                   Serbien           NSGII..GKV.           2B201a   7200
13 11752 2012-01-05                    Indien              ML..GKV.               4c   6037
14 11753 2012-01-05                  Rumänien              WA..GKV.           2B001d   7830
15 11755 2012-01-06                     China           NSGII..GKV.          2B201a1 215987
16 11756 2012-01-06                     China           NSGII..GKV.          2B201a1 234736
17 11757 2012-01-06                 Brasilien           NSGII..GKV.            2B231  69900
18 11758 2012-01-06                 Brasilien           NSGII..GKV.            2B231 139800
Cath
  • 22,843
  • 4
  • 45
  • 80
2

Here's an option with tidyr and dplyr:

library(dplyr)
library(tidyr)
DF %>% gather(Variable, Value, AG..GKV.:ML..GKV.) %>% 
    filter(Value != "") %>% group_by(GN, Datum, Land) %>% 
    summarise_each(funs(paste(unique(.), collapse = "\n")))

#Source: local data frame [11 x 6]
#Groups: GN, Datum
#
#      GN      Datum                      Land   Wert              Variable            Value
#1  11693 2012-01-05                     Kenia 159700              WA..GKV.           5A001f
#2  11710 2012-01-05                    Indien  18752              AG..GKV.         1C350.43
#3  11749 2012-01-05 Taiwan, Provinz von China  24000              AG..GKV.          2B350g2
#4  11750 2012-01-05                    Indien 370296 NSGII..GKV.\nWA..GKV. 2B201a1\n2B001b2
#5  11751 2012-01-05                   Serbien   7200           NSGII..GKV.           2B201a
#6  11752 2012-01-05                    Indien   6037              ML..GKV.               4c
#7  11753 2012-01-05                  Rumänien   7830              WA..GKV.           2B001d
#8  11755 2012-01-06                     China 215987           NSGII..GKV.          2B201a1
#9  11756 2012-01-06                     China 234736           NSGII..GKV.          2B201a1
#10 11757 2012-01-06                 Brasilien  69900           NSGII..GKV.            2B231
#11 11758 2012-01-06                 Brasilien 139800           NSGII..GKV.            2B231

Note that this will only create several entries separated by "\n" if there they are unique, which I assumed is what you want to do.

Edit after OP comment:

For your special case, try the following code:

DF %>% gather(Variable, Value, AG..GKV.:ML..GKV.) %>% 
    group_by(GN, Datum, Land) %>% filter(if(all(Value == "")) row_number() == 1 else Value != "") %>% 
    summarise_each(funs(paste(unique(.), collapse = "\n")))

Explanation:

  • gather reshapes the data from wide to long format (using the columns AG..GKV. until ML..GKV.)
  • Then we group the data by GN, Datum and Land
  • The filter condition is: if all Value entries are empty strings, check if the row number is 1 (that means, return only the first row of such groups) else (if at least one or more Value entries are not empty strings) return only those rows which are not empty
  • Then, on each of the columns - except the grouping columns - summarise all unique values per group into a single string, separated by \n
talat
  • 62,625
  • 18
  • 110
  • 141
  • Thanks. Does tidyr depend on dplyr? Or what library is responsible for what part of the code? – grssnbchr Jan 12 '15 at 13:25
  • In my answer, `gather` is from `tidyr` and the rest is from `dplyr`. You don't have to use them together, but they are sort of made to work well together (by Hadley Wickham) – talat Jan 12 '15 at 13:27
  • Alright. Your solution gives me `Error in filter(., Value != "") : object 'Value' not found` upon execution. Why is that? – grssnbchr Jan 12 '15 at 13:41
  • Never mind, I did not explicitly load `dplyr`, I thought that would have been loaded with `tidyr` anyway. Now it works. – grssnbchr Jan 12 '15 at 13:51
  • I've added a special case that I forgot about first, please see my updated question, maybe you can solve this too with `dplyr` and `tidyr`. The problem is that `filter(Value != "")` makes no distinction between such cases and "normal" ones. – grssnbchr Jan 12 '15 at 15:31
  • I think the key lies in looking for such entries where *all* variables are empty, but I don't know how to do this. – grssnbchr Jan 12 '15 at 15:42
  • Doesn't seem to work, I get exactly the same amount of rows.. Using only `filter(all(Value == ""))` gives 0 results. – grssnbchr Jan 12 '15 at 15:49
  • You should update your question then with a new dput of the refined data. – talat Jan 12 '15 at 15:51
  • 1
    @wnstnsmth, did you copy and paste the whole code from my answer? note that the `filter` statement is now _after_ the group_by. – talat Jan 12 '15 at 15:54
  • Sorry, I've forgot to add a new `dput`. There you go (with other rows, but all three possible cases). A big sorry, I didn't copy paste but only adjusted the filter, now it works. Could you maybe elaborate a bit on your solution - I still don't fully understand what's happening (and will certainly dig into `dplyr` as soon as I have more time). – grssnbchr Jan 12 '15 at 15:57