2

I have a data set that contains some missing values which can be completed by merging with a another dataset. My example:

This is the updated data set I am working with.

DF1

Name Paper     Book       Mug        soap   computer tablet coffee coupons
1   2           3           4         5      6          7    8      9
2   21          22         23         23    23          7    23     9
3   56          57         58         59    60          7    62     9
4   80.33333    81.33333    82.33333  83    83.66667    7    85     9
5   107.3333    108.3333    109.3333 110    110.6667    7   112     9
6   134.3333    135.3333    136.3333 137    137.6667    7   139     9
7   161.3333    162.3333    163.3333 164    164.6667            
8   188.3333    189.3333    190.3333 191    191.6667    7   193     9
9   215.3333    216.3333    217.3333 218    218.6667    7   220     9
10  242.3333    243.3333    244.3333 245    245.6667    7   247     9
11  269.3333    270.3333    271.3333 272    272.6667    7   274     9
12  296.3333    297.3333    298.3333 299    299.6667            
13  323.3333    324.3333    325.3333 326    326.6667    7   328     9
14  350.3333    351.3333    352.3333 353    353.6667    7   355     9
15  377.3333    378.3333    379.3333 380    380.6667            
16  404.3333    405.3333    406.3333 407    407.6667    7   409     9
17  431.3333    432.3333    433.3333 434    434.6667    7   436     9
18  458.3333    459.3333    460.3333 461    461.6667    7   463     9
19  485.3333    486.3333    487.3333 488    488.6667            



DF2
Name    Paper   Book    Mug soap    computer    tablet  coffee  coupons
7   161.3333    162.3333    163.3333    164 164.6667    6   6   6
12  296.3333    297.3333    298.3333    299 299.6667    88  96  25
15  377.3333    378.3333    379.3333    380 380.6667    88  62  25
19  485.3333    486.3333    487.3333    488 488.6667    88  88  78

I want to get:

Name Paper     Book       Mug        soap   computer tablet coffee coupons
1   2           3           4         5      6          7    8      9
2   21          22         23         23    23          7    23     9
3   56          57         58         59    60          7    62     9
4   80.33333    81.33333    82.33333  83    83.66667    7    85     9
5   107.3333    108.3333    109.3333 110    110.6667    7   112     9
6   134.3333    135.3333    136.3333 137    137.6667    7   139     9
7   161.3333    162.3333    163.3333 164    164.6667    6   6       6
8   188.3333    189.3333    190.3333 191    191.6667    7   193     9
9   215.3333    216.3333    217.3333 218    218.6667    7   220     9
10  242.3333    243.3333    244.3333 245    245.6667    7   247     9
11  269.3333    270.3333    271.3333 272    272.6667    7   274     9
12  296.3333    297.3333    298.3333 299    299.6667    88  96      25      
13  323.3333    324.3333    325.3333 326    326.6667    7   328     9
14  350.3333    351.3333    352.3333 353    353.6667    7   355     9
15  377.3333    378.3333    379.3333 380    380.6667    88   62     25      
16  404.3333    405.3333    406.3333 407    407.6667    7   409     9
17  431.3333    432.3333    433.3333 434    434.6667    7   436     9
18  458.3333    459.3333    460.3333 461    461.6667    7   463     9
19  485.3333    486.3333    487.3333 488    488.6667    88  88      78

I have tried the following code:

DF1[,c(4:6)][is.na(DF1[,c(4:6)]<-DF2[,c(2:4)][match(DF1[,1],DF2[,1])] 
[which(is.na(DF1[,c(4:6)]))]

One of the solutions using dplyr will work, if I omit the columns which are already complete. Not sure if it my version of dplyr, which I have updated last week.

Any help is greatly appreciated! Thanks!

JeffB
  • 83
  • 9

1 Answers1

2

We can do a left join and then coalesce the columns

library(dplyr)
DF1 %>%
     left_join(DF2, by = c('NameVar')) %>% 
      transmute(NameVar, Var1, Var2, 
                Var3 = coalesce(Var3.x, Var3.y), 
                Var4 = coalesce(Var4.x, Var4.y), 
                Var5 = coalesce(Var5.x, Var5.y))

-output

#   NameVar Var1 Var2 Var3 Var4 Var5
#1    Sub1   30   45   40   34   65
#2    Sub2   25   30   30   45   45
#3    Sub3   74   34   25   30   49
#4    Sub4   30   45   40   34   65
#5    Sub5   25   30   69   56   72
#6    Sub6   74   34   74   34   60

Or using data.table

library(data.table)
nm1 <- setdiff(intersect(names(DF1), names(DF2)), 'NameVar')
setDT(DF1)[DF2, (nm1) := Map(fcoalesce, mget(nm1),
     mget(paste0("i.", nm1))), on = .(NameVar)]

data

DF1 <- structure(list(NameVar = c("Sub1", "Sub2", "Sub3", "Sub4", "Sub5", 
"Sub6"), Var1 = c(30L, 25L, 74L, 30L, 25L, 74L), Var2 = c(45L, 
30L, 34L, 45L, 30L, 34L), Var3 = c(40L, NA, NA, 40L, 69L, NA), 
    Var4 = c(34L, NA, NA, 34L, 56L, NA), Var5 = c(65L, NA, NA, 
    65L, 72L, NA)), class = "data.frame", row.names = c(NA, -6L
))

DF2 <- structure(list(NameVar = c("Sub2", "Sub3", "Sub6"), Var3 = c(30L, 
25L, 74L), Var4 = c(45L, 30L, 34L), Var5 = c(45L, 49L, 60L)),
class = "data.frame", row.names = c(NA, 
-3L))
akrun
  • 674,427
  • 24
  • 381
  • 486
  • ah, you beat me to it Arun. – Karthik S Oct 30 '20 at 17:52
  • I am getting an error code that reads: Error: unexpected '=' in : transmute(Var1:Var3, Var4 =" Error: Unexpected ',' in Var5= coalesce (df1$var5, df2$var5," and Error: Unexpected ')' in Var6= coalesce (df1$var6, df2$var6," Any ideas of what I am doing wrong? – JeffB Nov 02 '20 at 14:00
  • @JeffB Can you please check your `packageVersion('dplyr')`. I tried the code again and it is working fine – akrun Nov 02 '20 at 21:00
  • I did update my dplyr package. It works when I drop Var 2 and Var3 from the code. When I include Var2 and Var3 I get an error that reads Error: Problem with `mutate()` input `..2`. x object 'Var2' not found i Input `..2` is `Var2`. Any ideas why this is happening? – JeffB Nov 03 '20 at 14:42
  • @JeffB that is a strange error. Is the error based on the same data from my post – akrun Nov 03 '20 at 19:57
  • It is not. But I did make another mock dataset, and still have the same issues. – JeffB Nov 05 '20 at 13:59
  • @JeffB Can you update. with the `dput` of your example datasets so that I can test it – akrun Nov 05 '20 at 21:00
  • @akrun, Sorry I just saw this. I have updated the data set to which I tried your code. I am getting the same error code. Thank you for your help! – JeffB Nov 10 '20 at 13:32
  • @JeffB Are those `NA` or `""` in your data. If it was `dput`, it would have been easier to know the structure – akrun Nov 10 '20 at 20:45