0

I am attempting to re-map data to their encrypted IDs (I no longer have access to the mapping key that links the encrypted ID with the participant ID)

In my case, df1 has 95K rows and dh2=94K rows, both have identical columns (n=360). I want to merge df1 and df2 (with different number of observations) by all the columns.

For a reproducible example:

df1 = data.frame(PID=c(1:10),
         Sex = c(rep("male", 4), rep("female", 6)),
         Age=c(rep("35",2), "27" ,rep("28", 2), rep("50",2), rep("55", 1), "66", "54")) 


df2 = data.frame(EID=c("PI_1234", "PI_1235", "PI_1236", "PI_1237", "PI_1238"),
    Sex=c("female", "female", "male", "male", "female"),
    Age=c("28", "50", "28", "27", "66") )


df3 =data.frame(PID=c(5, 7, 4, 3, 9), 
    EID=c("PI_1234", "PI_1235", "PI_1236", "PI_1237", "PI_1238"),
    Sex=c("female", "female", "male", "male", "female"),
    Age=c("28", "50", "28", "27", "66") )

I want to create df3 keeping all the observations that match (mapping the PID with the encrypted ID (EID)). Is this possible?

Artjom B.
  • 58,311
  • 24
  • 111
  • 196
Tan
  • 117
  • 7
  • How do we know that PID 5 maps to EID PI_1234, PID 7 to EID PI_1235, etc? And your example does not have identical columns like your description says – Rich Scriven Nov 24 '14 at 17:15
  • I understand the concern re: EID_1235 since both PID 6 and 7 can be potential matches here. In reality, I would have 359 variables to match on with. Given that there is only one female aged 28, EID_1234 maps on to 5. My dataframes has all columns identical except the ID column -- one has PID and the other the encrypted ID (EID). – Tan Nov 24 '14 at 17:22
  • I don't think you've provided enough information in the post. What if there are two females of age 28? They have the same ID? – Rich Scriven Nov 24 '14 at 17:43
  • Your `df1` has 2 `female-50` rows, with different `PID` (6 and 7). But your result, `df3`, has only 1 `female-50` row (with PID=6). What's going on? – jlhoward Nov 24 '14 at 18:13

2 Answers2

2

Seems like merge function is good for that

df3 <- merge(df1, df2)

Or by parameter can be used to specify by what columns to merge

df3 <- merge(df1, df2, by = c("Sex", "Age"))

If you want to reorder columns

df3 <- df3[c(3,4,1,2)]

And then ordering by PID (thank to this question)

df3[with(df3, order(PID)),]
Community
  • 1
  • 1
romants
  • 3,292
  • 1
  • 18
  • 29
  • Is there a way to do this without having to specify the exact columns to merge with? I want to merge with every column. – Tan Nov 24 '14 at 17:27
  • Yeah, I actually added this for more explicit. By default, `merge` does not take `by` parameter. By default, `by = intersect(names(x), names(y))`. – romants Nov 24 '14 at 17:28
2

Here's a data.table solution, which is likely to be faster than merge(...) on a large dataset.

library(data.table)
DT1 <- data.table(df1,key=colnames(df1)[-1])
DT2 <- data.table(df2,key=colnames(df2)[-1])
DT1[DT2,nomatch=0]
#    PID    Sex Age     EID
# 1:   5 female  28 PI_1234
# 2:   6 female  50 PI_1235
# 3:   7 female  50 PI_1235
# 4:   9 female  66 PI_1238
# 5:   3   male  27 PI_1237
# 6:   4   male  28 PI_1236

Note the result is not the same as your df3 because df1 has two rows with female - 50. These both show up in the result (as they should), but not in your df3.

jlhoward
  • 52,898
  • 6
  • 81
  • 125