
To protect research subjects from being identifiable in data sets, I'm interested in anonymizing vectors in R. However, I also want to be able to refer to the output when writing up the research (e.g. "subject [random id] showed ..."). I've found that I can use the anonymizer package to easily generate short hashes, but while referring to short hashes in writing is doable, it is not exactly ideal: "subject f4d35fab showed ..." is difficult to remember, a bit of a mouthful, and hard to distinguish from other hashed data (e.g. "subject f4d35fab from 8b3bd334 showed ...").

Is there a way either to convert hashes into random human-readable strings or to anonymize the data in a way that doesn't rely on cryptographic hashing at all?

  • How many subjects are you talking about? – Dason Mar 15 '18 at 19:27
  • Vector of what, string or numeric? Also it would help if you show sample data. And tell us N, the length of your vector. Is it closer to 100 or 1 million? – smci Mar 15 '18 at 19:33
  • Have you thought about [RSA](https://en.wikipedia.org/wiki/RSA_(cryptosystem))? It is fairly simple to implement especially given packages like [gmp](https://cran.r-project.org/web/packages/gmp/gmp.pdf). – Joseph Wood Mar 15 '18 at 20:15
  • So we are all using the same vector of the same length with the same data, how about `set.seed(1); v – smci Mar 15 '18 at 20:20
  • The original vectors would be string vectors, varying in length from 10 (in the case of the region a subject resides in, for instance, where there wouldn't be many distinct values at all) to perhaps as many as 3000 (in the case of actual subject names). – joshisanonymous Mar 15 '18 at 20:33

3 Answers


What about just assigning a random number to each subject:

> subjects <- c("Matthew", "Mark", "Luke", "John")
> subjects.anon <- sample(length(subjects))
> subjects.anon
[1] 1 4 2 3

Then you can refer to "subject 4" when discussing the data that belongs to Mark.

And if you want the numbers unrelated to the number of subjects:

sample(1000, length(subjects)) # [1] 789 103 435 983
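
To use these numbers consistently across a data set, the mapping has to be stored somewhere. The following is only a minimal sketch of that idea in base R; the names `key`, `records`, `score`, and `anon_id` are illustrative assumptions and not part of the answer above.

```r
set.seed(42)  # only so the sketch is reproducible; omit for a real study

subjects <- c("Matthew", "Mark", "Luke", "John")

# Lookup table (keep it private): one distinct random ID per unique subject.
# sample() draws without replacement by default, so the IDs cannot collide.
key <- data.frame(
  subject = subjects,
  anon_id = sample(1000, length(subjects)),
  stringsAsFactors = FALSE
)

# Hypothetical data set in which subjects appear more than once.
records <- data.frame(
  subject = c("Mark", "John", "Mark", "Luke"),
  score   = c(3.1, 2.7, 3.4, 1.9),
  stringsAsFactors = FALSE
)

# Replace the real names with their IDs, then drop the identifying column.
records$anon_id <- key$anon_id[match(records$subject, key$subject)]
records$subject <- NULL
records
```

Both rows that belonged to Mark end up with the same ID, and only someone holding `key` can map the IDs back to names.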
C. Braun

Just use a reference list of human-readable names and match it up to each unique value of the true ID. How well this works depends on how many values you need to create aliases for.

One such source is a list of baby names (here, the 1000 most common names from 2010). For example:

library(babynames)
library(dplyr)

samples <- data.frame(id=1:50, age=rnorm(50, 30, 5))    

translate <- babynames %>% filter(year==2010) %>% 
  top_n(1000, n) %>% 
  sample_n(length(unique(samples$id))) %>% 
  select(alias_id=name) %>%
  bind_cols(id=unique(samples$id))

translate
#     alias_id    id
#        <chr> <int>
#  1   Savanna     1
#  2    Jasmin     2
#  3   Natalie     3
#  4      Omar     4
#  5   Tristan     5
#  6  Jeremiah     6
#  7   Arielle     7
#  8    Tanner     8
#  9 Francesca     9
# 10     Devin    10
# # ... with 40 more rows

Now we have a translation table that we can use to swap out the real IDs for names.
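
Swapping the IDs out might look like the sketch below. So that it runs without the babynames package, it uses a small made-up `translate` table and made-up `samples` data as stand-ins; the `anonymized` name is also just an illustrative assumption.

```r
# Stand-in for a translation table like the one built above (made-up rows).
translate <- data.frame(
  alias_id = c("Savanna", "Jasmin", "Natalie"),
  id       = 1:3,
  stringsAsFactors = FALSE
)

# Hypothetical data keyed by the real ID, with one subject measured twice.
samples <- data.frame(
  id  = c(1, 2, 2, 3),
  age = c(31, 27, 27, 35)
)

# Look up each row's alias, then remove the real ID before sharing the data.
anonymized <- samples
anonymized$alias_id <- translate$alias_id[match(samples$id, translate$id)]
anonymized$id <- NULL
anonymized
```

Before relying on a table like this, it is probably worth checking that `anyDuplicated(translate$alias_id)` returns 0, since a name list can contain the same name more than once (for instance, once per sex), which would give two subjects the same alias.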

MrFlick

Take the first m characters of each hash, choosing m so that the hashes are still unique in their first m characters. (That value of m will tend to be O(log(N)), where N is the number of subjects.) Here's sample code:

set.seed(1)
# 100 random 8-letter strings standing in for hashes
v <- do.call(paste0, replicate(n=8, sample(LETTERS, size=100, replace=TRUE), simplify=FALSE))

unique_in_first_m_chars <- function(v, m) {
  length(unique(substring(v, 1, m))) == length(v)
}

unique_in_first_m_chars(v, 4)
# [1] TRUE
unique_in_first_m_chars(v, 3)
# [1] FALSE
unique_in_first_m_chars(v, 2)
# [1] FALSE
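
Rather than trying values of m by hand, the search could be automated. This is only a sketch of one way to do it; `shortest_unique_prefix` is a hypothetical helper and the `hashes` vector is made-up example data, neither of which comes from the answer above.

```r
# Smallest prefix length m at which every element of v is still distinct.
# Returns NA if v is empty or contains duplicates, since no prefix works then.
shortest_unique_prefix <- function(v) {
  if (length(v) == 0 || anyDuplicated(v) > 0) return(NA_integer_)
  for (m in seq_len(max(nchar(v)))) {
    if (anyDuplicated(substring(v, 1, m)) == 0) return(m)
  }
  NA_integer_
}

hashes <- c("f4d35fab", "f4a1c9e0", "8b3bd334", "8c07aa12")  # made-up hashes
m <- shortest_unique_prefix(hashes)
m                        # 3
substring(hashes, 1, m)  # "f4d" "f4a" "8b3" "8c0"
```

Note that the shortened IDs are only guaranteed to be unique for the current set of hashes; if subjects are added later, m may need to be recomputed.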
smci