
I have loaded an S3 bucket of JSON files and parsed/flattened them into a pandas dataframe. Now I have a dataframe with 175 columns, 4 of which contain personally identifiable information.

I am looking for a quick solution to anonymise those columns (name & address). I need the mapping to be consistent for duplicates, so that names or addresses of the same person occurring multiple times receive the same hash.

Is there existing functionality in pandas or some other package I can use for this?

JanBennk

4 Answers


Using a Categorical would be an efficient way to do this. The main caveat is that the numbering is based solely on the ordering in the data, so some care is needed if the numbering scheme must stay consistent across multiple columns / datasets.

df = pd.DataFrame({'ssn': [1, 2, 3, 999, 10, 1]})

df['ssn_anon'] = df['ssn'].astype('category').cat.codes

df
Out[38]: 
   ssn  ssn_anon
0    1         0
1    2         1
2    3         2
3  999         4
4   10         3
5    1         0
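The cross-dataset caveat can be handled by fixing the category list explicitly so the same value always gets the same code. A minimal sketch, assuming two hypothetical dataframes that share a shared category index:

```python
import pandas as pd

df1 = pd.DataFrame({'ssn': [1, 2, 999]})
df2 = pd.DataFrame({'ssn': [999, 1, 10]})

# Build one shared, sorted category list covering both frames,
# so codes agree regardless of row order within each frame
cats = pd.Index(sorted(set(df1['ssn']) | set(df2['ssn'])))
df1['ssn_anon'] = pd.Categorical(df1['ssn'], categories=cats).codes
df2['ssn_anon'] = pd.Categorical(df2['ssn'], categories=cats).codes
# 999 receives the same code (3) in both frames
```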
chrisb
  • is there any way to get the real values back, if you remove the `ssn` column? Can there be a decryption for this type of method? I assume not – Snow Oct 23 '18 at 12:54

You can use `ngroup` or `factorize` from pandas:

df.groupby('ssn').ngroup()
Out[25]: 
0    0
1    1
2    2
3    4
4    3
5    0
dtype: int64

pd.factorize(df.ssn)[0]
Out[26]: array([0, 1, 2, 3, 4, 0], dtype=int64)

If you are doing ML, I would recommend sklearn's `LabelEncoder` for this:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.ssn).transform(df.ssn)

Out[31]: array([0, 1, 2, 4, 3, 0], dtype=int64)
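Applied to the question's setting, the same idea covers several PII columns at once. A sketch with hypothetical `name`/`address` columns — repeated values share a code within each column:

```python
import pandas as pd

df = pd.DataFrame({
    'name':    ['John', 'Mary', 'John'],
    'address': ['21 Grove Rd', '48 Brewer St', '21 Grove Rd'],
})

# Factorize each PII column independently; duplicates get the same code
for col in ['name', 'address']:
    df[col + '_anon'] = pd.factorize(df[col])[0]
# rows 0 and 2 receive identical codes in both anonymised columns
```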
BENY

You seem to be looking for a way to encrypt the strings in your dataframe. There are a number of Python encryption libraries, such as cryptography.

Using it is pretty simple: just apply it to each element.

import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame([{'a': 'a', 'b': 'b'}, {'a': 'a', 'b': 'c'}])
# Fernet needs a 32-byte url-safe base64 key, not an arbitrary password;
# keep this key safe if you need to decrypt later
key = Fernet.generate_key()
f = Fernet(key)
res = df.applymap(lambda x: f.encrypt(bytes(x, 'utf-8')))
# Decrypt
res.applymap(lambda x: f.decrypt(x))

That is probably the best approach in terms of security, but it generates a long byte string and is hard to look at.

# 'a' -> b'gAAAAABaRQZYMjB7wh-_kD-VmFKn2zXajMRUWSAeridW3GJrwyebcDSpqyFGJsCEcRcf68ylQMC83G7dyqoHKUHtjskEtne8Fw=='
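One thing to note: Fernet encryption is randomized, so the same input produces a different ciphertext on each call. If identical names must map to identical tokens, as the question asks, a keyed hash (HMAC) is a deterministic alternative — a minimal sketch, with a hypothetical secret key:

```python
import hashlib
import hmac

import pandas as pd

SECRET_KEY = b'change-me'  # hypothetical key; store it securely

def pseudonymise(value):
    # HMAC-SHA256 is deterministic: equal inputs give equal digests,
    # and the digest cannot be reversed without brute force
    return hmac.new(SECRET_KEY, value.encode('utf-8'), hashlib.sha256).hexdigest()

df = pd.DataFrame({'name': ['John', 'Mary', 'John']})
df['name_anon'] = df['name'].map(pseudonymise)
# rows 0 and 2 get the same token; row 1 gets a different one
```

Unlike plain `hashlib.sha256`, the secret key prevents anyone without it from confirming a guessed name by hashing it themselves.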

Another simple way to solve your problem is to create a function that maps each key to a value and creates a new value whenever a new key appears.

mapper = {}
def encode(x):
    if x not in mapper:
        # This part can be changed to anything really,
        # e.g. mapper[x] = randint(-10**10, 10**10)
        # Just ensure values do not repeat
        mapper[x] = len(mapper) + 1
    return mapper[x]

res = df.applymap(encode)
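A quick sanity check of this mapping approach — a sketch showing that repeats stay stable, and that keeping the mapper around lets you reverse the process (which also answers the decryption question in the comments on the accepted answer):

```python
import pandas as pd

mapper = {}
def encode(x):
    if x not in mapper:
        mapper[x] = len(mapper) + 1
    return mapper[x]

df = pd.DataFrame({'name': ['John', 'Mary', 'John']})
anon = df.applymap(encode)  # both 'John' rows get the same code

# Inverting the mapper recovers the originals, so store it somewhere safe
reverse = {v: k for k, v in mapper.items()}
restored = anon.applymap(reverse.get)
assert restored.equals(df)
```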
Jt Miclat

It sounds a bit like you want to be able to reverse the process by maintaining a key somewhere. If your use case allows, I would suggest replacing all the values with valid, human-readable and irreversible placeholders.

John > Mark

21 Hammersmith Grove rd > 48 Brewer Street

This is good for generating usable test data for remote devs etc. You can use Faker to generate replacement values yourself. If you want to maintain some utility in your data, i.e. "replace all addresses with alternate addresses within 2 miles", you could use an API I'm working on called Anon AI. We parse JSON from S3 buckets, find all the PII automatically (including in free-text fields) and replace it with placeholders given your spec. We can keep consistency and reversibility if required, and it is most useful if you want to keep a "live" anonymous version of a growing data set. We're in beta at the moment, so let me know if you would be interested in testing it out.

rimeice