How can I de-identify data that is already in a BigQuery table, then re-identify it and load it into another BigQuery table?
Thanks
The easiest way to do this is to use a Dataflow pipeline that calls deidentifyContent.
I created an example based on "De-identifying sensitive data" that de-identifies the data with transformations from the Data Loss Prevention (DLP) API and then inserts the result into BigQuery. However, re-identifying the data requires cryptographic tokens. To produce them, use one of the cryptographic methods that Cloud DLP supports.
That post lists the keys and steps needed to apply the transformation with deterministic encryption, but you can use whichever method you prefer.
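As a rough sketch of the deterministic-encryption route, the helpers below build the de-identify config and the matching re-identify request. The function names, the surrogate info type name, and the key values are hypothetical; the sketch assumes you already have a Cloud KMS key and a base64-encoded wrapped key, and that the same crypto config is used for both directions.

```python
import base64


def build_crypto_config(kms_key_name, wrapped_key_b64, surrogate_type):
    """Build a deterministic-encryption transformation (hypothetical helper)."""
    return {
        "crypto_deterministic_config": {
            "crypto_key": {
                "kms_wrapped": {
                    # The API expects the raw wrapped key bytes.
                    "wrapped_key": base64.b64decode(wrapped_key_b64),
                    "crypto_key_name": kms_key_name,
                }
            },
            # The surrogate info type marks each token so it can be found later.
            "surrogate_info_type": {"name": surrogate_type},
        }
    }


def build_deidentify_config(crypto_config):
    """Config to pass as `deidentify_config` to deidentify_content."""
    return {
        "info_type_transformations": {
            "transformations": [{"primitive_transformation": crypto_config}]
        }
    }


def build_reidentify_request(parent, item, crypto_config, surrogate_type):
    """Request for reidentify_content that reverses the tokens in `item`."""
    return {
        "parent": parent,
        "reidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "info_types": [{"name": surrogate_type}],
                        "primitive_transformation": crypto_config,
                    }
                ]
            }
        },
        # A custom info type tells DLP to look for the surrogate tokens.
        "inspect_config": {
            "custom_info_types": [
                {"info_type": {"name": surrogate_type}, "surrogate_type": {}}
            ]
        },
        "item": item,
    }


def reidentify(request):
    """Send the request; needs google-cloud-dlp and valid credentials."""
    from google.cloud import dlp_v2

    return dlp_v2.DlpServiceClient().reidentify_content(request=request).item
```

The de-identify call itself stays the same as in a replace-based example; only the `deidentify_config` changes to the one returned by `build_deidentify_config`.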
As already answered, the easiest way to do this is to follow the Dataflow pipeline tutorial.
Last but not least, here is an example that de-identifies data by replacement. The input is generated with a faker library, which I used to simulate your BigQuery data.
from faker import Faker
from google.cloud import dlp_v2

fake = Faker()
dlp = dlp_v2.DlpServiceClient()


def create_fake_data(data_length=5):
    data = []
    headers = [
        {"name": "name"}, {"name": "email"},
        {"name": "credit_card"}, {"name": "credit_card_provider"},
        {"name": "phone"}]
    for i in range(data_length):
        data.append({"values":
            [
                {"string_value": fake.unique.first_name()},
                {"string_value": fake.free_email()},
                {"string_value": fake.unique.credit_card_number()},
                {"string_value": fake.credit_card_provider()},
                {"string_value": fake.phone_number()},
            ]
        }
        )
    return {"table": {"headers": headers, "rows": data}}


def deidentify_with_replace(item):
    parent = "projects/julio-castor-mx"
    inspect_config = {"info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "PHONE_NUMBER"}]}
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "replace_config": {
                            "new_value": {"string_value": "[IDENTIFIED]"},
                        }
                    }
                }
            ]
        }
    }
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item
        }
    )
    return response.item


if __name__ == "__main__":
    data = create_fake_data()
    deidentify_data = deidentify_with_replace(data)
Note that the InfoType parameter defines the DLP detectors that drive the de-identification. These detectors are listed here. To de-identify a table, use the Table object.
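To connect the Table object to real BigQuery data instead of fake data, something like the following could work. This is only a sketch: the helper names are hypothetical, every value is sent to DLP as a string, `transform` is assumed to be a callable that takes and returns the item in plain dict form, and the target table is assumed to exist already with matching string columns.

```python
def rows_to_dlp_item(rows):
    """Convert a list of dicts (one per BigQuery row) into a DLP table item."""
    if not rows:
        return {"table": {"headers": [], "rows": []}}
    columns = list(rows[0].keys())
    return {
        "table": {
            "headers": [{"name": col} for col in columns],
            "rows": [
                # Every cell is stringified; adjust if you need typed values.
                {"values": [{"string_value": str(row[col])} for col in columns]}
                for row in rows
            ],
        }
    }


def dlp_item_to_rows(item):
    """Convert a DLP table item (dict form) back into a list of dicts."""
    table = item["table"]
    columns = [header["name"] for header in table["headers"]]
    return [
        {col: value["string_value"] for col, value in zip(columns, row["values"])}
        for row in table["rows"]
    ]


def copy_table_through_dlp(source_sql, target_table_id, transform):
    """Read rows with a query, transform them via DLP, and insert the result.

    Needs google-cloud-bigquery and valid credentials.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = [dict(row) for row in client.query(source_sql).result()]
    item = transform(rows_to_dlp_item(rows))
    errors = client.insert_rows_json(target_table_id, dlp_item_to_rows(item))
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

Running this once with a de-identify transform and once with a re-identify transform, each pointing at a different target table, covers both halves of the question. For large tables a Dataflow pipeline is still the better fit, since this sketch loads all rows into memory.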
You can see more examples in this repository