
How can I de-identify data that is already in a BigQuery table, then re-identify it and load it into another BigQuery table?

Thanks

sm_patel
  • Have you tried the article ["Validating de-identified data in BigQuery and re-identifying PII data"](https://cloud.google.com/solutions/validating-de-identified-data-bigquery-re-identifying-pii-data#objectives)? It is the fourth in a series of four, so I recommend going through the whole series to see which parts are useful for you. The first one is ["De-identification and re-identification of PII in large-scale datasets using Cloud DLP"](https://cloud.google.com/solutions/de-identification-re-identification-pii-using-cloud-dlp). – July Jan 12 '21 at 00:56
  • Yes, I have checked this. But my goal is to read the data from BigQuery itself, de-identify it, and load it into another table. – sm_patel Jan 12 '21 at 18:00

2 Answers


The easiest way to do this is to use a Dataflow pipeline that calls deidentifyContent.

Jordanna Chord

I have created an example based on "de-identifying sensitive data". It de-identifies the data with transformations from the Data Loss Prevention (DLP) API, and the result can then be inserted into BigQuery. To re-identify the data, however, it must have been de-identified into cryptographic tokens. To achieve this, you can use the cryptographic methods supported in Cloud DLP:

  • Deterministic encryption using AES-SIV
  • Format preserving encryption
  • Cryptographic hashing (one-way, so hashed values cannot be re-identified)

This post describes the keys and steps needed for the transformation using deterministic encryption; however, you can use whichever method you prefer.
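As a rough sketch of the deterministic encryption route, the helpers below build a de-identify config and the matching re-identify request. The KMS key name, the wrapped key, and the surrogate name `TOKEN` are placeholders I made up for illustration, not values from the post; the request shapes are meant to match the `dlp_v2` client used in the example further down.

```python
import base64

SURROGATE = "TOKEN"  # hypothetical surrogate info type name, pick your own


def crypto_transformation(kms_key_name, wrapped_key_b64, info_types=None):
    """Build one deterministic encryption (AES-SIV) transformation."""
    transformation = {
        "primitive_transformation": {
            "crypto_deterministic_config": {
                "crypto_key": {
                    "kms_wrapped": {
                        # AES key wrapped with Cloud KMS, passed in as base64 text.
                        "wrapped_key": base64.b64decode(wrapped_key_b64),
                        "crypto_key_name": kms_key_name,
                    }
                },
                # Tokens are labelled with this name so they can be found again.
                "surrogate_info_type": {"name": SURROGATE},
            }
        }
    }
    if info_types:
        transformation["info_types"] = [{"name": name} for name in info_types]
    return transformation


def build_deidentify_config(kms_key_name, wrapped_key_b64):
    """Config for deidentify_content: encrypt every detected info type."""
    return {
        "info_type_transformations": {
            "transformations": [
                crypto_transformation(kms_key_name, wrapped_key_b64)
            ]
        }
    }


def build_reidentify_request(parent, kms_key_name, wrapped_key_b64, item):
    """Request for reidentify_content: decrypt values tagged with SURROGATE."""
    return {
        "parent": parent,
        "reidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    crypto_transformation(
                        kms_key_name, wrapped_key_b64, info_types=[SURROGATE]
                    )
                ]
            }
        },
        # The surrogate must be declared as a custom info type so that the
        # inspection step can locate the tokens.
        "inspect_config": {
            "custom_info_types": [
                {"info_type": {"name": SURROGATE}, "surrogate_type": {}}
            ]
        },
        "item": item,
    }


def reidentify(request):
    """Send the request; needs google-cloud-dlp and valid credentials."""
    from google.cloud import dlp_v2  # deferred so the builders work offline

    return dlp_v2.DlpServiceClient().reidentify_content(request=request).item
```

The same key and surrogate name have to be used on both sides, otherwise the tokens cannot be decrypted.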

As the other answer says, the easiest way to do this is to follow the Dataflow pipeline tutorial.

Last but not least, here is an example of de-identifying by replacement. The input is generated with the Faker library, which I used to simulate your BigQuery data.


from faker import Faker
from google.cloud import dlp_v2

fake = Faker()
dlp = dlp_v2.DlpServiceClient()

def create_fake_data(data_length=5):
    data = []
    headers = [
        {"name": "name"}, {"name": "email"},
        {"name": "credit_card"}, {"name": "credit_card_provider"},
        {"name": "phone"}]

    for i in range(data_length):
        data.append({"values":
                     [
                         {"string_value": fake.unique.first_name()},
                         {"string_value": fake.free_email()},
                         {"string_value": fake.unique.credit_card_number()},
                         {"string_value": fake.credit_card_provider()},
                         {"string_value": fake.phone_number()},
                     ]
                     }
                    )

    return {"table": {"headers": headers, "rows": data}}

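def deidentify_with_replace(item):
    """Replace every detected info type in `item` with a fixed string."""
    parent = "projects/julio-castor-mx"  # replace with your own project ID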

    inspect_config = {"info_types": [
        {"name": "EMAIL_ADDRESS"},
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "PHONE_NUMBER"}]}

    deidentify_config = {
        "info_type_transformations": {
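            "transformations": [
                # No "info_types" listed here, so the replacement applies to
                # every info type requested in inspect_config.
                {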
                    "primitive_transformation": {
                        "replace_config": {
                            "new_value": {"string_value": "[IDENTIFIED]"},
                        }
                    }
                }
            ]
        }
    }

    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item
        }
    )

    return response.item


if __name__ == "__main__":
    data = create_fake_data()
    deidentify_data = deidentify_with_replace(data)
    print(deidentify_data)  # inspect the de-identified table

Please note that the info types in inspect_config determine which DLP detectors are used to find the values to de-identify. These detectors are listed here. To de-identify a table, use the Table object.
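To connect this to real BigQuery data instead of Faker output, one possible approach is to convert query rows into a DLP table item and back. The sketch below is only an assumption about how that could look: `copy_deidentified`, its parameters and the table names are hypothetical, every value is sent as a string (so the destination columns would need to be STRING), the destination table must already exist, and large tables would have to be split into batches.

```python
def rows_to_dlp_item(rows, columns):
    """Turn a list of row dicts into the DLP table item structure."""
    headers = [{"name": column} for column in columns]
    table_rows = [
        {
            "values": [
                # DLP values are sent as strings; NULLs become empty strings.
                {"string_value": "" if row.get(column) is None else str(row[column])}
                for column in columns
            ]
        }
        for row in rows
    ]
    return {"table": {"headers": headers, "rows": table_rows}}


def dlp_table_to_rows(table):
    """Turn the table of a DLP response item back into a list of row dicts."""
    columns = [header.name for header in table.headers]
    return [
        {column: value.string_value for column, value in zip(columns, row.values)}
        for row in table.rows
    ]


def copy_deidentified(source_table, dest_table, columns, deidentify):
    """Read source_table, de-identify it and stream it into dest_table.

    `deidentify` is any function that takes a DLP item and returns the
    de-identified item, for example deidentify_with_replace from above.
    """
    from google.cloud import bigquery  # deferred so the helpers work offline

    client = bigquery.Client()
    sql = f"SELECT {', '.join(columns)} FROM `{source_table}`"
    rows = [dict(row.items()) for row in client.query(sql).result()]

    item = deidentify(rows_to_dlp_item(rows, columns))

    errors = client.insert_rows_json(dest_table, dlp_table_to_rows(item.table))
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

A call could look like `copy_deidentified("my-project.my_dataset.source", "my-project.my_dataset.target", ["name", "email", "phone"], deidentify_with_replace)`, where the table names are placeholders.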

You can find more examples in this repository.

July