
I'm trying to mask PII in a JSON file.

The file contains several email addresses, and I would like to replace each one with a different mock email address.

For example:

"results":

[{ "email1@domain1.com",

"email2@domain2.com",

"email3@domain3.com",

"email4@domain4.com",

"email5@domain5.com" }]

I need to change them to:

"results":

[{ "mockemail1@mockdomain1.com",

"mockemail2@mockdomain2.com",

"mockemail3@mockdomain3.com",

"mockemail4@mockdomain4.com",

"mockemail5@mockdomain5.com" }]

Using sed and a regex, I have been able to replace all the addresses with one and the same mock address, but I would like each email to be replaced with a different mock email.

The mock email addresses are stored in a file. To get a random address I use:

RandomEmail=$(shuf -n 1 Mock_data.csv | cut -d "|" -f 3)

Any ideas? Thanks!

David Kon
  • Don't parse JSON with `cut`; download and install a JSON-aware command-line tool such as `jq`. – Inian Jun 04 '18 at 07:46
  • Your input is not valid JSON - is your real input valid? If so, please edit your question to provide us with a more complete example. – Tom Fenech Jun 04 '18 at 08:16

5 Answers


I saved the input file (the one with the emailX@domainX.com addresses) as /tmp/1 and created a file /tmp/2 containing the mock addresses:

mockemail1@mockdomain1.com
mockemail2@mockdomain2.com
mockemail3@mockdomain3.com
mockemail4@mockdomain4.com
mockemail5@mockdomain5.com

First I extract the list of email addresses from /tmp/1 and shuffle the mock addresses. Then I use paste to join each email with a shuffled mock address, giving one tab-separated pair per line. Each "email mockemail" line is then converted into a sed command of the form s/email/mockemail/; and the resulting script is passed to sed. Finally, that sed runs with /tmp/1 as stdin and substitutes each email with a random mock address.
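For anyone who finds a nested one-liner hard to follow, the same sequence of steps (extract, shuffle, pair, substitute) can be written out in Python. This is only an illustrative sketch, not part of the shell solution: the name `pair_and_replace` and the simplified `EMAIL_RE` pattern are assumptions made for the example.

```python
import random
import re

# Simplified address pattern, assumed for this sketch (not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pair_and_replace(text, mock_emails, rng=random):
    """Replace every distinct address in text with a distinct mock address."""
    # Step 1: extract the addresses, keeping first-seen order and dropping repeats.
    originals = list(dict.fromkeys(EMAIL_RE.findall(text)))
    if len(originals) > len(mock_emails):
        raise ValueError("not enough mock addresses")
    # Step 2: shuffle a copy of the mock list (the role of shuf).
    shuffled = list(mock_emails)
    rng.shuffle(shuffled)
    # Step 3: pair originals with mocks (the role of paste).
    mapping = dict(zip(originals, shuffled))
    # Step 4: substitute in a single pass (the role of the generated sed script).
    return EMAIL_RE.sub(lambda m: mapping[m.group(0)], text)


if __name__ == "__main__":
    sample = '[{ "email1@domain1.com",\n"email2@domain2.com" }]'
    mocks = ["mockemail1@mockdomain1.com", "mockemail2@mockdomain2.com"]
    print(pair_and_replace(sample, mocks))
```

Because the sketch goes through a mapping, an address that occurs twice gets the same mock address both times. The shell one-liner that carries out these steps is: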

sed "$(paste <(sed -n 's/.*"\(.*@.*\.com\)".*/\1/p' /tmp/1) <(shuf /tmp/2) | sed 's#\(.*\)\t\(.*\)#s/\1/\2/#' | tr '\n' ';')" </tmp/1

This produces:

"results":

[{ "mockemail1@mockdomain1.com",

"mockemail3@mockdomain3.com",

"mockemail5@mockdomain5.com",

"mockemail4@mockdomain4.com",

"mockemail2@mockdomain2.com" }]
KamilCuk

input.json: your JSON file. Make sure it ends with a newline (not shown in this example); otherwise bash's read will skip the last line.

"results":

[{ "email1@domain1.com",

"email2@domain2.com",

"email3@domain3.com",

"email4@domain4.com",

"email5@domain5.com" }]

substitutions.txt: one search;replace pair per line. This file must also end with a newline (not shown in this example), for the same reason.

domain1.com;mockdomain1.com
domain2.com;mockdomain2.com
domain3.com;mockdomain3.com
domain4.com;mockdomain4.com
domain5.com;mockdomain5.com

script.sh

#!/bin/bash
# Start from an empty output file so reruns do not append to old results.
: > output.json

while IFS= read -r _line; do
  unset _ResultLine

  # Each line of substitutions.txt has the form search;replace
  while IFS=';' read -r _strSearch _strReplace; do
    if grep -qF "@$_strSearch" <<< "$_line"; then
      # Note: awk's sub() treats strSearch as a regex, so "." matches any character.
      awk -v strSearch="@$_strSearch" -v strReplace="@$_strReplace" \
        '{sub(strSearch, strReplace); print}' <<< "$_line" >> output.json
      _ResultLine="ok"
      break  # write each input line only once
    fi
  done < substitutions.txt

  [ "$_ResultLine" != "ok" ] && echo "$_line" >> output.json
done < input.json
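
As an aside, the same table-driven replacement can be sketched without a shell loop. The Python below is only an illustration that assumes the search;replace layout of substitutions.txt shown above; `load_substitutions` and `replace_domains` are hypothetical helper names, not part of script.sh.

```python
def load_substitutions(lines):
    """Parse 'search;replace' lines into a list of (search, replace) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. around a trailing newline
        search, replace = line.split(";", 1)
        pairs.append((search, replace))
    return pairs


def replace_domains(text, pairs):
    """Replace '@search' with '@replace' for every pair, as literal strings."""
    for search, replace in pairs:
        # The leading '@' keeps 'domain1.com' from matching inside 'mockdomain1.com'.
        text = text.replace("@" + search, "@" + replace)
    return text


if __name__ == "__main__":
    subs = load_substitutions(["domain1.com;mockdomain1.com", "domain2.com;mockdomain2.com"])
    print(replace_domains('[{ "email1@domain1.com",\n"email2@domain2.com" }]', subs))
```

Since `str.replace` works on literal strings, the dots in the domain names are not treated as wildcards. The shell script itself writes its result to the following file: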

output.json

"results":

[{ "email1@mockdomain1.com",

"email2@mockdomain2.com",

"email3@mockdomain3.com",

"email4@mockdomain4.com",

"email5@mockdomain5.com" }]
  • More importantly, it can be done without the bugs and more efficiently. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for some of the issues. – Ed Morton Jun 04 '18 at 17:37
  • Thanks for the info @Ed – David Peltier Jun 04 '18 at 18:10

Given these input files:

$ cat file1
"results":

[{ "email1@domain1.com",

"email2@domain2.com",

"email3@domain3.com",

"email4@domain4.com",

"email5@domain5.com" }]

$ cat file2
foo|bar|mockemail1@mockdomain1.com|etc
foo|bar|mockemail2@mockdomain2.com|etc
foo|bar|mockemail3@mockdomain3.com|etc
foo|bar|mockemail4@mockdomain4.com|etc
foo|bar|mockemail5@mockdomain5.com|etc

all you need is:

$ shuf file2 | awk 'NR==FNR{a[NR]=$3;next} /@/{$2=a[++c]} 1' FS='|' - FS='"' OFS='"' file1
"results":

[{ "mockemail2@mockdomain2.com",

"mockemail4@mockdomain4.com",

"mockemail5@mockdomain5.com",

"mockemail1@mockdomain1.com",

"mockemail3@mockdomain3.com" }]
Ed Morton

Quick and dirty implementation with Python:

Assumption:

You have well-formed JSON input:

{
    "results":
    [
        "email1@domain1.com",
        "email2@domain2.com",
        "email3@domain3.com",
        "email4@domain4.com",
        "email5@domain5.com"
    ]
}

You can validate your JSON at https://jsonformatter.curiousconcept.com/

code:

import json
import sys

# Read the whole JSON document from stdin.
json_dict = json.loads(sys.stdin.read())

# Prefix every address in "results" with "mock".
results_dict = {"results": ["mock" + elem for elem in json_dict["results"]]}
print(json.dumps(results_dict))

command:

$ echo '{"results":["email1@domain1.com","email2@domain2.com","email3@domain3.com","email4@domain4.com","email5@domain5.com"]}' | python jsonConvertor.py 
{"results": ["mockemail1@domain1.com", "mockemail2@domain2.com", "mockemail3@domain3.com", "mockemail4@domain4.com", "mockemail5@domain5.com"]}
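
The script above only prefixes each address with "mock". If the replacements should instead come from the pipe-delimited Mock_data.csv mentioned in the question (address in the third field), a variant could look like the sketch below. It assumes the well-formed JSON shape shown earlier, and `read_mock_emails` and `mask_results` are hypothetical names chosen for the example.

```python
import json
import random


def read_mock_emails(lines, field=2, sep="|"):
    """Return the given 0-based field (default: third column) of each non-empty line."""
    return [line.rstrip("\n").split(sep)[field] for line in lines if line.strip()]


def mask_results(json_text, mock_emails, rng=random):
    """Replace each entry of the 'results' list with a distinct random mock address."""
    data = json.loads(json_text)
    # sample() draws without replacement, so no mock address is used twice;
    # it raises ValueError if there are fewer mocks than results.
    data["results"] = rng.sample(mock_emails, len(data["results"]))
    return json.dumps(data)


if __name__ == "__main__":
    mocks = read_mock_emails([
        "foo|bar|mockemail1@mockdomain1.com|etc\n",
        "foo|bar|mockemail2@mockdomain2.com|etc\n",
    ])
    print(mask_results('{"results": ["email1@domain1.com", "email2@domain2.com"]}', mocks))
```

In real use the list passed to `read_mock_emails` would be an open file handle for Mock_data.csv rather than a literal list. Any other keys in the JSON document are left untouched.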
Allan

A friend of mine suggested the following elegant solution that works in two parts:

  1. Replace every email address with a placeholder string.

    sed -E -i 's/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b/EMAIL_TO_REPLACE/g' data.json

  2. Loop once per placeholder, each time replacing its first remaining occurrence with a random email from the mock file (this relies on GNU sed's 0,/regex/ address):

    for email in $(grep -o EMAIL_TO_REPLACE data.json); do
        sed -i '0,/EMAIL_TO_REPLACE/s//'"$(shuf -n 1 Mock_data.csv | cut -d "|" -f 3)"'/' data.json
    done

And that's it.

Thanks Elina!
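
As a sketch only, the two steps can be collapsed into a single pass in Python by giving `re.sub` a function that picks a mock address for each match, which avoids rewriting the file once per address. The pattern is the one from step 1; `mask_emails` is a hypothetical name, and like the loop above it draws independently each time, so the same mock address may be picked more than once.

```python
import random
import re

# Same pattern as the sed command in step 1.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b")


def mask_emails(text, mock_emails, rng=random):
    """Replace each address with an independently chosen random mock address."""
    # re.sub calls the function once per match, so each occurrence gets its own draw.
    return EMAIL_RE.sub(lambda match: rng.choice(mock_emails), text)


if __name__ == "__main__":
    mocks = ["mockemail1@mockdomain1.com", "mockemail2@mockdomain2.com"]
    print(mask_emails('[{ "email1@domain1.com",\n"email2@domain2.com" }]', mocks))
```

If every address must get a different mock, the mock list could be shuffled once and consumed in order instead of calling `choice`.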

David Kon
  • There's nothing elegant about that, it's slow, non-portable, and buggy. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for some of the issues. – Ed Morton Jun 04 '18 at 18:07