Using regex to mask SSN in a script (bash / perl / python)

Time:06-16

I'm trying to write a small script (preferably in bash, but python or perl would also work) to mask the first 5 digits of an SSN, in either format: 123456789 or 123-45-6789. The output should be XXXXX6789 or XXX-XX-6789 respectively. The input is in a text file.

I know I should be able to do this with sed, but I'm having trouble with creating the right regex (and then I have to do the substitution). It should properly handle all these use cases:

123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

So the SSN can occur at the beginning of a line, in the middle somewhere, or at the end.

The output (for the first two lines, for example) should have the first 5 digits masked, say with Xs:

XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.

I've managed to get a grep regex that correctly matches only the expressions I want:

grep '\b[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}\b' testfile

I think I should be able to use grouping in sed or awk to get the results I want, but none of the things I've tried have worked.

CodePudding user response:

Using sed

$ sed '/\<[0-9]\{9\}\>\|\<[0-9-]\{11\}\>/{s/[0-9]\{5\}/XXXXX/;s/[0-9]\{3\}-[0-9]\{2\}/XXX-XX/g}' input_file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

CodePudding user response:

With GNU awk for the 3rd arg to match() and gensub() and \< \> word boundaries:

$ awk '
    match($0,/(.*)(\<[0-9]{3}-?[0-9]{2})(-?[0-9]{4}\>.*)/,a) {
        $0 = a[1] gensub(/[0-9]/,"X","g",a[2]) a[3]
    }
1' file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
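
For comparison, the same idea (capture the part to mask, then replace every digit in it with X) can be sketched in Python. This is only a rough equivalent, not a translation of the awk script: the names SSN, mask_prefix and mask_line are made up for illustration, and Python's \b stands in for gawk's \< and \>. One difference worth noting is that the awk script above calls match() once per line, so it masks a single SSN per line, whereas re.sub visits every match.

```python
import re

# Group 1 is the part to mask (3 digits, optional dash, 2 digits);
# group 2 is the part to keep (optional dash, 4 digits).
# SSN, mask_prefix and mask_line are hypothetical names for this sketch.
SSN = re.compile(r"\b([0-9]{3}-?[0-9]{2})(-?[0-9]{4})\b")


def mask_prefix(match):
    # Replace every digit in group 1 with X; a dash, if present, is left alone.
    return re.sub(r"[0-9]", "X", match.group(1)) + match.group(2)


def mask_line(line):
    # re.sub calls mask_prefix for every non-overlapping match in the line.
    return SSN.sub(mask_prefix, line)


print(mask_line("Mask this 123-45-6789 SSN please"))
```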

CodePudding user response:

perl -lpe 's/\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b/XXX${1}XX$2$3/g'

You only need to capture the things that will end up in the output: the (possible) dashes and the last four digits. And, Perl's regex syntax eliminates unnecessary backslashes, which is nice.

(Specifically, in perl regex, "magic" functions are always attached to punctuation without backslashes, or alphanumerics with backslashes; backslashing punctuation will always make it non-special.)
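
Since the question also allows Python, here is a minimal sketch of the same substitution with the re module, capturing only the optional dashes and the last four digits. The function name mask_ssn is an assumption for illustration. The \g<1> form is used instead of \1 simply to keep the group references visually separate from the literal X characters.

```python
import re

# Same pattern as the perl one-liner: only the optional dashes and the
# last four digits are captured, because only they reappear in the output.
# SSN and mask_ssn are hypothetical names for this sketch.
SSN = re.compile(r"\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b")


def mask_ssn(text):
    # \g<1> and \g<2> re-insert the dashes (or nothing); \g<3> keeps the
    # last four digits.
    return SSN.sub(r"XXX\g<1>XX\g<2>\g<3>", text)


print(mask_ssn("Don't miss 123456789 either."))
```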

CodePudding user response:

Assuming the first 8 lines should have a mask applied (leaving the last 3 lines untouched), and modifying the input file so that the first 2 lines each contain two matching SSN patterns:

$ cat testfile
123456789 needs to be matched (and again 123-45-6789)
123-45-6789 does, too (and again 123456789)
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

One sed idea using a modified version of OP's regex:

sed -r 's/\b([0-9]{3})(-{0,1})([0-9]{2})(-{0,1}[0-9]{4})\b/XXX\2XX\4/g' testfile

Where:

  • -r - enable extended regex support (eliminates need to escape parens and braces)
  • ([0-9]{3}) - match 3 digits (1st capture group)
  • (-{0,1}) - match optional - (2nd capture group)
  • ([0-9]{2}) - match 2 digits (3rd capture group)
  • (-{0,1}[0-9]{4}) - match optional - followed by 4 digits (4th capture group)
  • XXX\2XX\4 - replace 1st capture group with XXX, print 2nd capture group as is, replace 3rd capture group with XX, print 4th capture group as is
  • g - apply to all matches in a line
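
To illustrate what the g flag changes, here is a small Python sketch using the same four capture groups. The function name mask and its first_only parameter are assumptions for this example. In Python's re.sub, replacing every match is the default (count=0), so it is the single-replacement case that has to be requested explicitly.

```python
import re

# Same four capture groups as the sed expression above.
# SSN, mask and first_only are hypothetical names for this sketch.
SSN = re.compile(r"\b([0-9]{3})(-?)([0-9]{2})(-?[0-9]{4})\b")


def mask(line, first_only=False):
    # count=0 replaces every match (like sed's g flag);
    # count=1 replaces only the first match (like sed without g).
    count = 1 if first_only else 0
    return SSN.sub(r"XXX\2XX\4", line, count=count)


line = "123456789 needs to be matched (and again 123-45-6789)"
print(mask(line))
print(mask(line, first_only=True))
```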

This generates:

XXXXX6789 needs to be matched (and again XXX-XX-6789)
XXX-XX-6789 does, too (and again XXXXX6789)
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

CodePudding user response:

Grep invert match regex (fixed):

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' input-file.txt

Grep options:

  • -v: Inverts match (prints everything without a match).
  • -E: Uses the Extended regex grammar for the pattern.

Regex detail:

  • ([^0-9]|^): Matches a non-digit or beginning of line.
  • [0-9]{3}-?: Matches 3 digits optionally followed by a dash.
  • [0-9]{2}-?: Matches 2 digits optionally followed by a dash.
  • [0-9]{4}: Matches 4 digits.
  • ([^0-9]|$): Matches a non-digit or end of line.
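
The same filter can be sketched in Python, reusing the pattern piece by piece. The names SSN_LINE and lines_without_ssn are assumptions for illustration. Like grep -v, it keeps only the lines in which the pattern is not found.

```python
import re

# Same pieces as the grep pattern: a non-digit or start of line, the
# 3-2-4 digit groups with optional dashes, then a non-digit or end of line.
# SSN_LINE and lines_without_ssn are hypothetical names for this sketch.
SSN_LINE = re.compile(r"(?:[^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}(?:[^0-9]|$)")


def lines_without_ssn(lines):
    # Keep a line only when the pattern is absent, as grep -v does.
    return [line for line in lines if not SSN_LINE.search(line)]


sample = [
    "Mask this 123-45-6789 SSN please",
    "As should 123456789",
    "But not 1234567890",
]
for line in lines_without_ssn(sample):
    print(line)
```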

Testing

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' <<'EOF'
123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
EOF

Output of test:

But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.