I have read-only access to a database where I need to pull data from.
One of the columns is for social security numbers(US). The data is not formatted correctly and I am dealing with entries like:
- 123-45-0000
- 123 45 0000
- 123450000
- 123-45 0000
- 123 45-0000
- 123,45-0000
- 123-450000
- 12345-0000
- 123,45,0000
- 123,45,0000,
Our system needs the SSN to look like this: 123-45-0000
I was able to figure out how to properly format the 3rd one(123450000) with this:
DECLARE @ssn CHAR(12) = 123450000
SELECT stuff(stuff(@ssn, 4, 0, '-'), 7, 0, '-') AS Social_Security_Number
But is there a way to fit all these other ones into the proper format?
CodePudding user response:
Stop worrying about which ones meet your formatting criteria, which ones are close, and how many different ways they're wrong. Just strip the formatting altogether and store numeric only; apply the formatting elsewhere. Much easier to constrain, too.
-- numeric only:
SELECT
REPLACE(REPLACE(REPLACE(garbage,'-',' '),',',' '),' ','')
FROM dbo.SSNs;
-- with formatting added back:
SELECT STUFF(STUFF(
REPLACE(REPLACE(REPLACE(garbage,'-',' '),',',' '),' ',''),
4,0,'-'),7,0,'-')
FROM dbo.SSNs;
Working example in this fiddle.
Once you've fixed the existing data, add a constraint that says something like LEN(SSN) = 9 AND SSN NOT LIKE '%[^0-9]%'
- strip any non-digits from the input on the way in, fail any write that doesn't do that, and apply the dashes - which are presentation only - in the UI (or in a view or a computed column if it can't be where it belongs).