Home > database >  How can I transform multiple variations of the same data into one format?
How can I transform multiple variations of the same data into one format?

Time:10-26

I have read-only access to a database where I need to pull data from.

One of the columns is for social security numbers(US). The data is not formatted correctly and I am dealing with entries like:

  • 123-45-0000
  • 123 45 0000
  • 123450000
  • 123-45 0000
  • 123 45-0000
  • 123,45-0000
  • 123-450000
  • 12345-0000
  • 123,45,0000
  • 123,45,0000,

Our system needs the SSN to look like this: 123-45-0000

I was able to figure out how to properly format the 3rd one(123450000) with this:

DECLARE @ssn CHAR(12) = 123450000
SELECT stuff(stuff(@ssn, 4, 0, '-'), 7, 0, '-') AS Social_Security_Number

But is there a way to fit all these other ones into the proper format?

CodePudding user response:

Stop worrying about which ones meet your formatting criteria, which ones are close, and how many different ways they're wrong. Just strip the formatting altogether and store numeric only; apply the formatting elsewhere. Much easier to constrain, too.

-- numeric only:

SELECT 
  REPLACE(REPLACE(REPLACE(garbage,'-',' '),',',' '),' ','')
FROM dbo.SSNs;

-- with formatting added back:

SELECT STUFF(STUFF(
  REPLACE(REPLACE(REPLACE(garbage,'-',' '),',',' '),' ',''),
  4,0,'-'),7,0,'-')
FROM dbo.SSNs;

Working example in this fiddle.

Once you've fixed the existing data, add a constraint that says something like LEN(SSN) = 9 AND SSN NOT LIKE '%[^0-9]%' - strip any non-digits from the input on the way in, fail any write that doesn't do that, and apply the dashes - which are presentation only - in the UI (or in a view or a computed column if it can't be where it belongs).

  • Related