Home > Enterprise >  Using SSIS OR T-SQL Split a column of quoted & unquoted comma separated values into multiple columns
Using SSIS OR T-SQL Split a column of quoted & unquoted comma separated values into multiple columns

Time:12-04

I have comma separated data in a column named C0.
The data in C0 looks like this:

C0
"Pacey LLC.",213830ZZ,11/1/2017,11/1/2017,"297,311.74","2,371.40",0.00,"1,325.18",0.00,42.22,"123,986.56"
Mike The Miker,9814140VCD,12/1/2018,12/1/2018,"3,917,751.99","419,743.54","36,642.66","344,090.43",0.00,10.00,"2,434,671.06"

And I want it to end up like this:

F1 F1 F3 F4 F5 F6 F7 F8 F9 F10 F11
"Pacey LLC." 213830ZZ 11/1/2017 11/1/2017 297,311.74 2,371.40 0.00 1,325.18 0.00 42.22 123,986.56
Mike The Miker 9814140VCD 12/1/2018 12/1/2018 3,917,751.99 419,743.54 36,642.66 344,090.43 0.00 10.00 2,434,671.06

I've tried nested replaces, but couldn't find a pattern to reliably search without regex which is T/SQL? I've also tried a TOKEN approach in SSIS by this feller, but neither fruitful.

The nested replace approaches got stuck on the money fields that are under 1,000 (like 0.00) and the SSIS TOKEN approach presumes all fields are quote delimited, which in my example they aren't.

CodePudding user response:

As you were told already, TSQL is the wrong tool for this. Nevertheless this can be done (at least for the set given). If this is a one-time action you might give it a try. If this is a re-occurring task in a real-life scenario I'd try to get the data in an appropriate format.

However, this would work for the given lines:

DECLARE @t1 TABLE(ID INT IDENTITY, YourString NVARCHAR(1000));
INSERT INTO @t1 VALUES(N'"Pacey LLC.",213830ZZ,11/1/2017,11/1/2017,"297,311.74","2,371.40",0.00,"1,325.18",0.00,42.22,"123,986.56"')
                     ,(N'Mike The Miker,9814140VCD,12/1/2018,12/1/2018,"3,917,751.99","419,743.54","36,642.66","344,090.43",0.00,10.00,"2,434,671.06"');

--Your data includes dates in a culture specific format (something really! bad)
--Better switch to ISO8601
--Setting the date format will help, but is NOT recommended

SET DATEFORMAT dmy;

--the first cte will use APPLY together with a computed TOP()
--This will allow to get each single character, one by one.

WITH singleChars AS
(                    
SELECT t.ID
      ,A.Pos
      ,SUBSTRING(t.YourString,A.POs,1) AS CharOnPos
FROM @t1 t
CROSS APPLY(SELECT TOP (LEN(t.YourString)) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) A(Pos) --master..spt_values can be any table with sufficient rows
)

--we continue with a recursive cte
--it will run through the string and find if we are within a quoted area or not

,recCTE AS
(
    SELECT *
          ,CASE WHEN CharOnPos='"' THEN 1 ELSE 0 END AS QuoteIsOpen
          ,CAST(CharOnPos AS NVARCHAR(MAX)) AS GrowingString
    FROM singleChars WHERE Pos=1

    UNION ALL

    SELECT sc.ID,sc.Pos,sc.CharOnPos
          ,A.QuoteIsStillOpen
          ,CONCAT(GrowingString,CASE WHEN sc.CharOnPos=N',' AND A.QuoteIsStillOpen=0 THEN N'$%&' ELSE sc.CharOnPos END)
    FROM singleChars sc
    INNER JOIN recCTE r ON sc.ID = r.ID AND sc.Pos=r.Pos 1 
    CROSS APPLY(VALUES(CASE WHEN sc.CharOnPos='"' THEN CASE WHEN r.QuoteIsOpen=1 THEN 0 ELSE 1 END ELSE r.QuoteIsOpen END )) A(QuoteIsStillOpen)
)

--this CTE performs a trick with TOP 1 WITH TIES together with ORDER BY a partitioned ROW_NUMBER()
--The result will include the final string of the recursion by ID

,newlySeparated AS
(
    SELECT TOP 1 WITH TIES * FROM recCTE
    ORDER BY ROW_NUMBER() OVER(PARTITION BY ID ORDER BY Pos DESC)
)

--The final SELECT uses a trick to split strings position- and type-safe

SELECT A.*
FROM newlySeparated ns
CROSS APPLY OPENJSON(CONCAT(N'[["',REPLACE(REPLACE(ns.GrowingString,'"',''),'$%&','","'),N'"]]'))
WITH(Company    NVARCHAR(100)        '$[0]'
    ,Code1      NVARCHAR(100)        '$[1]'
    ,Date1      DATE                 '$[2]'
    ,Date2      DATE                 '$[3]'
    ,Decimal1   NVARCHAR(100)        '$[4]' --Using a numbers type might work here, this depends on your machine
    ,Decimal2   NVARCHAR(100)        '$[5]'
    ,Decimal3   NVARCHAR(100)        '$[6]'
    ,Decimal4   NVARCHAR(100)        '$[7]'
    ,Decimal5   NVARCHAR(100)        '$[8]'
    ,Decimal6   NVARCHAR(100)        '$[9]'
    ,Decimal7   NVARCHAR(100)        '$[10]') A
OPTION(MAXRECURSION 0);

The result

 ---------------- ------------ ------------ ------------ -------------- ------------ ----------- ------------ ------ ------- -------------- 
| Pacey LLC.     | 213830ZZ   | 2017-01-11 | 2017-01-11 | 297,311.74   | 2,371.40   | 0.00      | 1,325.18   | 0.00 | 42.22 | 123,986.56   |
 ---------------- ------------ ------------ ------------ -------------- ------------ ----------- ------------ ------ ------- -------------- 
| Mike The Miker | 9814140VCD | 2018-01-12 | 2018-01-12 | 3,917,751.99 | 419,743.54 | 36,642.66 | 344,090.43 | 0.00 | 10.00 | 2,434,671.06 |
 ---------------- ------------ ------------ ------------ -------------- ------------ ----------- ------------ ------ ------- -------------- 
  • Related