String_Split function not splitting all strings from DB column-CodePudding

I am tasked with extracting unique values from record, from a previously used 'Notes' Nvarchar(max) field.

CREATE TABLE LWArchive (
Number int,
Notes varchar(MAX))

INSERT INTO LWArchive(Number,Notes)
VALUES(1,'OGC 503360 / 503361 M303834 M303838 M303835 M303836 M303837 M303839 M303840 M303841 303842'),
(2,'OGC = Q.6773'),
(3,'DEED REF = 0001'),
(4,'OGC 50336 / 50336 M30383 M03038 M30383 M30383 M30383 M00303  M303840 M303841 M303842')


select 
a.Number,
s.value
from LWArchive a
    CROSS APPLY STRING_SPLIT(a.notes, N' ') s

The notes field contains multiple barcodes, which are all in a space delimited format.
E.G. 'M000001 M000002 M000003'

I have attempted to split the 'Notes' column using the query below:

select
a.ID,
s.value
from Archive a
CROSS APPLY STRING_SPLIT(a.notes, N' ') s

Unfortuntely, while some have worked correctly, the split results are inconsistant. With a number of the barcodes remaining unaffected within the space delimited results. Example Below:

When the string isn't sourced from a column & entered manually into the function. It functions as expected. Example Below:

I am using SQL server 2019, however as this is legacy data. I've had to convert the database from compatibility 110 to 130 in order to access the String_Split function.

CodePudding user response：

As suggested in the comments you are likely finding an instance where the character is not a white space, but some other character.

DECLARE @string NVARCHAR(MAX) = 'M000001 M000002 M000003'

SELECT *, @string
  FROM STRING_SPLIT(@string,' ')

value   (No column name)
------------------------
M000001 M000001 M000002 M000003
M000002 M000001 M000002 M000003
M000003 M000001 M000002 M000003

This splits as expected. We end up with 3 rows each with one of the delimited values. If we change one of those white spaces for a tab:

SET @string = 'M000001 M000002  M000003'

SELECT *, @string
  FROM STRING_SPLIT(@string,' ')

value           (No column name)
--------------------------------
M000001         M000001 M000002 M000003
M000002 M000003 M000001 M000002 M000003

We can reproduce what you're seeing.

When I've faced similar problems I've found this technique to be useful:

;WITH cte AS (
SELECT @string AS completeString, ASCII(LEFT(@string, 1)) AS asciiCode, LEFT(@string, 1) AS this, RIGHT(@string, LEN(@string)-1) AS that, 1 AS charIndex
UNION ALL
SELECT completeString, ASCII(LEFT(that,1)), LEFT(that,1), RIGHT(that,LEN(that)-1), charIndex 1
  FROM cte
 WHERE that <> ''
)

SELECT *
  FROM cte

This produces a result set for each character/glyph in the string and it's ASCII code.

completeString          asciiCode   this    that                        charIndex
---------------------------------------------------------------------------------
M000001 M000002 M000003 77          M       000001 M000002      M000003 1
M000001 M000002 M000003 48          0       00001 M000002       M000003 2
M000001 M000002 M000003 48          0       0001 M000002        M000003 3
M000001 M000002 M000003 48          0       001 M000002         M000003 4
M000001 M000002 M000003 48          0       01 M000002          M000003 5
M000001 M000002 M000003 48          0       1 M000002           M000003 6
M000001 M000002 M000003 49          1        M000002            M000003 7
M000001 M000002 M000003 32                  M000002             M000003 8
M000001 M000002 M000003 77          M       000002              M000003 9
M000001 M000002 M000003 48          0       00002               M000003 10
M000001 M000002 M000003 48          0       0002                M000003 11
M000001 M000002 M000003 48          0       002                 M000003 12
M000001 M000002 M000003 48          0       02                  M000003 13
M000001 M000002 M000003 48          0       2                   M000003 14
M000001 M000002 M000003 50          2                           M000003 15
M000001 M000002 M000003 9**                                     M000003 16**
M000001 M000002 M000003 77          M                           000003  17
M000001 M000002 M000003 48          0                           00003   18
M000001 M000002 M000003 48          0                           0003    19
M000001 M000002 M000003 48          0                           003     20
M000001 M000002 M000003 48          0                           03      21
M000001 M000002 M000003 48          0                           3       22
M000001 M000002 M000003 51          3                                   23

Now we can clearly see that character 16 is actually a tab (CHAR(9)) not a white space (CHAR(32)).