I'm struggling with following topic.
I have to split a character string into subparts based on defined length. Additional handicap is that when splitting I must consider last occurence of comma which is a divisor of values. This is a sample input string:
4065431,4025075,4045490,4061895,4064846,4069323,3761852,3963407
Let's say, that I want to make a split so substrings have no more than 26 chars. What I expect to get is the following:
4065431,4025075,4045490
4061895,4064846,4069323
3761852,3963407
Based on topics already found I created following function:
ALTER FUNCTION [dbo].[fnRSplitString_byLen] (
@stringToSplit nvarchar(max),
@splitLength int
)
returns
@returnList Table ([Name][nvarchar](max) )
as
BEGIN
DECLARE @NAME NVARCHAR (max)
declare @pos int
while LEN(@stringToSplit) > 0
BEGIN
select @pos = len(REVERSE(left(reverse(@stringToSplit),@splitLength-CHARINDEX(',',reverse(@stringToSplit)))))
select @name = SUBSTRING(@stringToSplit, 1, @pos-1)
insert into @returnList
select @name
select @stringToSplit = SUBSTRING(@stringToSplit, @pos 1, len(@stringToSplit) -@pos)
END
insert into @returnList
select @stringToSplit
RETURN
END
However I have an issue with that - for unknown reason (I think it's dependent on @splitLength value and the total length of original string) function sometimes stops working as intended and I get some random problems like:
- split is done in random places causing that substrings start or end with comma;
- when above happens next substring doesn't start with next char index
This is how it usually looks when issue happens:
4065431,4025075,4045490
4061895,4064846,
9323,3761852,3963407
Would you be so kind and guide me how to fix this issue?
Please note that unfortunately function must operate on older version of SQL ( 2014 (SP3-CU4) (KB4500181) - 12.0.6329.1 ).
CodePudding user response:
Please try the following solution.
It is using XML and XQuery power of SQL Server.
Notable points:
- Converting input string of tokens into XML data type for tokenization.
- Calculating series values for a grouping based on the running total length of tokens, and using a mod operator along the way.
- Grouping series to get _start and _end token positions in the XML.
- Using XQuery
.query()
and.value()
methods to get the final result.
Result of the first CTE
---------------- ----- --------- ----------------------------------------------------------------- ------------------- ----------- --------
| token_xml | pos | str_len | str_running_total | len_running_total | mod_chunk | series |
---------------- ----- --------- ----------------------------------------------------------------- ------------------- ----------- --------
| <r>4065431</r> | 1 | 7 | 4065431 | 7 | 7 | 0 |
| <r>4025075</r> | 2 | 7 | 4065431 4025075 | 15 | 15 | 0 |
| <r>4045490</r> | 3 | 7 | 4065431 4025075 4045490 | 23 | 23 | 0 |
| <r>4061895</r> | 4 | 7 | 4065431 4025075 4045490 4061895 | 31 | 5 | 26 |
| <r>4064846</r> | 5 | 7 | 4065431 4025075 4045490 4061895 4064846 | 39 | 13 | 26 |
| <r>4069323</r> | 6 | 7 | 4065431 4025075 4045490 4061895 4064846 4069323 | 47 | 21 | 26 |
| <r>3761852</r> | 7 | 7 | 4065431 4025075 4045490 4061895 4064846 4069323 3761852 | 55 | 3 | 52 |
| <r>3963407</r> | 8 | 7 | 4065431 4025075 4045490 4061895 4064846 4069323 3761852 3963407 | 63 | 11 | 52 |
---------------- ----- --------- ----------------------------------------------------------------- ------------------- ----------- --------
SQL
DECLARE @string VARCHAR(MAX) = '4065431,4025075,4045490,4061895,4064846,4069323,3761852,3963407'
, @separator CHAR(1) = ','
, @tokens_max_len INT = 26;
DECLARE @xmldata XML = TRY_CAST('<root><r><![CDATA['
REPLACE(@string, @separator, ']]></r><r><![CDATA[')
']]></r></root>' AS XML);
SELECT @xmldata;
;WITH rs AS
(
SELECT c.query('.') AS token_xml
, pos, str_len, str_running_total, len_running_total
, mod_chunk = len_running_total % @tokens_max_len
, series = len_running_total - (len_running_total % @tokens_max_len)
FROM @xmldata.nodes('/root/r') AS t(c)
CROSS APPLY (SELECT t.c.value('let $n := . return count(/root/*[. << $n[1]]) 1','INT') AS pos
) AS seq
CROSS APPLY (SELECT t.c.value('string-length((/root/r[sql:column("pos")]/text())[1])','INT') AS str_len
) AS x
CROSS APPLY (SELECT t.c.query('data(/root/r[position() le sql:column("pos")]/text())') AS str_running_total
) AS y
CROSS APPLY (SELECT LEN(t.c.query('data(/root/r[position() le sql:column("pos")]/text())')
.value('.', 'VARCHAR(MAX)')) AS len_running_total
) AS z
), rs2 AS
(
SELECT series
, _start = MIN(pos), _end = MAX(pos)
FROM rs
GROUP BY series
)
SELECT *
, result = REPLACE(@xmldata.query('data(/root/r[position() ge sql:column("_start")
and position() le sql:column("_end")]/text())').value('.', 'VARCHAR(100)')
, SPACE(1), @separator)
FROM rs2;
Output
-------- -------- ------ -------------------------
| series | _start | _end | result |
-------- -------- ------ -------------------------
| 0 | 1 | 3 | 4065431,4025075,4045490 |
| 26 | 4 | 6 | 4061895,4064846,4069323 |
| 52 | 7 | 8 | 3761852,3963407 |
-------- -------- ------ -------------------------
CodePudding user response:
The main issue is with the @Pos
calculation, which is locating the last comma in the string, but is then using that to calculate a length to extract from the start of the string.
The following expression ca be used to locate the last comma in the first @splitLength 1 characters of a string,
SELECT @Pos = @splitLength 2 - NULLIF(CHARINDEX(',',reverse(SUBSTRING(D.StringToSplit, 1, @splitLength 1))), 0)
However, there are several potential edge cases to deal with:
- Remaining string is already shorter than @splitLength.
- First comma appears following more than @splitLength leading characters.
- The string contains no commas.
The following should be able to handle all of these cases:
select @pos = CASE WHEN LEN(@stringToSplit) <= @splitLength
THEN LEN(@stringToSplit) 1 -- Take it all
ELSE COALESCE(
@splitLength 2 - NULLIF(CHARINDEX(',',reverse(SUBSTRING(@stringToSplit, 1, @splitLength 1))), 0), -- Normal case
NULLIF(CHARINDEX(',', @stringToSplit), 0), -- Overlength
LEN(@stringToSplit) 1 -- No comma, might also be overlength
)
END
For potentially overlength values, the value will be left intact on a separate result record. This case can be handled differently, such as by forcing a break where there is no delimiter. In that case, the remaining string expression would need adjustment since there is no comma to skip.
The recalculation of remaining text at the bottom of the loop also needs a tweak to handle the terminal case. (This also uses STUFF
instead of SUBSTRING
to remove the first @pos characters.)
select @stringToSplit =
CASE WHEN @Pos <= LEN(@stringToSplit)
THEN ''
ELSE STUFF(@stringToSplit, 1, @pos, '')
END
Lastly, it appear that the post loop insert will always insert an empty string, so appears unnecessary.
The updated function would be something like:
CREATE FUNCTION [dbo].[fnRSplitString_byLen] (
@stringToSplit nvarchar(max),
@splitLength int
)
returns
@returnList Table ([Name][nvarchar](max) )
as
BEGIN
DECLARE @NAME NVARCHAR (max)
declare @pos int
while LEN(@stringToSplit) > 0
BEGIN
select @pos = CASE WHEN LEN(@stringToSplit) <= @splitLength
THEN LEN(@stringToSplit) 1 -- Take it all
ELSE COALESCE(
@splitLength 2 - NULLIF(CHARINDEX(',',reverse(SUBSTRING(@stringToSplit, 1, @splitLength 1))), 0), -- Normal case
NULLIF(CHARINDEX(',', @stringToSplit), 0), -- Overlength
LEN(@stringToSplit) 1 -- No comma, might also be overlength
)
END
select @name = SUBSTRING(@stringToSplit, 1, @pos-1)
insert into @returnList
select @name
select @stringToSplit =
CASE WHEN @Pos > LEN(@stringToSplit)
THEN ''
ELSE STUFF(@stringToSplit, 1, @pos, '')
END
END
--insert into @returnList
--select @stringToSplit
RETURN
END
There may be other areas for improvement, but I believe this should handle the general case.
See this db<>fiddle for a working demo including a variety of test data.
The XML-based solution by Yitzhak Khabinsky may be the better approach, but this is offered up as one that closely follows the original logic.