Home > database >  Function that splits string into substrings by len and char
Function that splits string into substrings by len and char

Time:08-05

I'm struggling with following topic.

I have to split a character string into subparts based on defined length. Additional handicap is that when splitting I must consider last occurence of comma which is a divisor of values. This is a sample input string:

4065431,4025075,4045490,4061895,4064846,4069323,3761852,3963407

Let's say, that I want to make a split so substrings have no more than 26 chars. What I expect to get is the following:

4065431,4025075,4045490

4061895,4064846,4069323

3761852,3963407

Based on topics already found I created following function:

ALTER FUNCTION [dbo].[fnRSplitString_byLen] (

        @stringToSplit nvarchar(max),
        @splitLength int
        )
returns
@returnList Table ([Name][nvarchar](max) )
as
BEGIN
    DECLARE @NAME NVARCHAR (max)
    declare @pos int
        while LEN(@stringToSplit) > 0
            BEGIN
            select @pos = len(REVERSE(left(reverse(@stringToSplit),@splitLength-CHARINDEX(',',reverse(@stringToSplit)))))
            select @name = SUBSTRING(@stringToSplit, 1, @pos-1)
            insert into @returnList
            select @name

            select @stringToSplit = SUBSTRING(@stringToSplit, @pos 1, len(@stringToSplit) -@pos)

            END
            insert into @returnList
            select @stringToSplit
    RETURN
END

However I have an issue with that - for unknown reason (I think it's dependent on @splitLength value and the total length of original string) function sometimes stops working as intended and I get some random problems like:

  1. split is done in random places causing that substrings start or end with comma;
  2. when above happens next substring doesn't start with next char index

This is how it usually looks when issue happens:

4065431,4025075,4045490

4061895,4064846,

9323,3761852,3963407

Would you be so kind and guide me how to fix this issue?

Please note that unfortunately function must operate on older version of SQL ( 2014 (SP3-CU4) (KB4500181) - 12.0.6329.1 ).

CodePudding user response:

Please try the following solution.

It is using XML and XQuery power of SQL Server.

Notable points:

  • Converting input string of tokens into XML data type for tokenization.
  • Calculating series values for a grouping based on the running total length of tokens, and using a mod operator along the way.
  • Grouping series to get _start and _end token positions in the XML.
  • Using XQuery .query() and .value() methods to get the final result.

Result of the first CTE

 ---------------- ----- --------- ----------------------------------------------------------------- ------------------- ----------- -------- 
|   token_xml    | pos | str_len |                        str_running_total                        | len_running_total | mod_chunk | series |
 ---------------- ----- --------- ----------------------------------------------------------------- ------------------- ----------- -------- 
| <r>4065431</r> |   1 |       7 |                                                         4065431 |                 7 |         7 |      0 |
| <r>4025075</r> |   2 |       7 |                                                 4065431 4025075 |                15 |        15 |      0 |
| <r>4045490</r> |   3 |       7 |                                         4065431 4025075 4045490 |                23 |        23 |      0 |
| <r>4061895</r> |   4 |       7 |                                 4065431 4025075 4045490 4061895 |                31 |         5 |     26 |
| <r>4064846</r> |   5 |       7 |                         4065431 4025075 4045490 4061895 4064846 |                39 |        13 |     26 |
| <r>4069323</r> |   6 |       7 |                 4065431 4025075 4045490 4061895 4064846 4069323 |                47 |        21 |     26 |
| <r>3761852</r> |   7 |       7 |         4065431 4025075 4045490 4061895 4064846 4069323 3761852 |                55 |         3 |     52 |
| <r>3963407</r> |   8 |       7 | 4065431 4025075 4045490 4061895 4064846 4069323 3761852 3963407 |                63 |        11 |     52 |
 ---------------- ----- --------- ----------------------------------------------------------------- ------------------- ----------- -------- 

SQL

DECLARE @string VARCHAR(MAX) = '4065431,4025075,4045490,4061895,4064846,4069323,3761852,3963407'
    , @separator CHAR(1) = ','
    , @tokens_max_len INT = 26;

DECLARE @xmldata XML = TRY_CAST('<root><r><![CDATA['   
      REPLACE(@string, @separator, ']]></r><r><![CDATA[')   
      ']]></r></root>' AS XML);

SELECT @xmldata;

;WITH rs AS
(
    SELECT c.query('.') AS token_xml
        , pos, str_len, str_running_total, len_running_total
        , mod_chunk = len_running_total % @tokens_max_len
        , series = len_running_total - (len_running_total % @tokens_max_len)
    FROM @xmldata.nodes('/root/r') AS t(c)
       CROSS APPLY (SELECT t.c.value('let $n := . return count(/root/*[. << $n[1]])   1','INT') AS pos
             ) AS seq
       CROSS APPLY (SELECT t.c.value('string-length((/root/r[sql:column("pos")]/text())[1])','INT') AS str_len
             ) AS x
       CROSS APPLY (SELECT t.c.query('data(/root/r[position() le sql:column("pos")]/text())') AS str_running_total
             ) AS y
       CROSS APPLY (SELECT LEN(t.c.query('data(/root/r[position() le sql:column("pos")]/text())')
            .value('.', 'VARCHAR(MAX)')) AS len_running_total
             ) AS z
), rs2 AS
(
    SELECT series
        , _start = MIN(pos), _end = MAX(pos)
    FROM rs
    GROUP BY series
)
SELECT * 
    , result = REPLACE(@xmldata.query('data(/root/r[position() ge sql:column("_start") 
                and position() le sql:column("_end")]/text())').value('.', 'VARCHAR(100)')
                , SPACE(1), @separator)
FROM rs2;

Output

 -------- -------- ------ ------------------------- 
| series | _start | _end |         result          |
 -------- -------- ------ ------------------------- 
|      0 |      1 |    3 | 4065431,4025075,4045490 |
|     26 |      4 |    6 | 4061895,4064846,4069323 |
|     52 |      7 |    8 |         3761852,3963407 |
 -------- -------- ------ ------------------------- 

CodePudding user response:

The main issue is with the @Pos calculation, which is locating the last comma in the string, but is then using that to calculate a length to extract from the start of the string.

The following expression ca be used to locate the last comma in the first @splitLength 1 characters of a string,

SELECT @Pos = @splitLength   2 - NULLIF(CHARINDEX(',',reverse(SUBSTRING(D.StringToSplit, 1, @splitLength 1))), 0)

However, there are several potential edge cases to deal with:

  1. Remaining string is already shorter than @splitLength.
  2. First comma appears following more than @splitLength leading characters.
  3. The string contains no commas.

The following should be able to handle all of these cases:

        select @pos = CASE WHEN LEN(@stringToSplit) <= @splitLength
            THEN LEN(@stringToSplit)   1  -- Take it all
            ELSE COALESCE(
                @splitLength   2 - NULLIF(CHARINDEX(',',reverse(SUBSTRING(@stringToSplit, 1, @splitLength 1))), 0),  -- Normal case
                NULLIF(CHARINDEX(',', @stringToSplit), 0), -- Overlength
                LEN(@stringToSplit)   1  -- No comma, might also be overlength
            )
        END

For potentially overlength values, the value will be left intact on a separate result record. This case can be handled differently, such as by forcing a break where there is no delimiter. In that case, the remaining string expression would need adjustment since there is no comma to skip.

The recalculation of remaining text at the bottom of the loop also needs a tweak to handle the terminal case. (This also uses STUFF instead of SUBSTRING to remove the first @pos characters.)

        select @stringToSplit =
            CASE WHEN @Pos <= LEN(@stringToSplit)
            THEN ''
            ELSE STUFF(@stringToSplit, 1, @pos, '')
            END

Lastly, it appear that the post loop insert will always insert an empty string, so appears unnecessary.

The updated function would be something like:

CREATE FUNCTION [dbo].[fnRSplitString_byLen] (

        @stringToSplit nvarchar(max),
        @splitLength int
        )
returns
@returnList Table ([Name][nvarchar](max) )
as
BEGIN
    DECLARE @NAME NVARCHAR (max)
    declare @pos int
    while LEN(@stringToSplit) > 0
    BEGIN
        select @pos = CASE WHEN LEN(@stringToSplit) <= @splitLength
            THEN LEN(@stringToSplit)   1  -- Take it all
            ELSE COALESCE(
                @splitLength   2 - NULLIF(CHARINDEX(',',reverse(SUBSTRING(@stringToSplit, 1, @splitLength 1))), 0),  -- Normal case
                NULLIF(CHARINDEX(',', @stringToSplit), 0), -- Overlength
                LEN(@stringToSplit)   1  -- No comma, might also be overlength
            )
        END

        select @name = SUBSTRING(@stringToSplit, 1, @pos-1)
        insert into @returnList
        select @name

        select @stringToSplit =
            CASE WHEN @Pos > LEN(@stringToSplit)
            THEN ''
            ELSE STUFF(@stringToSplit, 1, @pos, '')
            END

    END
    --insert into @returnList
    --select @stringToSplit
    RETURN
END

There may be other areas for improvement, but I believe this should handle the general case.

See this db<>fiddle for a working demo including a variety of test data.

The XML-based solution by Yitzhak Khabinsky may be the better approach, but this is offered up as one that closely follows the original logic.

  • Related