Home > database >  How do I get a substring given a pattern?
How do I get a substring given a pattern?

Time:09-23

For SQL Server, let's say I have the following string:

FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris)

I would like to get both the new first name and last name. SUBSTRING() would not be the proper way since there's no pattern on how many chars the new name has.

Thanks in advance.

CodePudding user response:

It would be best if you normalize your design.

However, with your current dataset following this pattern, you may use the following approach that uses CHARINDEX in a recursive CTE and SUBSTRING to extract the desired names.

WITH sample_data as (
   select 'FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris)' as col
),
name_indexes AS (
     SELECT
         col, 
         CHARINDEX('(',col) as index_start ,
         CHARINDEX(')',col) as index_end,
         0 as iteration
     FROM
         sample_data
     UNION ALL
     SELECT
         col, 
         CHARINDEX('(',col,index_start   1) as index_start ,
         CHARINDEX(')',col,index_end   1) as index_end,
         iteration 1 as iteration
     FROM
         name_indexes
     WHERE
         iteration < 3
),
name_pieces AS (
    SELECT
         *,
         SUBSTRING(col,index_start 1,index_end - index_start-1) as name
    FROM
         name_indexes
),
name_changes AS (
    SELECT
        col,
        MAX(CASE WHEN iteration=0 THEN name END) as original_first_name,
        MAX(CASE WHEN iteration=1 THEN name END) as new_first_name,
        MAX(CASE WHEN iteration=2 THEN name END) as original_last_name,
        MAX(CASE WHEN iteration=3 THEN name END) as new_last_name
    FROM
        name_pieces
    GROUP BY 
        col
)
SELECT *
FROM name_changes

col original_first_name new_first_name original_last_name new_last_name
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) Joe Sally Doe Harris

Working demo db<>fiddle here

The first CTE sample_data with column col is only for demonstration purposes as I try to replicate your table. You may omit this CTE and replace sample_data with your table name and col with your column storing the string to extract the changes in the query below:

WITH name_indexes AS (
     SELECT
         col, 
         CHARINDEX('(',col) as index_start ,
         CHARINDEX(')',col) as index_end,
         0 as iteration
     FROM
         sample_data
     UNION ALL
     SELECT
         col, 
         CHARINDEX('(',col,index_start   1) as index_start ,
         CHARINDEX(')',col,index_end   1) as index_end,
         iteration 1 as iteration
     FROM
         name_indexes
     WHERE
         iteration < 3
),
name_pieces AS (
    SELECT
         *,
         SUBSTRING(col,index_start 1,index_end - index_start-1) as name
    FROM
         name_indexes
),
name_changes AS (
    SELECT
        col,
        MAX(CASE WHEN iteration=0 THEN name END) as original_first_name,
        MAX(CASE WHEN iteration=1 THEN name END) as new_first_name,
        MAX(CASE WHEN iteration=2 THEN name END) as original_last_name,
        MAX(CASE WHEN iteration=3 THEN name END) as new_last_name
    FROM
        name_pieces
    GROUP BY 
        col
)
SELECT *
FROM name_changes

The CTE name_indexes is a recursive CTE aiming to extract the indexes of ( and ) to identify the start and end of each name change following the pattern in the string. iteration is used to track the number of changes and this recursive CTE continues while there are less than 3 iterations. The output looks like for this sample string:

col index_start index_end iteration
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 24 28 0
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 33 39 1
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 64 68 2
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 73 80 3

The CTE name_pieces uses the SUBSTRING function to extract the names from the string based on the indexes. The output of this CTE is

col index_start index_end iteration name
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 24 28 0 Joe
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 33 39 1 Sally
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 64 68 2 Doe
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) 73 80 3 Harris

Finally, the CTE name_changes uses a group by and max function with case expressions to consolidate these changes into 1 row per string in the original dataset. You may choose your desired columns from this final projection.

col original_first_name new_first_name original_last_name new_last_name
FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris) Joe Sally Doe Harris

Let me know if this works for you.

CodePudding user response:

Others have already noted that if you are storing that as data in some column in some table then that's not a great schema. But maybe you just have a varchar variable you're trying to parse... your question doesn't actually specify that this data is in some column in some table.

Here is a somewhat "imperative" solution using cross apply. Cross apply is nice in situations like this, because it lets you compose the logic from smaller parts in a way that is easy to read. Is it the "best" way? That depends upon what you mean by best. I would argue it is very easy to understand and therefore maintain.

declare @str varchar(1000) 
   = 'FirstName changed from (Joe) to (Sally), LastName changed from (Doe) to (Harris)';

select      originalString =  t.string,
            newFirstName   =  substring
                              (
                                 t.string, 
                                 newFirstStart.pos, 
                                 newFirstEnd.pos - newFirstStart.pos
                              ),
            newLastName    =  substring
                              (
                                 t.string, 
                                 newLastStart.pos, 
                                 len(t.string) - newLastStart.pos
                              )
from        (values (@str)) t(string)
cross apply (select charindex('to (', string, 0)   4)                newFirstStart(pos)
cross apply (select charindex(')', string, newFirstStart.pos))       newFirstEnd(pos)
cross apply (select charindex('to (', string, newFirstEnd.pos)   4)  newLastStart(pos);

The question is really how robust you need your solution to be. This answer assumes that your input string always has the parentheses around the names, and the rest of the string is invariant. If that's not exactly true, similar solutions may exist using more convoluted charindex constructions, or patindex, or even splitting the string into words (using one of the many available TSQL string split implementations, or the built in string_split in SQL Server 2016 .

  • Related