Home > Software design >  Find numeric record number in string
Find numeric record number in string

Time:06-16

I've been using this site for longer than I can remember, but I've never had to ask a question before - I've always been able to find what I need - but this one could use some SQL expertise for the best approach since it's kind of a worst case.

I have a text comment field that gets a lot of stuff put in it in no consistent format - it'll include a patient name, hopefully a record ID (which could come from 2 different tables differentiated by the range), spaces, slashes, phone numbers, the initials of whoever entered it, etc. I'll scrub some examples... I need to find the numeric RecordID in here (it references another disconnected system) and ignore the rest. There's a remote possibility it could have 2 record numbers (could be referenced in two different tables) - if so, I need to pick the more important one.

Example Data (scrubbed):
Sanchez, Raul - POS - OS - 489-849-7894 - AB - Conf >> Nothing of value here/return NULL
Smith, Nancy RL 1458968-rl >> Return 1458968
Sudhu, Mandeep intake # 78596/rl >> Return 78596
1887569 Smith, Jonathan-ESK >> Return 1021015
Montana, Joe/1451252rm >> Return 1451252
Johnson, Fredy/1784569/78563/RG >> Return 1784569

My first thought was to do a bunch of nested Replace() functions to turn anything that's not a character into a standard delimiter, parse these all into separate records with split_string(), then only pick the largest number - and that would work in every instance except #5 above... but is that the best way, or is there something better? SQL Version is 2017 for reference to available functions.

This will be wrapped into a scalar valued function.

Thanks!

CodePudding user response:

Here is a way to extract all numbers.

t1 identify all non-digit characters in each text.
t2 replaces all non-digit characters with spaces.
t3 assumes that numbers might be separated by an unknown number of spaces, and replaces every sequence of spaces with a single space.

Enjoy :-)

with
t1 
as 
(
    select  record_id
           ,txt
           ,translate(txt, '1234567890', space(10)) as non_digits 
    from    t
),
t2 
as 
(  
    select  record_id
           ,trim(translate(txt, non_digits, space(len(non_digits)))) as spaced_numberes 
    from    t1
),
t3 
as 
(
    select  record_id
           ,replace(replace(replace(spaced_numberes, ' ', ' *'), '* ', ''), '*', '') as space_delimited_numbers 
           
    from    t2
)
select  t3.record_id
       ,ss.value  
       
from    t3 
        cross apply string_split(space_delimited_numbers, ' ') as ss
record_id value
1 489
1 849
1 7894
2 1458968
3 78596
4 1887569
5 1451252
6 1784569
6 78563
7 336464564
7 77667
7 45676876876

Fiddle

CodePudding user response:

One option could be to extract the first number using some MS SQL pattern-matching:

  • detect the position of the first digit for the first number, using PATINDEX
  • detect the position of the last digit for the first number, using PATINDEX
  • extract the number using SUBSTRING
SELECT string,
       SUBSTRING(string,
                 PATINDEX('%[0-9]%'      , string), 
                 1 PATINDEX('%[0-9][^0-9]%', string)-PATINDEX('%[0-9]%',string))
FROM tab

Once you have this, you could do some cross checks on these numbers.

Check the demo here.

Note: this solution assumes that your interesting value, if any, is found as first number, condition tweakable with a better pattern matching given the domain of your values.

Does it help for your case?

CodePudding user response:

I'm too new to upvote - that's funny! Not sure how to make sure I give proper credit and still leave this open to further critique so let me know if I need to do something...
Anyway I took David's suggestion and turned it into this:

DECLARE @String as varchar(255)
SET @String = 'Montana, Joe/14510252rm/8456,71253|'
;
WITH cte1 AS 
(
    SELECT * FROM (SELECT @String AS String) T
)
, cte2 AS
(
    SELECT String, translate(string, '1234567890', space(10)) AS non_digits
    FROM cte1
)
, cte3 AS 
(  
    SELECT  string, non_digits
           ,trim(translate(String, non_digits, space(len(non_digits)))) as spaced_numbers 
    FROM    cte2
)
, cte4 AS
(
    SELECT string, replace(replace(replace(spaced_numbers, ' ', ' *'), '* ', ''), '*', '') as space_delimited_numbers 
    FROM cte3
)
, cte5 AS
(
    SELECT string , cast(ss.value as int) result
    FROM cte4
    CROSS APPLY string_split(space_delimited_numbers,' ') as ss
)
SELECT TOP 1 result
FROM cte5
WHERE (result between 1000000 AND 2000000)  OR (result between 70000 AND 199999) ORDER BY result DESC
  • Related