I've been using this site for longer than I can remember, but I've never had to ask a question before - I've always been able to find what I need - but this one could use some SQL expertise for the best approach since it's kind of a worst case.
I have a text comment field that gets a lot of stuff put in it in no consistent format - it'll include a patient name, hopefully a record ID (which could come from 2 different tables differentiated by the range), spaces, slashes, phone numbers, the initials of whoever entered it, etc. I'll scrub some examples... I need to find the numeric RecordID in here (it references another disconnected system) and ignore the rest. There's a remote possibility it could have 2 record numbers (could be referenced in two different tables) - if so, I need to pick the more important one.
Example Data (scrubbed):
Sanchez, Raul - POS - OS - 489-849-7894 - AB - Conf >> Nothing of value here/return NULL
Smith, Nancy RL 1458968-rl >> Return 1458968
Sudhu, Mandeep intake # 78596/rl >> Return 78596
1887569 Smith, Jonathan-ESK >> Return 1021015
Montana, Joe/1451252rm >> Return 1451252
Johnson, Fredy/1784569/78563/RG >> Return 1784569
My first thought was to do a bunch of nested Replace() functions to turn anything that's not a character into a standard delimiter, parse these all into separate records with split_string(), then only pick the largest number - and that would work in every instance except #5 above... but is that the best way, or is there something better? SQL Version is 2017 for reference to available functions.
This will be wrapped into a scalar valued function.
Thanks!
CodePudding user response:
Here is a way to extract all numbers.
t1 identify all non-digit characters in each text.
t2 replaces all non-digit characters with spaces.
t3 assumes that numbers might be separated by an unknown number of spaces, and replaces every sequence of spaces with a single space.
Enjoy :-)
with
t1
as
(
select record_id
,txt
,translate(txt, '1234567890', space(10)) as non_digits
from t
),
t2
as
(
select record_id
,trim(translate(txt, non_digits, space(len(non_digits)))) as spaced_numberes
from t1
),
t3
as
(
select record_id
,replace(replace(replace(spaced_numberes, ' ', ' *'), '* ', ''), '*', '') as space_delimited_numbers
from t2
)
select t3.record_id
,ss.value
from t3
cross apply string_split(space_delimited_numbers, ' ') as ss
record_id | value |
---|---|
1 | 489 |
1 | 849 |
1 | 7894 |
2 | 1458968 |
3 | 78596 |
4 | 1887569 |
5 | 1451252 |
6 | 1784569 |
6 | 78563 |
7 | 336464564 |
7 | 77667 |
7 | 45676876876 |
CodePudding user response:
One option could be to extract the first number using some MS SQL pattern-matching:
- detect the position of the first digit for the first number, using
PATINDEX
- detect the position of the last digit for the first number, using
PATINDEX
- extract the number using
SUBSTRING
SELECT string,
SUBSTRING(string,
PATINDEX('%[0-9]%' , string),
1 PATINDEX('%[0-9][^0-9]%', string)-PATINDEX('%[0-9]%',string))
FROM tab
Once you have this, you could do some cross checks on these numbers.
Check the demo here.
Note: this solution assumes that your interesting value, if any, is found as first number, condition tweakable with a better pattern matching given the domain of your values.
Does it help for your case?
CodePudding user response:
I'm too new to upvote - that's funny! Not sure how to make sure I give proper credit and still leave this open to further critique so let me know if I need to do something...
Anyway I took David's suggestion and turned it into this:
DECLARE @String as varchar(255)
SET @String = 'Montana, Joe/14510252rm/8456,71253|'
;
WITH cte1 AS
(
SELECT * FROM (SELECT @String AS String) T
)
, cte2 AS
(
SELECT String, translate(string, '1234567890', space(10)) AS non_digits
FROM cte1
)
, cte3 AS
(
SELECT string, non_digits
,trim(translate(String, non_digits, space(len(non_digits)))) as spaced_numbers
FROM cte2
)
, cte4 AS
(
SELECT string, replace(replace(replace(spaced_numbers, ' ', ' *'), '* ', ''), '*', '') as space_delimited_numbers
FROM cte3
)
, cte5 AS
(
SELECT string , cast(ss.value as int) result
FROM cte4
CROSS APPLY string_split(space_delimited_numbers,' ') as ss
)
SELECT TOP 1 result
FROM cte5
WHERE (result between 1000000 AND 2000000) OR (result between 70000 AND 199999) ORDER BY result DESC