I am trying to extract the date part (so the May 2020) from this column and put in the form mm/yyyy in a new column for another table. But because this column can be free text, it can look different.
1.The last date was when I was 5 years old in May 2020.
2.In 05/1999, I was there.
Expected result for my new field:
05/2020
05/1999
I honestly just want to be able to cover ground as much as I can, I'm not sure how to extract the year from the text which is the biggest problem though because sometimes there can be numbers before and after the year like in 1.
I am using SQL Server Management Studio for this BTW.
CodePudding user response:
Here is a working but silly solution building upon Stu's idea. The question is not really answerable in any guaranteed way anyway, so this was just an exercise for fun.
I build a table of months that I will recognise, and a utility table of integers up to 1024.
In the cleansed
CTE I use a strip out any characters that aren't either letters, numbers or a space character by checking each character individually with the help of the ints
CTE, and then put the clean letters back together with string_agg
.
Since I know there are no longer any symbols in the string, this allows me to tag the start of each word with my own identifiable symbol, a leading minus sign. This prevents partial word matches with charindex
in the next step.
In the words
CTE, I break the string into words using string_split
, using charindex
to get the word position. Since charindex
will get the first whole word match, and there might be more than one whole word match, I reverse both terms. The presumption here is that we prefer a "month like" component that appears prior to a "year-like" component.
Finally, I look for a ("month-like", "year-like") pair, prioritising pairs where the month appears before the year, and then on the smallest separation between month and year.
Honestly, the biggest gap in the logic comes from the fact that string_split
doesn't return any position information, which is dumb, forcing a reliance on charindex
, which is not reliable in the case of duplicate words. (string_split
may apparently be fixed in SQL Server 2022, in the meantime you can roll your own split and get rid of the charindex
logic for word order).
I do not recommend running this against a large number of sentences.
create table #sentences
(
sentence varchar(1024)
);
delete from #sentences;
insert #sentences values
('The last date was when I was 5 years old in May 2020'),
('In 05/1999, I was there.'),
('expect no result, no month here 2022'),
('expect no result, no year here may'),
('expect no result, no year here 05'),
('expect no result, 123456789'),
('expect no result, 01 02 03 04'),
('expect feb2022 result, leading month prioritised over trailing month, 02 2022 01'),
('expect jun2022 result, 6 5 6 2022'),
('expect aug2022 result, in August 2022 I ate 2000 jelly beans'),
('expect aug2022 result, in August 2022 I may have eaten 2000 jelly beans');
with months as
(
select *
from (
values (1, '-Jan', '-January'),
(2, '-Feb', '-February'),
(3, '-Mar', '-March'),
(4, '-Apr', '-April'),
(5, '-May', '-May'),
(6, '-Jun', '-June'),
(7, '-Jul', '-July'),
(8, '-Aug', '-August'),
(9, '-Sep', '-September'),
(10, '-Oct', '-October'),
(11, '-Nov', '-November'),
(12, '-Dec', '-December')
) t (monthNum, shortName, longName)
),
ints as
(
select top 1024 rn = row_number() over (order by message_id) from sys.messages
),
cleansed as
(
select sentence = '-'
replace
(
string_agg(t.c, '') within group (order by i.rn),
' ',
' -'
)
from ints i
cross apply (
select sentence,
c = iif(t.c like '[0-9a-Z ]', t.c, ' ')
from #sentences
cross apply ( values (substring(sentence, i.rn, 1))) t(c)
where i.rn <= len(sentence)
and sentence like '%[1-2][0-9][0-9][0-9]%'
) t
group by sentence
),
words as
(
select sentence,
t.word,
pos = row_number() over (partition by sentence order by u.pos desc)
from cleansed
cross apply (select * from string_split(sentence, ' ')) t(word)
cross apply (select charindex(reverse(t.word), reverse(sentence))) u(pos)
),
priortisedMatches as
(
select [priority] = row_number() over (partition by w.sentence order by t.separation asc),
[MM/yyyy] = concat
(
format(t.monthNum, '00/'),
replace(w.word, '-', '')
),
cleanSentence = replace(w.sentence, '-', '')
from words w
cross apply (
select top 1
m.monthNum,
separation = abs(w1.pos - w.pos)
from words w1
join months m on w1.word in (m.shortName, m.longName)
or -try_cast(word as int) = m.monthNum
where w1.sentence = w.sentence
order by sign(w1.pos - w.pos) asc,
abs(w1.pos - w.pos) asc
) t
where w.word like '-[1-2][0-9][0-9][0-9]'
)
select [MM/yyyy], cleanSentence
from priortisedMatches
where [priority] = 1;
/* produces:
08/2022 expect aug2022 result in August 2022 I ate 2000 jelly beans
08/2022 expect aug2022 result in August 2022 I may have eaten 2000 jelly beans
02/2022 expect feb2022 result leading month prioritised over trailing month 02 2022 01
06/2022 expect jun2022 result 6 5 6 2022
05/1999 In 05 1999 I was there
05/2020 The last date was when I was 5 years old in May 2020
*/