Home > Blockchain >  Extract words (date) from free text
Extract words (date) from free text

Time:08-01

I am trying to extract the date part (so the May 2020) from this column and put in the form mm/yyyy in a new column for another table. But because this column can be free text, it can look different.

1.The last date was when I was 5 years old in May 2020.

2.In 05/1999, I was there.

Expected result for my new field:

05/2020

05/1999

I honestly just want to be able to cover ground as much as I can, I'm not sure how to extract the year from the text which is the biggest problem though because sometimes there can be numbers before and after the year like in 1.

I am using SQL Server Management Studio for this BTW.

CodePudding user response:

Here is a working but silly solution building upon Stu's idea. The question is not really answerable in any guaranteed way anyway, so this was just an exercise for fun.

I build a table of months that I will recognise, and a utility table of integers up to 1024.

In the cleansed CTE I use a strip out any characters that aren't either letters, numbers or a space character by checking each character individually with the help of the ints CTE, and then put the clean letters back together with string_agg.

Since I know there are no longer any symbols in the string, this allows me to tag the start of each word with my own identifiable symbol, a leading minus sign. This prevents partial word matches with charindex in the next step.

In the words CTE, I break the string into words using string_split, using charindex to get the word position. Since charindex will get the first whole word match, and there might be more than one whole word match, I reverse both terms. The presumption here is that we prefer a "month like" component that appears prior to a "year-like" component.

Finally, I look for a ("month-like", "year-like") pair, prioritising pairs where the month appears before the year, and then on the smallest separation between month and year.

Honestly, the biggest gap in the logic comes from the fact that string_split doesn't return any position information, which is dumb, forcing a reliance on charindex, which is not reliable in the case of duplicate words. (string_split may apparently be fixed in SQL Server 2022, in the meantime you can roll your own split and get rid of the charindex logic for word order).

I do not recommend running this against a large number of sentences.

create table #sentences
(
   sentence varchar(1024)
);

delete from #sentences;

insert #sentences values
('The last date was when I was 5 years old in May 2020'),
('In 05/1999, I was there.'),
('expect no result, no month here 2022'),
('expect no result, no year here may'),
('expect no result, no year here 05'),
('expect no result, 123456789'),
('expect no result, 01 02 03 04'),
('expect feb2022 result, leading month prioritised over trailing month, 02 2022 01'),
('expect jun2022 result, 6 5 6 2022'),
('expect aug2022 result, in August 2022 I ate 2000 jelly beans'),
('expect aug2022 result, in August 2022 I may have eaten 2000 jelly beans');


with months as 
(
   select   *
   from     (
               values (1,  '-Jan', '-January'),
                      (2,  '-Feb', '-February'),
                      (3,  '-Mar', '-March'),
                      (4,  '-Apr', '-April'),
                      (5,  '-May', '-May'),
                      (6,  '-Jun', '-June'),
                      (7,  '-Jul', '-July'),
                      (8,  '-Aug', '-August'),
                      (9,  '-Sep', '-September'),
                      (10, '-Oct', '-October'),
                      (11, '-Nov', '-November'),
                      (12, '-Dec', '-December')
            ) t (monthNum, shortName, longName)
),

ints as
(
   select top 1024 rn = row_number() over (order by message_id) from sys.messages
),

cleansed as
(
   select      sentence =  '-' 
                             replace
                           (
                              string_agg(t.c, '') within group (order by i.rn),
                              ' ', 
                              ' -'
                           )
   from        ints        i
   cross apply (
                  select      sentence,
                              c = iif(t.c like '[0-9a-Z ]', t.c, ' ')
                  from        #sentences 
                  cross apply ( values (substring(sentence, i.rn, 1))) t(c)
                  where       i.rn <= len(sentence)
                              and sentence like '%[1-2][0-9][0-9][0-9]%'
               ) t   
   group by    sentence
),

words as
(
   select      sentence, 
               t.word, 
               pos = row_number() over (partition by sentence order by u.pos desc)
   from        cleansed
   cross apply (select * from string_split(sentence, ' ')) t(word)   
   cross apply (select charindex(reverse(t.word), reverse(sentence))) u(pos)
),
priortisedMatches as
(
   select      [priority]     =  row_number() over (partition by w.sentence order by t.separation asc),
               [MM/yyyy]      =  concat
                                 (
                                    format(t.monthNum, '00/'), 
                                    replace(w.word, '-', '')
                                 ),
               cleanSentence  =  replace(w.sentence, '-', '')
   from        words w
   cross apply (
                  select   top 1 
                           m.monthNum,
                           separation = abs(w1.pos - w.pos)
                  from     words  w1
                  join     months m on w1.word in (m.shortName, m.longName) 
                                       or -try_cast(word as int) = m.monthNum
                  where    w1.sentence = w.sentence
                  order by sign(w1.pos - w.pos) asc,
                           abs(w1.pos - w.pos) asc
               ) t 
   where       w.word like '-[1-2][0-9][0-9][0-9]'
)

select [MM/yyyy], cleanSentence
from   priortisedMatches
where  [priority] = 1;

/* produces:

08/2022     expect aug2022 result  in August 2022 I ate 2000 jelly beans
08/2022     expect aug2022 result  in August 2022 I may have eaten 2000 jelly beans
02/2022     expect feb2022 result  leading month prioritised over trailing month  02 2022 01
06/2022     expect jun2022 result  6 5 6 2022
05/1999     In 05 1999  I was there 
05/2020     The last date was when I was 5 years old in May 2020
*/
  • Related