dateparser python ignore trigger words

Time:09-08

Let me first share a text:

I am Fox Sin of Greed came on Earth in 1666 BC. due date   right after
St. P was build in 16.05.1703 and bluh bluh  I moved to Moscow Feb
2nd, 2022 to work as per deadline  And today I read manga Due date for
my project is September 12, 2022 I wonder if Ill be able to pay by Oct
07, 2023 and so  The deadline is unknown by I assume would be 9102023
Bluh bluh Due Date 12-11-2022 30/08/2021 and 9/19/23

This is randomly generated text for testing dateparser and regex. I wrote a function that is pretty good at recognising dates with regex, except for those in the format [month as letters] [day as number], [year as number]. That is where I usually use dateparser, as it is capable of recognising them. However, when there are 'trigger words' such as 'may', 'to pay' (??) and the like, it fails. Example:

I moved to Moscow Feb 2nd, 2022 to work as per deadline

 [('to', datetime.datetime(2022, 9, 8, 0, 0)), ('Feb 2nd, 2022 to', datetime.datetime(2022, 2, 2, 0, 0))]

This is good. It recognised 'Feb 2nd, 2022', even though it attached 'to' to the match.

But next one:

I wonder if Ill be able to pay by Oct 07, 2023 and so

[('to pay', datetime.datetime(2022, 9, 8, 0, 0)), ('07, 2023', datetime.datetime(2023, 7, 8, 0, 0))]

It failed to connect October to '07, 2023'.

This is used for extracting data from invoices, and I have no control over the formats in which dates arrive, so I was wondering whether more experienced dateparser users (or users of other Python tools) can help me avoid this problem. Right now it seems to me that I need to avoid words such as 'may', 'to pay', 'now', etc.

CodePudding user response:

If you know the language of the target text, you can provide it, which should prevent problems caused by a wrong language guess. After specifying the language en, I get one date, as expected:

from dateparser.search import search_dates
print(search_dates('I wonder if Ill be able to pay by Oct 07, 2023 and so', languages=['en']))

gives output

[('by Oct 07, 2023 and', datetime.datetime(2023, 10, 7, 0, 0))]

Nonetheless, the docs warn:

Warning Support for searching dates is really limited and needs a lot of improvement

so you should be prepared to still get results that are not as desired.
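Since the search may still misfire, two standard-library workarounds can be sketched as a safety net. This is only a sketch under stated assumptions, not part of dateparser: the names drop_digitless_matches, find_month_day_year and MONTH_DAY_YEAR are hypothetical. The first helper assumes that a genuine date substring always contains at least one digit, which holds for the invoice-style dates in the question but would also discard relative expressions such as 'today'. The second is a plain regex for the [month as letters] [day], [year] shape only, assuming English month names.

```python
import re
from datetime import datetime

# Assumption: English month names, matched by their first three letters.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Hypothetical pattern for "Feb 2nd, 2022", "September 12, 2022", "Oct 07, 2023".
# Note: [a-z]* accepts any letters after the three-letter prefix, so a word
# like "Mayhem" followed by "12, 2022" would also match.
MONTH_DAY_YEAR = re.compile(
    r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?\s+"
    r"(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b",
    re.IGNORECASE,
)


def drop_digitless_matches(matches):
    """Keep only (substring, datetime) pairs whose substring has a digit.

    `matches` has the shape returned by search_dates: a list of tuples,
    or None when nothing was found.
    """
    return [
        (text, when)
        for text, when in (matches or [])
        if any(ch.isdigit() for ch in text)
    ]


def find_month_day_year(text):
    """Return (matched substring, datetime) pairs for month-name dates."""
    found = []
    for match in MONTH_DAY_YEAR.finditer(text):
        month = MONTHS[match.group(1).lower()]
        day = int(match.group(2))
        year = int(match.group(3))
        try:
            found.append((match.group(0), datetime(year, month, day)))
        except ValueError:
            # Skip impossible dates such as "Feb 31, 2022".
            continue
    return found
```

Applied to the first result list from the question, drop_digitless_matches would discard the ('to', ...) entry and keep ('Feb 2nd, 2022 to', ...). Calling find_month_day_year on the 'to pay by Oct 07, 2023' sentence picks up the month together with the day and year, without any language guessing involved.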
