Let me first share a text:
I am Fox Sin of Greed came on Earth in 1666 BC. due date right after
St. P was build in 16.05.1703 and bluh bluh I moved to Moscow Feb
2nd, 2022 to work as per deadline And today I read manga Due date for
my project is September 12, 2022 I wonder if Ill be able to pay by Oct
07, 2023 and so The deadline is unknown by I assume would be 9102023
Bluh bluh Due Date 12-11-2022 30/08/2021 and 9/19/23
This is randomly generated text for testing dateparser and regex. I wrote a function that is fairly good at recognising dates with regex, except for those in the format [month as letters] [day as number], [year as number]. For those I usually use dateparser, as it is capable of recognising them. However, it fails when the text contains 'trigger words' such as 'may', 'to pay' (??) and the like. Example:
I moved to Moscow Feb 2nd, 2022 to work as per deadline
[('to', datetime.datetime(2022, 9, 8, 0, 0)), ('Feb 2nd, 2022 to', datetime.datetime(2022, 2, 2, 0, 0))]
This is good. It recognised 'Feb 2nd, 2022', even though it appended 'to' to the match.
But the next one:
I wonder if Ill be able to pay by Oct 07, 2023 and so
[('to pay', datetime.datetime(2022, 9, 8, 0, 0)), ('07, 2023', datetime.datetime(2023, 7, 8, 0, 0))]
It failed to connect 'Oct' to '07, 2023'.
This is used for extracting data from invoices, and I have no control over the formats in which the dates arrive, so I was wondering whether more experienced dateparser users (or users of other Python tools) can help me avoid this problem. Right now it seems to me that I need to avoid words such as 'may', 'to pay', 'now', etc.
CodePudding user response:
If you know the language of the target text, you can specify it, which should prevent problems caused by a bad language guess. After specifying the language as en,
I get one date, as expected:
from dateparser.search import search_dates
print(search_dates('I wonder if Ill be able to pay by Oct 07, 2023 and so', languages=['en']))
which gives the output
[('by Oct 07, 2023 and', datetime.datetime(2023, 10, 7, 0, 0))]
Nonetheless, the docs warn that
Warning: Support for searching dates is really limited and needs a lot of improvement
so you should be prepared for results that are still not what you want.
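If search_dates still proves unreliable, one possible fallback is to cover the [month as letters] [day], [year] format with a regex as well, so that words like 'to' or 'pay' can never be matched. The following is only a minimal sketch using the standard library, not part of dateparser; the names MONTHS, MONTH_NAME_DATE and find_month_name_dates are made up for illustration, and it assumes English month names and four-digit years.

```python
import re
from datetime import datetime

# First three letters of each English month name mapped to its number.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# [month as letters] [day with optional ordinal suffix], [4-digit year]
MONTH_NAME_DATE = re.compile(
    r"\b(?P<month>jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?"
    r"|july?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?"
    r"|dec(?:ember)?)\.?\s+"
    r"(?P<day>\d{1,2})(?:st|nd|rd|th)?,?\s+"
    r"(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def find_month_name_dates(text):
    """Return (matched substring, datetime) pairs, like search_dates does."""
    results = []
    for match in MONTH_NAME_DATE.finditer(text):
        month = MONTHS[match.group("month")[:3].lower()]
        try:
            parsed = datetime(int(match.group("year")), month, int(match.group("day")))
        except ValueError:
            continue  # skip impossible dates such as "Feb 30, 2022"
        results.append((match.group(0), parsed))
    return results


print(find_month_name_dates("I wonder if Ill be able to pay by Oct 07, 2023 and so"))
# [('Oct 07, 2023', datetime.datetime(2023, 10, 7, 0, 0))]
```

Because a month name only counts when it is directly followed by a day and a year, a bare 'may' or 'to pay' produces no match. The trade-off is that this only handles the one format it describes, so it complements rather than replaces dateparser.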