Parsing dates in different formats from text

Time:01-13

I have a dataframe whose raw text column contains dates in several different formats. I want to extract these dates into a separate column.

Sample raw text:

"Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ Within the role I am expected to achieve sales targets which I currently have no problems reaching. Job Role/Establishment - Plasterer @ XX Plasterer’s Duration - September 2016 - Nov 2016 Job Role/Establishment - Customer Advisor @ AA Duration - (2015 – 2016) Job Role/Establishment - Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment - Airport Terminal Assistant @ port Duration - 01/2012 - 06/2013 Working at the airport . Job Role/Establishment - Apprentice Floorer @ YY Floors Duration - DEC 2010 – APRIL 2012 "

Expected dataframe:

id      Raw_text                   Dates
01     "sample_raw_text"         June 2021 - 2023 , September 2016 - Nov 2016,(2015 – 2016),03/2014 to 08/2015 , 01/2012 - 06/2013, DEC 2010 – APRIL 2012

I have tried the pattern below:

def extract_dates(df, column):
    # Define the regex pattern to match dates in different month formats
    pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}\s*[-–]\s*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}'

    # Extract the dates from the specified column
    df['Dates'] = df[column].str.extract(pattern)

With the above I am unable to get the required output. Please advise what I am missing.

CodePudding user response:

Try this:

\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)
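  • \(? an optional (.

  • (?:\b[A-Za-z]{3,9}\s*)? non-capturing group for an optional month name.

    • \b a word boundary, so the match cannot start in the middle of a word.
    • [A-Za-z]{3,9} between 3 and 9 letters.
    • \s* zero or more whitespace characters.
    • ? makes the whole group optional.
  • (?:\d\d?\/){0,2} non-capturing group for optional numeric date parts such as 03/ or 12/03/.

    • \d a digit between 0-9.
    • \d? an optional second digit between 0-9.
    • \/ a literal forward slash /.
    • {0,2} repeats the group zero, one or two times.
  • [12]\d{3}\)?\s*

    • [12] matches one digit, either 1 or 2.
    • \d{3} three digits between 0-9.
    • \)? an optional ).
    • \s* zero or more whitespace characters.
  • (?:–|-|[Tt][Oo])\s*

    • (?:–|-|[Tt][Oo]) matches –, -, TO, to, To or tO.
    • \s* zero or more whitespace characters.
  • \(? an optional (.

  • (?:[A-Za-z]{3,9}\s*)? explained above, here without the \b.

  • (?:\d\d?\/){0,2} explained above.

  • [12]\d{3} explained above.

  • \)? an optional ).

  • | starts a second alternative, \(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\), which matches a parenthesised range of two month names followed by a single year, such as (Jan - Mar 2015).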
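
The breakdown above can also be written into the pattern itself. The following is a minimal sketch, assuming Python's standard re module and its re.VERBOSE flag. The names DATE_RANGE and sample are illustrative only, and sample is a shortened version of the raw text from the question.

```python
import re

# The same pattern as above, split across lines with re.VERBOSE so that each
# part carries its own comment. Whitespace inside the pattern is ignored.
DATE_RANGE = re.compile(
    r"""
    \(?                          # optional opening parenthesis
    (?:\b[A-Za-z]{3,9}\s*)?      # optional month name such as June or DEC
    (?:\d\d?\/){0,2}             # optional numeric parts such as 03/
    [12]\d{3}                    # four-digit start year beginning with 1 or 2
    \)?\s*
    (?:–|-|[Tt][Oo])             # separator: en dash, hyphen, or the word to
    \s*\(?
    (?:[A-Za-z]{3,9}\s*)?        # optional month name of the end date
    (?:\d\d?\/){0,2}
    [12]\d{3}                    # four-digit end year
    \)?                          # optional closing parenthesis
    |                            # alternative: two month names and one year
    \(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)
    """,
    re.VERBOSE,
)

# Shortened sample (an assumption for illustration, based on the question).
sample = (
    "Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ "
    "Plasterer @ XX Duration - September 2016 - Nov 2016 "
    "Customer Advisor @ AA Duration - (2015 – 2016) "
    "Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 "
    "Airport Terminal Assistant @ port Duration - 01/2012 - 06/2013 "
    "Apprentice Floorer @ YY Floors Duration - DEC 2010 – APRIL 2012"
)

# The pattern has no capturing groups, so findall returns whole matches.
for match in DATE_RANGE.findall(sample):
    print(match)
```

Because every group is non-capturing, findall yields the full matched text of each range rather than tuples of group contents.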

See regex demo
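
To get the dates into a dataframe column, one option is to swap str.extract for str.findall. This is a sketch rather than the only approach: str.extract returns only the first match and one column per capturing group, whereas str.findall returns every match per row. The names PATTERN, extract_dates and the one-row dataframe below are assumptions for illustration, with a shortened sample text.

```python
import pandas as pd

# The answer's pattern, split across string literals for readability.
PATTERN = (
    r"\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*"
    r"\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?"
    r"|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)"
)


def extract_dates(df, column):
    """Return a copy of df with a 'Dates' column listing every matched range."""
    out = df.copy()
    # findall gives a list of matches per row; join flattens it to one string.
    out["Dates"] = out[column].str.findall(PATTERN).str.join(", ")
    return out


# Hypothetical one-row dataframe with a shortened raw text.
df = pd.DataFrame(
    {
        "id": ["01"],
        "Raw_text": [
            "Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 In the xyz "
            "warehouse Apprentice Floorer @ YY Floors Duration - DEC 2010 – APRIL 2012"
        ],
    }
)

result = extract_dates(df, "Raw_text")
print(result["Dates"].iloc[0])  # 03/2014 to 08/2015, DEC 2010 – APRIL 2012
```

Dropping the .str.join(", ") call keeps a list per row instead of a single string, which may be easier to work with if the ranges are parsed further.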
