I have a DataFrame whose raw text column contains date ranges in several different formats. I want to extract these date ranges into a separate column.
Sample raw text:
"Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ Within the role I am expected to achieve sales targets which I currently have no problems reaching. Job Role/Establishment - Plasterer @ XX Plasterer’s Duration - September 2016 - Nov 2016 Job Role/Establishment - Customer Advisor @ AA Duration - (2015 – 2016) Job Role/Establishment - Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment - Airport Terminal Assistant @ port Duration - 01/2012 - 06/2013 Working at the airport . Job Role/Establishment - Apprentice Floorer @ YY Floors Duration - DEC 2010 – APRIL 2012 "
Expected DataFrame:
id Raw_text Dates
01 "sample_raw_text" June 2021 - 2023, September 2016 - Nov 2016, (2015 – 2016), 03/2014 to 08/2015, 01/2012 - 06/2013, DEC 2010 – APRIL 2012
I have tried the pattern below:
def extract_dates(df, column):
    # Define the regex pattern to match dates in different month formats
    pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}\s*[-–]\s*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{1,2}[-,\s]*(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[-,\s]*\d{2,4}'
    # Extract the dates from the specified column
    df['Dates'] = df[column].str.extract(pattern)
With the above I am unable to get the required output. Please advise what I am missing.
CodePudding user response:
Try this:
\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)
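\(? : an optional (.
(?:\b[A-Za-z]{3,9}\s*)? : an optional non-capturing group for a word such as a month name.
    \b : a word boundary, so the letters must start at the beginning of a word.
    [A-Za-z]{3,9} : between 3 and 9 letters.
    \s* : zero or more whitespace characters.
    ? : makes the whole group optional.
(?:\d\d?\/){0,2} : a non-capturing group for numeric parts such as 03/.
    \d\d? : one or two digits between 0 and 9.
    \/ : a literal forward slash /.
    {0,2} : the group may appear zero, one, or two times.
[12]\d{3}\)?\s*
    [12] : one digit from the listed digits, 1 or 2.
    \d{3} : three digits between 0 and 9.
    \)? : an optional ).
    \s* : zero or more whitespace characters.
(?:–|-|[Tt][Oo])\s*
    (?:–|-|[Tt][Oo]) : matches –, -, TO, to, To, or tO.
    \s* : zero or more whitespace characters.
\(? : an optional (.
(?:[A-Za-z]{3,9}\s*)? : explained above.
(?:\d\d?\/){0,2} : explained above.
[12]\d{3} : explained above.
\)? : an optional ).
The second alternative, after the |, matches a parenthesised range whose first part is a word without a year, for example (Jan - Mar 2015).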
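As a rough sketch of how the pattern could be applied: Series.str.extract only returns the capture groups of the first match in each row, whereas the goal here is every match in the text. Because all groups in the pattern above are non-capturing, re.findall returns the whole matched ranges. The names DATE_RANGE and extract_date_ranges and the shortened sample string are illustrative assumptions, not part of the original post.

```python
import re

# Pattern copied from the answer above, split across lines for readability.
# Every group is non-capturing, so findall returns whole matches
# rather than tuples of groups.
DATE_RANGE = re.compile(
    r"\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?"
    r"\s*(?:–|-|[Tt][Oo])\s*"
    r"\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d{3}\)?"
    r"|\(\s*[A-Za-z]{3,9}\s*[–-]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)"
)


def extract_date_ranges(text):
    """Return every date range found in text, joined with ', '."""
    return ", ".join(DATE_RANGE.findall(text))


# Shortened version of the sample text (hypothetical), one range per format.
sample = (
    "Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ "
    "Plasterer @ XX Duration - September 2016 - Nov 2016 "
    "Customer Advisor @ AA Duration - (2015 – 2016) "
    "Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 "
    "Airport Terminal Assistant @ port Duration - 01/2012 - 06/2013 "
    "Apprentice Floorer @ YY Floors Duration - DEC 2010 – APRIL 2012 "
)

print(extract_date_ranges(sample))
```

With pandas, the same function could be applied row by row, for example df['Dates'] = df['Raw_text'].apply(extract_date_ranges), or the pattern string could be passed to df['Raw_text'].str.findall(...) followed by .str.join(', ').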