Home > Back-end >  Fetching date in different formats from string in pandas dataframe
Fetching date in different formats from string in pandas dataframe

Time:01-21

below is the text from which i want to fetch the dates in different format.

"Sales Assistant @ DFS Duration - June 2021 - 2023 Currently working in XYZ Within the role I am expected to achieve sales targets which I currently have no problems reaching. Job Role/Establishment - Plasterer @ XX Plasterer’s Duration - September 2016 - Nov 2016 A Job Role/Establishment - Customer Advisor @ AA Duration - (2015 – 2016) Job Role/Establishment - Warehouse Operative @ xyz Duration - 03/2014 to 08/2015 In the xyz warehouse Job Role/Establishment - Airport Terminal Assistant @ port Duration - 01/2012 - 06/2013 Working at the airport . Job Role/Establishment - Apprentice Floorer @ YY Floors Duration - DEC 2010 to APRIL 2012a" (12/03/2020)-(2/11/2021) Fetch dates with different formats @TEST Duration - (December- March2022) and thsi is test @BLA Duration - (July-December 2019) - This is test trying to fetch dates with diff formats - 05/22 - 2023. @ Plasterer's Duration - 10/21 - 05/22 16-17 other starts from 31 september 2022 to 01 january 2023 towards ends it starts from july 2022 - january 2023 .

This is the regex logic check here to see regex pattern that matches with the majority of date format but still misses the date in 31 september 2022 to 01 january 2023 this format. Also current pattern fetches 16-17 which is not required

\(?(?:\b[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d(?:\d{2})?\)?\s*(?:–|-|[Tt][Oo])\s*\(?(?:[A-Za-z]{3,9}\s*)?(?:\d\d?\/){0,2}[12]\d(?:\d{2})?\)?|\(\s*[A-Za-z]{3,9}\s*[--]\s*[A-Za-z]{3,9}\s*[12]\d{3}\s*\)

what changes needs to be made.? any leads.? or any other efficient way to fetch the same..?

CodePudding user response:

You can make the alternatives more specific and use a case insensitive match:

\(\d\d?/\d\d?/\d{4}\)\s*[-–]\s*\(\d\d?/\d\d?/\d{4}\)|\((?:[A-Za-z]{3,9}|\d{4})\s*[-–]\s*(?:[A-Za-z]{3,9})?\s*\d{4}\)|\b\d\d?\s [A-Za-z]{3,9}\s*[-–]\s*\d{4}\s to\s \d\d?\s [A-Za-z]{3,9}\s \d{4}\b|\b[A-Za-z]{3,9}\s \d{4}\s*(?:[-–]|to)(?:\s*[A-Za-z]{3,9})?\s \d{4}|\b\d\d?/(?:\d{4}|\d\d?)\s (?:to|[-–])\s (?:\d\d?/)?(?:\d{4}|\d\d?)\b

It is a long pattern, but these are the 5 alternatives with a description.

  • \(\d\d?/\d\d?/\d{4}\)\s*[-–]\s*\(\d\d?/\d\d?/\d{4}\) Match (...)-(...) with digits and / as separator
  • | Or
  • \((?:[A-Za-z]{3,9}|\d{4})\s*[-–]\s*(?:[A-Za-z]{3,9})?\s*\d{4}\) Match (...)-(...) with leading chars a-z or 4 digits, then - followed by optional chars a-z and then 4 digits
  • | Or
  • \b\d\d?\s [A-Za-z]{3,9}\s*[-–]\s*\d{4}\s to\s \d\d?\s [A-Za-z]{3,9}\s \d{4}\b Match digits and chars a-z with - and to
  • | Or
  • \b[A-Za-z]{3,9}\s \d{4}\s*(?:[-–]|to)(?:\s*[A-Za-z]{3,9})?\s \d{4} Match chars a-z with 4 digits, to or - and optional chars followed by 4 digits
  • | Or
  • \b\d\d?/(?:\d{4}|\d\d?)\s (?:to|[-–])\s (?:\d\d?/)?(?:\d{4}|\d\d?)\b Match 1 or 2 digits followed by / and 1,2 or 4 digits. Then match to or - and again 1-2 digits / and 1,2 or 4 digits

See a regex demo.

  • Related