Home > other >  Extracting dates in a pandas dataframe column using regex
Extracting dates in a pandas dataframe column using regex

Time:02-04

I have a data frame with a column Campaign which consists of the campaign name (start date - end date) format. I need to create 3 new columns by extracting the start and end dates.

start_date, end_date, days_between_start_and_end_date. 

The issue is Campaign column value is not in a fixed format, for the below values my code block works well.

1. Season1 hero (18.02. -24.03.2021)

What I am doing in my code snippet is extracting the start date & end date from the campaign column and as you see, start date doesn't have a year. I am adding the year by checking the month value.

import pandas as pd
import re
import datetime

# read csv file
df = pd.read_csv("report.csv")

# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d \.\d )\.\s*-\s*(\d \.\d \.\d )')
df['start_date'] = dates[0]
df['end_date'] = dates[1]

# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')

# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m"))   "."   str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m"))   "."   str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")

but, I have multiple different column values where my regex fail and I receive nan values, for example

1.  Sales is on (30.12.21-12.01.2022)
2.  Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3.  M SALE (19.04 - 04.05.2022)
4.  NEW SALE (29.12.2022-11.01.2023)

in all the above 4 example, my date format is completely different.

expected output

start date     end date 
2021-12-30   2022-01-22
2023-03-24   2023-03-30
2022-04-19   2022-05-04
2022-12-29   2023-01-11

Can someone please help me here?

CodePudding user response:

I would do a basic regex with res

  • Related