Change year based on start and end date in dataframe-CodePudding

I had a column in data frame called startEndDate, example: '10.12-20.05.2019', divided those to columns start_date and end_date with same year, example: start_date '10.12.2019' and end_date '20.05.2019'. But year in this example is wrong, as it should be 2018 because start date cannot be after end date. How can I compare entire dataframe and replace values so it contains correct start_dates based on if statement(because some start dates should stay with year as 2019)?

CodePudding user response：

This will show you which rows the start_date is > than the end date

data = {
    'Start_Date' : ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'End_Date' : ['2020-02-01', '2019-01-02', '2019-01-03', '2020-01-05']
}
df = pd.DataFrame(data)
df['Start_Date'] = pd.to_datetime(df['Start_Date'], infer_datetime_format=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], infer_datetime_format=True)
df['Check'] = np.where(df['Start_Date'] > df['End_Date'], 'Error', 'No Error')
df

Without seeing more of your data or your intended final data this is the best we will be able to do to help identify problems in the data.

CodePudding user response：

This method first splits up the date string to two dates and creates start and end date columns. Then it subtracts 1 year from the start date if it is greater than the end date.

import pandas as pd
import numpy as np

# mock data
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})

# split date string to two dates, convert to datetime and stack to columns
df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x.split("-")[0]   x[-5:],
         x.split("-")[1]], format="%d.%m.%Y")))

# subtract 1 year from start date if greater than end date
df["start"] = np.where(df["start"]>df["end"],
                       df["start"] - pd.DateOffset(years=1),
                       df["start"])

df
#              dates      start        end
#0  10.12-20.05.2019 2018-12-10 2019-05-20
#1  02.04-31.10.2019 2019-04-02 2019-10-31

Although I have used split here for the initial splitting of the string, as there will always be 5 characters before the hyphen, and the date will always be the last 5 (with the .), there is no need to use the split and instead that line could change to:

df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x[:5]   x[-5:],
         x[6:]], format="%d.%m.%Y")))