How to get min date and max date from pandas df based on another column

I have a pandas df as below:

Name	Date1	Date2
One	199007	199010
One	199206	199206
One	199505	199505
Two	19880701	19880701
Two	19980704	19980704
Three	2020	2020
Three	2022	2022

Dates can be in any of the following formats: yyyy-mm, yyyymmdd, yyyymm, yyyy, yyyy-mm-dd

The requirement: for each value in the Name column, find the minimum value of the Date1 column and the maximum value of the Date2 column. The expected output looks like this:

Name	Date1	Date2
One	199007	199505
Two	19880701	19980704
Three	2020	2022

I have tried converting the date1 and date2 columns as below: df['date1']=pd.to_datetime(df['date1'])

But it throws an error: month must be in 1...12 199007

I also tried this line of code: df['date1']=pd.to_datetime(df['date1'],format='%Y%m%d',errors='ignore')

but it only suppresses the error, and nothing changes in the pandas df.

What I am trying to do is first read the date1 and date2 columns as datetimes, and then find the minimum of date1 and the maximum of date2 for each Name, collapsing the duplicate Name rows into one.
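To illustrate why a single format cannot cover this column, here is a minimal sketch using three values from the table above, assuming they are stored as strings. With errors="coerce", values that do not match the format come back as NaT rather than raising or being silently left alone, so the mismatches become visible.

```python
import pandas as pd

# Three values from the sample table, one per format (assumed to be strings).
dates = pd.Series(["199007", "19880701", "2020"])

# errors="coerce" turns every value that does not match the format into NaT.
parsed = pd.to_datetime(dates, format="%Y%m%d", errors="coerce")

# Only the 8-digit value matches "%Y%m%d"; the other two are NaT.
print(parsed)
```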

CodePudding user response:

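You can use .groupby and .agg to take the minimum of Date1 and the maximum of Date2 for each Name. Because groupby sorts the group keys alphabetically by default, sort the Name column against a custom list to restore the original order. Finally, convert the Date columns to datetime by trying each format in turn.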

import pandas as pd

# Per Name: smallest Date1 and largest Date2.
df = df.groupby("Name").agg({"Date1": "min", "Date2": "max"}).reset_index()

# Restore the original Name order.
sort_list = ["One", "Two", "Three"]
list_series = pd.Series(range(len(sort_list)), index=sort_list)
df = df.sort_values("Name", key=lambda x: x.map(list_series)).reset_index(drop=True)

# Try each format in turn; values that do not match become NaT
# and are filled in by the next format.
date_columns = list(df.columns[df.columns.str.contains("Date")])
for column in date_columns:
    df[column] = (
        pd.to_datetime(df[column], format="%Y%m%d", errors="coerce")
        .fillna(pd.to_datetime(df[column], format="%Y%m", errors="coerce"))
        .fillna(pd.to_datetime(df[column], format="%Y", errors="coerce"))
    )

print(df)

    Name      Date1      Date2
0    One 1990-07-01 1995-05-01
1    Two 1988-07-01 1998-07-04
2  Three 2020-01-01 2022-01-01
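The approach above aggregates the raw values before converting them, which is fine as long as every row of a group uses the same format. A possible alternative, sketched below, parses first and aggregates the resulting datetimes. It is only a sketch under some assumptions: the dates are stored as strings, the index is unique, and the hyphenated forms (yyyy-mm, yyyy-mm-dd) can be reduced to their digit-only equivalents by dropping the hyphens. The names parse_mixed and FORMATS are hypothetical helpers, not part of pandas.

```python
import pandas as pd

# Sample data from the question, assumed to be stored as strings.
df = pd.DataFrame(
    {
        "Name": ["One", "One", "One", "Two", "Two", "Three", "Three"],
        "Date1": ["199007", "199206", "199505", "19880701", "19980704", "2020", "2022"],
        "Date2": ["199010", "199206", "199505", "19880701", "19980704", "2020", "2022"],
    }
)

# Hypothetical lookup: number of digits -> format string.
FORMATS = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d"}


def parse_mixed(column: pd.Series) -> pd.Series:
    """Parse yyyy / yyyymm / yyyymmdd strings, with or without hyphens."""
    # Drop hyphens so yyyy-mm and yyyy-mm-dd become yyyymm and yyyymmdd.
    digits = column.astype(str).str.replace("-", "", regex=False)
    # Parse each length group with its own format.
    parts = [
        pd.to_datetime(digits[digits.str.len() == length], format=fmt, errors="coerce")
        for length, fmt in FORMATS.items()
    ]
    # Reassemble in the original row order (assumes a unique index);
    # rows with any other length end up as NaT.
    return pd.concat(parts).reindex(column.index)


for col in ["Date1", "Date2"]:
    df[col] = parse_mixed(df[col])

# sort=False keeps the groups in order of first appearance,
# so no custom sort list is needed.
result = (
    df.groupby("Name", sort=False)
    .agg(Date1=("Date1", "min"), Date2=("Date2", "max"))
    .reset_index()
)
print(result)
```

Choosing the format from the number of digits means the result does not depend on one format failing before the next is tried.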