I have a pandas DataFrame like this:
Name | Date1 | Date2 |
---|---|---|
One | 199007 | 199010 |
One | 199206 | 199206 |
One | 199505 | 199505 |
Two | 19880701 | 19880701 |
Two | 19980704 | 19980704 |
Three | 2020 | 2020 |
Three | 2022 | 2022 |
Dates can be in any of the following formats: yyyy-mm, yyyymmdd, yyyymm, yyyy, yyyy-mm-dd.
The requirement: for each value in the Name column, find the minimum of the Date1 column and the maximum of the Date2 column. The expected result looks like this:
Name | Date1 | Date2 |
---|---|---|
One | 199007 | 199505 |
Two | 19880701 | 19980704 |
Three | 2020 | 2022 |
I tried converting the Date1 and Date2 columns like this: df['Date1'] = pd.to_datetime(df['Date1'])
But it throws an error: month must be in 1..12: 199007
When I run this line instead: df['Date1'] = pd.to_datetime(df['Date1'], format='%Y%m%d', errors='ignore')
the error is only suppressed and nothing changes in the DataFrame.
What I am trying to do is first convert the Date1 and Date2 columns to datetime, and then find the minimum of Date1 and the maximum of Date2 for each Name, collapsing the duplicate names into one row.
CodePudding user response:
You can use .groupby and .agg, then sort the Name column according to a custom list, and finally convert the Date columns to datetime.
import pandas as pd

# Minimum of Date1 and maximum of Date2 for each Name.
df = df.groupby("Name").agg({"Date1": "min", "Date2": "max"}).reset_index()

# groupby sorts Name alphabetically, so restore the custom order.
sort_list = ["One", "Two", "Three"]
list_series = pd.Series(range(len(sort_list)), index=sort_list)
df = df.sort_values("Name", key=lambda x: x.map(list_series)).reset_index(drop=True)

# Try each format in turn; values that match none of them become NaT.
date_columns = list(df.columns[df.columns.str.contains("Date")])
for column in date_columns:
    df[column] = (
        pd.to_datetime(df[column], format="%Y%m%d", errors="coerce")
        .fillna(pd.to_datetime(df[column], format="%Y%m", errors="coerce"))
        .fillna(pd.to_datetime(df[column], format="%Y", errors="coerce"))
    )
print(df)
Name Date1 Date2
0 One 1990-07-01 1995-05-01
1 Two 1988-07-01 1998-07-04
2 Three 2020-01-01 2022-01-01
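Note that the code above aggregates the raw values first and converts them afterwards, and it does not try the hyphenated formats (yyyy-mm, yyyy-mm-dd) listed in the question. If a single Name can mix formats, it may be safer to parse first and aggregate second, as the question originally intended. Below is a minimal sketch of that order, assuming the values are strings or integers with no missing entries. The names FORMATS_BY_LENGTH and parse_mixed_dates are hypothetical helpers made up for this example, not part of pandas. The format is chosen by string length so that a six-digit yyyymm value is never tried against %Y%m%d.

```python
import pandas as pd

# Hypothetical lookup (not a pandas feature): string length -> format to try.
FORMATS_BY_LENGTH = {
    4: "%Y",         # yyyy
    6: "%Y%m",       # yyyymm
    7: "%Y-%m",      # yyyy-mm
    8: "%Y%m%d",     # yyyymmdd
    10: "%Y-%m-%d",  # yyyy-mm-dd
}


def parse_mixed_dates(column: pd.Series) -> pd.Series:
    """Parse mixed-format date values; anything unrecognised becomes NaT."""
    text = column.astype(str).str.strip()
    parsed = pd.Series(pd.NaT, index=column.index, dtype="datetime64[ns]")
    for length, fmt in FORMATS_BY_LENGTH.items():
        mask = text.str.len() == length
        if mask.any():
            # Only rows of this length are parsed with this format.
            parsed[mask] = pd.to_datetime(text[mask], format=fmt, errors="coerce")
    return parsed


# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["One", "One", "One", "Two", "Two", "Three", "Three"],
    "Date1": ["199007", "199206", "199505", "19880701", "19980704", "2020", "2022"],
    "Date2": ["199010", "199206", "199505", "19880701", "19980704", "2020", "2022"],
})

# Parse first, so min/max compare real dates rather than raw values.
for col in ["Date1", "Date2"]:
    df[col] = parse_mixed_dates(df[col])

# sort=False keeps the names in order of first appearance (One, Two, Three),
# so no custom sort list is needed for this sample.
result = (
    df.groupby("Name", sort=False)
    .agg(Date1=("Date1", "min"), Date2=("Date2", "max"))
    .reset_index()
)
print(result)
```

The trade-off is that the original text format of each date is lost: every value becomes a full timestamp, with missing months and days defaulting to 1. If the original strings have to be kept, one option is to store the parsed values in separate helper columns and use them only for the comparison.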