I have DataFrame columns that consist mainly of dates, but they may also contain arbitrary strings or empty values. I want to extract the year from a column, but I get an error because of the string values. Is there a general way to handle this? Not only for extracting a year or month, but also for other functions that could fail with the same error. I would like to understand the nature of the problem and how to deal with it.
The code looks like this:
import numpy as np
import pandas as pd

dates = {'date': ['11/03/2019', '12/05/2021', '', '11/03/2021', 'x'],
         'date2': ['11/04/2019', '12/03/2021', '11/06/2021', np.nan, 'ab']}
df2 = pd.DataFrame(dates)
df2['year'] = pd.DatetimeIndex(df2['date']).year  # fails on the string 'x'
The error message:
Unknown string format: x
Thank you in advance!
CodePudding user response:
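You can parse the columns with pd.to_datetime and errors="coerce", which turns values that cannot be parsed into NaT instead of raising an error: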
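import numpy as np
import pandas as pd

dates = {'date': ['11/03/2019', '12/05/2021', '', '11/03/2021', 'x'],
         'date2': ['11/04/2019', '12/03/2021', '11/06/2021', np.nan, 'ab']}
df = pd.DataFrame(dates)

# errors="coerce" replaces values that cannot be parsed with NaT
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["date2"] = pd.to_datetime(df["date2"], errors="coerce")

# .dt.year gives NaN for NaT rows, so the year columns are floats
df["year1"] = df["date"].dt.year
df["year2"] = df["date2"].dt.year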
Output:
| | date | date2 | year1 | year2 |
|---|---|---|---|---|
| 0 | 2019-11-03 00:00:00 | 2019-11-04 00:00:00 | 2019.0 | 2019.0 |
| 1 | 2021-12-05 00:00:00 | 2021-12-03 00:00:00 | 2021.0 | 2021.0 |
| 2 | NaT | 2021-11-06 00:00:00 | nan | 2021.0 |
| 3 | 2021-11-03 00:00:00 | NaT | 2021.0 | nan |
| 4 | NaT | NaT | nan | nan |
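The years in the output are floats because a missing year is stored as NaN, which a plain integer column cannot hold. If whole-number years are preferred, one option is pandas' nullable "Int64" dtype. This is a minimal sketch using only the date column from the question; the column name year is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"date": ["11/03/2019", "12/05/2021", "", "11/03/2021", "x"]})
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# .dt.year returns floats when NaT is present; casting to the nullable
# "Int64" dtype keeps whole-number years and marks missing ones as <NA>.
df["year"] = df["date"].dt.year.astype("Int64")
print(df)
```

The rows that failed to parse stay in the frame, so they can still be inspected or filtered later.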
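If you don't want any null values in your DataFrame, call df.dropna(inplace=True) before adding the year1 and year2 columns. This drops every row in which either date column is NaT, and with no missing values left the year columns come out as integers rather than floats.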
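For a more general pattern, it may help to coerce several columns in one place and to look at which raw values failed before dropping anything. The sketch below assumes the same data as the question; coerce_dates is a hypothetical helper written for this example, not a pandas function.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "date": ["11/03/2019", "12/05/2021", "", "11/03/2021", "x"],
    "date2": ["11/04/2019", "12/03/2021", "11/06/2021", np.nan, "ab"],
})

def coerce_dates(frame, columns):
    """Hypothetical helper: parse the given columns, turning bad values into NaT."""
    out = frame.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

parsed = coerce_dates(raw, ["date", "date2"])

# Raw values that were present but could not be parsed as dates
failed = raw["date"][parsed["date"].isna() & raw["date"].notna()]
print(failed.tolist())

# Drop only the rows where "date" is missing, instead of rows with any NaT
clean = parsed.dropna(subset=["date"])
print(clean)
```

Keeping the raw frame next to the parsed one makes it possible to tell values that were missing from the start apart from values that were present but malformed.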