How can I convert the values in a column to date type when the column holds a mix of data types, as below:
<class 'datetime.datetime'> 296
<class 'str'> 116
<class 'float'> 8
My aim is to ignore the empty rows, split rows that contain multiple dates into one row per date, and convert the datetime values to dates.
A small section of the column to illustrate what the data looks like (the second row is empty in the Event Date column):
Col1 | Event Date |
---|---|
1 | 2020-07-16 00:00:00 |
2 | |
3 | 31/03/2022, 26/11/2018, 31/01/2028 |
I've tried a number of things but have had no luck. I tried looping through the rows to convert each one, but looping isn't the best option. I also tried to split and explode the cells with multiple dates, as below, but this raises an error (dateutil.parser._parser.ParserError: Unknown string format: 31/03/2022, 26/11/2018, 31/01/2028 present at position 3).
df=auto_test_file.assign(dates=auto_test_file['Event Date'].str.split(',')).explode('dates')
pd.to_datetime(df['Event Date'])
CodePudding user response:
You could use explode:
df=df.assign(dates=df['Event Date'].str.split(',')).explode('dates')
df
Out[93]:
   Col1                          Event Date                dates
0     1                 2020-07-16 00:00:00  2020-07-16 00:00:00
1     2                                 NaN                  NaN
2     3  31/03/2022, 26/11/2018, 31/01/2028           31/03/2022
2     3  31/03/2022, 26/11/2018, 31/01/2028           26/11/2018
2     3  31/03/2022, 26/11/2018, 31/01/2028           31/01/2028
Then convert the new dates column to datetime:
pd.to_datetime(df.dates)
Out[94]:
0 2020-07-16
1 NaT
2 2022-03-31
2 2018-11-26
2 2028-01-31
Name: dates, dtype: datetime64[ns]
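One caveat: `.str.split` returns NaN for any cell that is not a string, so in a column like the one in the question (real `datetime` objects mixed with strings and NaN) the datetime rows would come out empty. Below is a minimal sketch of one way around that, assuming the multi-date strings are always in DD/MM/YYYY form. The sample frame and the helper names `to_list` and `parse_one` are made up for illustration, and the helpers run per cell, which should be fine for a few hundred rows.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question: a datetime object,
# a missing value (float NaN) and a comma-separated string of dates.
df = pd.DataFrame({
    "Col1": [1, 2, 3],
    "Event Date": [
        datetime(2020, 7, 16),
        np.nan,
        "31/03/2022, 26/11/2018, 31/01/2028",
    ],
})


def to_list(value):
    """Split strings on commas; wrap any other value in a one-item list."""
    if isinstance(value, str):
        return [part.strip() for part in value.split(",")]
    return [value]


def parse_one(value):
    """Return a date from a datetime object or a DD/MM/YYYY string (assumed format)."""
    if isinstance(value, datetime):
        return value.date()
    return datetime.strptime(value, "%d/%m/%Y").date()


exploded = (
    df.assign(dates=df["Event Date"].map(to_list))
      .explode("dates")            # one row per date
      .dropna(subset=["dates"])    # ignore the empty rows
      .reset_index(drop=True)
)
exploded["dates"] = exploded["dates"].map(parse_one)
print(exploded[["Col1", "dates"]])
```

Using an explicit format string avoids any day/month guessing for values such as 01/02/2022.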
CodePudding user response:
Suggested code
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Col1': [1, 2, 3],
'Event': ['2020-07-16 00:00:00','' , '31/03/2022, 26/11/2018, 31/01/2028'],
})
# Col1 Event
# 0 1 2020-07-16 00:00:00
# 1 2
# 2 3 31/03/2022, 26/11/2018, 31/01/2028
# 1 - Split each Event value on commas and strip surrounding whitespace
df['Event'] = df['Event'].apply(lambda r: [d.strip() for d in r.split(',')])
# 2 - Explode to one row per date, then reset the index
df = df.explode(column='Event').reset_index(drop=True)
# 3 - Replace '' with NaN
df.replace(to_replace='', value=np.nan, inplace=True)
# 4 - Drop rows with NaN
df.dropna(inplace=True)
# 5 - Convert to date (the strings use two formats, so pandas >= 2.0 may need format='mixed' here)
df['Event'] = pd.to_datetime(df['Event']).dt.date
Output
Col1 Event
0 1 2020-07-16
2 3 2022-03-31
3 3 2018-11-26
4 3 2028-01-31
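Because the column mixes two string layouts, letting `pd.to_datetime` guess the format can fail or misread day and month, depending on the pandas version. A possible vectorized alternative, sketched below, parses the column once per expected format with `errors="coerce"` and combines the results. The two format strings are assumptions based on the sample rows in the question, and the `events` series is made-up data standing in for the exploded column.

```python
import pandas as pd

# Hypothetical exploded values, including the leading spaces left by split(',')
events = pd.Series(
    ["2020-07-16 00:00:00", "31/03/2022", " 26/11/2018", " 31/01/2028"]
).str.strip()

# Parse once per expected layout; values that do not match become NaT
iso = pd.to_datetime(events, format="%Y-%m-%d %H:%M:%S", errors="coerce")
dmy = pd.to_datetime(events, format="%d/%m/%Y", errors="coerce")

# Take the ISO result where it parsed, otherwise the DD/MM/YYYY result
parsed = iso.fillna(dmy)

# Keep only the date part, as requested in the question
dates = parsed.dt.date
print(dates)
```

Any value matching neither format stays NaT, which makes unexpected inputs easy to spot with `parsed.isna()`.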