My pandas DataFrame df3 looks like this:
df3 = pd.DataFrame([['23.02.2012', '23.02.2012', 'aaa'], ['27.02.2014', '27.02.2014', 'bbb'], ['17.08.2018', '17.08.2018', 'ccc'], ['22.07.2019', '22.07.2019', 'ddd']], columns=['date', 'period', 'text'])
I want the period column to display the following periods if the dates correspond. Since some date values were formatted with a timestamp and some were not, it didn't create the correct period values (without timestamp). That is why I did df3['question_date'].dt.date
df3['date'] = pd.to_datetime(df3['date'], errors = 'coerce')
df3['question_date'] = df3['question_date']
df3['period'] = df3['date']
col_name = 'period'
strt_col = df3.pop(col_name)
df3.insert(5, col_name, strt_col)
date1 = pd.Timestamp('1990-10-14').date()
date2 = pd.Timestamp('1994-11-10').date()
date3 = pd.Timestamp('1999-10-1').date()
date4 = pd.Timestamp('2004-6-13').date()
date5 = pd.Timestamp('2009-8-30').date()
date6 = pd.Timestamp('2014-10-14').date()
date7 = pd.Timestamp('2019-11-26').date()
date8 = pd.Timestamp('2021-9-20').date()
mask1 = (df3['question_date'] >= 'date1') & (df3['question_date'] < 'date2')
mask2 = (df3['question_date'] >= 'date2') & (df3['question_date'] < 'date3')
mask3 = (df3['question_date'] >= 'date3') & (df3['question_date'] < 'date4')
mask4 = (df3['question_date'] >= 'date4') & (df3['question_date'] < 'date5')
mask5 = (df3['question_date'] >= 'date5') & (df3['question_date'] < 'date6')
mask6 = (df3['question_date'] >= 'date6') & (df3['question_date'] < 'date7')
mask7 = (df3['question_date'] >= 'date7') & (df3['question_date'] < 'date8')
df3.loc[mask1, 'leg_per'] = '1990-1994'
df3.loc[mask2, 'leg_per'] = '1994-1999'
df3.loc[mask3, 'leg_per'] = '1999-2004'
df3.loc[mask4, 'leg_per'] = '2004-2009'
df3.loc[mask5, 'leg_per'] = '2009-2014'
df3.loc[mask6, 'leg_per'] = '2014-2019'
df3.loc[mask7, 'leg_per'] = '2019-2021'
.
.
.
At mask1 it throws this error:
TypeError: '>=' not supported between instances of 'datetime.date' and 'str'
Original question: preventing timestamp creation in to_datetime() formatting in order to group by periods
CodePudding user response:
Here's a slightly modified version of your example:
import pandas as pd
df3 = pd.DataFrame([['23.02.2012', 'aaa'], ['27.02.2014', 'bbb'],
['17.08.2018', 'ccc'], ['22.07.2019', 'ddd']],
columns=['date', 'text'])
# ensure datetime64[ns] type
df3['date'] = pd.to_datetime(df3['date'], dayfirst=True)
Note that for the date/time format in the example, you need to specify that the day comes first if you don't provide a format. Also, use errors='coerce' only if you know what you're doing; I always prefer errors to be raised first.
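As a rough sketch of the difference: passing an explicit format (here '%d.%m.%Y', which matches the example data) makes the day-first reading unambiguous, and the default error handling raises on bad input, whereas errors='coerce' silently turns it into NaT. The Series raw and its 'not a date' entry are made up for illustration.

```python
import pandas as pd

# Hypothetical input: two valid day-first dates and one bad value
raw = pd.Series(['23.02.2012', '27.02.2014', 'not a date'])

# Default behaviour (errors='raise'): the bad value surfaces immediately
try:
    pd.to_datetime(raw, format='%d.%m.%Y')
    failed = False
except ValueError as exc:
    failed = True
    print('parse failed:', exc)

# errors='coerce': the bad value becomes NaT without any warning
parsed = pd.to_datetime(raw, format='%d.%m.%Y', errors='coerce')
print(parsed.isna().sum(), 'value(s) could not be parsed')
```

If you do use errors='coerce', counting the NaT values afterwards, as above, at least tells you how much was dropped.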
Now let's create a mask for a certain date range. Note that pandas is clever enough that you can use dates as strings (see also Indexing; the same applies to comparisons):
date1 = '2010-10-14'
date2 = '2014-11-10'
m = (df3['date'] >= date1) & (df3['date'] < date2)
df3.loc[m, 'leg_per'] = '2010-2014'
The dummy df then looks like
df3
date text leg_per
0 2012-02-23 aaa 2010-2014
1 2014-02-27 bbb 2010-2014
2 2018-08-17 ccc NaN
3 2019-07-22 ddd NaN
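To cover all the periods from the question without writing one mask per period, the same comparison can be run in a loop over consecutive boundary dates. This is only a sketch: the boundaries are the date1 to date8 values from the question, while the name edges and the way the labels are built from the boundary years are my own choices.

```python
import pandas as pd

df3 = pd.DataFrame([['23.02.2012', 'aaa'], ['27.02.2014', 'bbb'],
                    ['17.08.2018', 'ccc'], ['22.07.2019', 'ddd']],
                   columns=['date', 'text'])
df3['date'] = pd.to_datetime(df3['date'], format='%d.%m.%Y')

# Period boundaries taken from the question (date1 ... date8)
edges = pd.to_datetime(['1990-10-14', '1994-11-10', '1999-10-01', '2004-06-13',
                        '2009-08-30', '2014-10-14', '2019-11-26', '2021-09-20'])

# Start with an empty object column so string labels can be assigned later
df3['leg_per'] = None

# Each period is half-open: start <= date < end
for start, end in zip(edges[:-1], edges[1:]):
    m = (df3['date'] >= start) & (df3['date'] < end)
    df3.loc[m, 'leg_per'] = f'{start.year}-{end.year}'

print(df3)
```

Rows whose date falls outside every period keep the initial None.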
In general, if you work with date/time in pandas and want to save yourself some trouble: use the datetime64 data type pandas has to offer. Avoid Python's datetime, date and time classes. They will cause issues like the one you're experiencing and reduce the functionality you have at hand.
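If the goal is only to get rid of the time-of-day part, one option (my suggestion, not something from the question) is .dt.normalize(), which sets every time to midnight but keeps the datetime64 dtype. The sketch below contrasts it with .dt.date, which yields an object column of Python date objects and reproduces the TypeError. The Series stamps is made-up sample data.

```python
import pandas as pd

# Hypothetical data: one value with a time of day, one without
stamps = pd.Series([pd.Timestamp('2012-02-23 14:30'), pd.Timestamp('2014-02-27')])

# normalize() drops the time of day but stays datetime64
normalized = stamps.dt.normalize()
print(normalized.dtype)

# .dt.date converts to Python date objects (object dtype)
as_dates = stamps.dt.date
print(as_dates.dtype)

# Comparing date objects with a string fails, as in the question
try:
    as_dates >= '2012-02-23'
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# The same comparison works on the datetime64 column
in_range = normalized >= '2012-02-23'
print(in_range.tolist())
```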