I have a df
id date eventdate
A 2020Q1 2020Q3
A 2020Q2 2020Q3
A 2020Q3 2020Q3
A 2020Q4 2020Q3
B 2019Q1 2019Q2
B 2019Q2 2019Q2
B 2019Q3 2019Q2
B 2019Q4 2019Q2
C 2020Q1 NaN
C 2020Q2 NaN
C 2020Q3 NaN
C 2020Q4 NaN
D 2019Q2 NaN
D 2019Q3 NaN
D 2019Q4 NaN
...
I want to calculate the time index around the event date such that if both dates match, t=0, and -t and t around the event date.
desired df
id date eventdate t
A 2020Q1 2020Q3 -2
A 2020Q2 2020Q3 -1
A 2020Q3 2020Q3 0
A 2020Q4 2020Q3 1
B 2019Q1 2019Q2 -1
B 2019Q2 2019Q2 0
B 2019Q3 2019Q2 1
B 2019Q4 2019Q2 2
C 2020Q1 NaN NaN
C 2020Q2 NaN NaN
C 2020Q3 NaN NaN
C 2020Q4 NaN NaN
D 2019Q2 NaN NaN
D 2019Q3 NaN NaN
D 2019Q4 NaN NaN
So i use the following code
df['t'] = (pd.PeriodIndex(df['date'], freq='Q').astype('int') - pd.PeriodIndex(df['eventdate'], freq='Q').astype('int'))
and i get strangely big negative numbers for which eventdates aren't available
count 8.812475e 06
mean -4.565033e 18
std 4.611450e 18
min -9.223372e 18
25% -9.223372e 18
50% -8.000000e 00
75% 3.000000e 00
max 2.300000e 01
Name: t, dtype: float64
Any idea what's causing this issue? thanks.
CodePudding user response:
Problem is that pd.PeriodIndex
converts NaN
value to NaT
. When you convert a NaT
value to int with .astype('int')
, it gives -9223372036854775808
.
You can check the data is NaT
when accessing the n
attribute
df['t'] = ((pd.PeriodIndex(df['date'], freq='Q') -
pd.PeriodIndex(df['eventdate'], freq='Q'))
.map(lambda x: np.nan if x != x else x.n))
# or use pd.isna
# .map(lambda x: np.nan if pd.isna(x) else x.n)
Or mask the NaN
value
df['t'] = ((pd.PeriodIndex(df['date'], freq='Q').astype('int') -
pd.PeriodIndex(df['eventdate'], freq='Q').astype('int'))
.to_series(range(len(df))).mask(df[['date', 'eventdate']].isna().any(axis=1)))
print(df)
id date eventdate t
0 A 2020Q1 2020Q3 -2.0
1 A 2020Q2 2020Q3 -1.0
2 A 2020Q3 2020Q3 0.0
3 A 2020Q4 2020Q3 1.0
4 B 2019Q1 2019Q2 -1.0
5 B 2019Q2 2019Q2 0.0
6 B 2019Q3 2019Q2 1.0
7 B 2019Q4 2019Q2 2.0
8 C 2020Q1 NaN NaN
9 C 2020Q2 NaN NaN
10 C 2020Q3 NaN NaN
11 C 2020Q4 NaN NaN
12 D 2019Q2 NaN NaN
13 D 2019Q3 NaN NaN
14 D 2019Q4 NaN NaN