I have the following - already sorted - pandas dataframe:
instrumentExtId Date proxyMethod isForceXS xsValue curveValue
.ID1 2008-03-28 00:00:00 CrossSectional FALSE 6.86046681 6.86046681
.ID1 2008-03-31 00:00:00 CrossSectional FALSE 6.97468855 6.97468855
.ID1 2008-04-01 00:00:00 CrossSectional FALSE 6.83893432 6.83893432
.ID1 2008-04-02 00:00:00 CrossSectional FALSE 6.70250452 6.70250452
.ID2 2008-03-28 00:00:00 CrossSectional FALSE 3.10441877 3.10441877
.ID2 2008-03-31 00:00:00 CrossSectional FALSE 3.5104612 3.5104612
.ID2 2008-04-01 00:00:00 CrossSectional FALSE 3.52994089 3.52994089
.ID2 2008-04-02 00:00:00 CrossSectional FALSE 3.24236585 3.24236585
For each ID and for each date, I want to apply the inverse hyperbolic sine function (np.arcsinh
) on columns "xsValue" and "curveValue" and then, calculate the date to date difference for each ID. I want columns "proxyMethod" and "isForceXS" to be preserved.
I wrote the following code, but it seems that for the first row of the second ID, the difference is calculated between the first observation of the second ID and the last observation of the first ID. I was expecting a nan
there (i.e. what I would like to see). What do I miss?
df = df.groupby(['instrumentExtId', 'Date', 'proxyMethod', 'isForceXS'], group_keys = True)[["xsValue","curveValue"]].\
apply(lambda x:np.arcsinh(x)). \
diff(). \
reset_index().\
drop(["level_4"], axis=1)
instrumentExtId Date proxyMethod isForceXS xsValue curveValue
.ID1 2008-03-28 00:00:00 CrossSectional FALSE nan nan
.ID1 2008-03-31 00:00:00 CrossSectional FALSE 0.016342295 0.016342295
.ID1 2008-04-01 00:00:00 CrossSectional FALSE -0.01945289 -0.01945289
.ID1 2008-04-02 00:00:00 CrossSectional FALSE -0.019934368 -0.019934368
.ID2 2008-03-28 00:00:00 CrossSectional FALSE -0.750188187 -0.75018818
.ID2 2008-03-31 00:00:00 CrossSectional FALSE 0.117631041 0.117631041
.ID2 2008-04-01 00:00:00 CrossSectional FALSE 0.005323083 0.005323083
.ID2 2008-04-02 00:00:00 CrossSectional FALSE -0.081488875 -0.081488875
CodePudding user response:
I think if you break apart that line into separate steps and print the output after each step, you'll figure out what's going wrong.
If I understand your intention correctly:
- It sounds like you should apply
archsinh
BEFORE grouping. - You should include only
instrumentExtId
in thegroupby
, nothing else.
I think this code will achieve what you want, unless I misunderstand your goal:
from io import StringIO
import pandas as pd
s = """\
instrumentExtId Date proxyMethod isForceXS xsValue curveValue
.ID1 2008-03-28 00:00:00 CrossSectional FALSE 6.86046681 6.86046681
.ID1 2008-03-31 00:00:00 CrossSectional FALSE 6.97468855 6.97468855
.ID1 2008-04-01 00:00:00 CrossSectional FALSE 6.83893432 6.83893432
.ID1 2008-04-02 00:00:00 CrossSectional FALSE 6.70250452 6.70250452
.ID2 2008-03-28 00:00:00 CrossSectional FALSE 3.10441877 3.10441877
.ID2 2008-03-31 00:00:00 CrossSectional FALSE 3.5104612 3.5104612
.ID2 2008-04-01 00:00:00 CrossSectional FALSE 3.52994089 3.52994089
.ID2 2008-04-02 00:00:00 CrossSectional FALSE 3.24236585 3.24236585
"""
df = pd.read_csv(StringIO(s), sep='\s\s ', engine='python')
df = df.sort_values(['instrumentExtId', 'Date'])
df['xs'] = np.arcsinh(df['xsValue'])
df['curve'] = np.arcsinh(df['curveValue'])
df[['xs_diff', 'curve_diff']] = df.groupby('instrumentExtId')[['xs', 'curve']].diff()
print(df)
instrumentExtId Date proxyMethod isForceXS xsValue curveValue xs curve xs_diff curve_diff
0 .ID1 2008-03-28 00:00:00 CrossSectional False 6.860467 6.860467 2.624193 2.624193 NaN NaN
1 .ID1 2008-03-31 00:00:00 CrossSectional False 6.974689 6.974689 2.640535 2.640535 0.016342 0.016342
2 .ID1 2008-04-01 00:00:00 CrossSectional False 6.838934 6.838934 2.621082 2.621082 -0.019453 -0.019453
3 .ID1 2008-04-02 00:00:00 CrossSectional False 6.702505 6.702505 2.601148 2.601148 -0.019934 -0.019934
4 .ID2 2008-03-28 00:00:00 CrossSectional False 3.104419 3.104419 1.850959 1.850959 NaN NaN
5 .ID2 2008-03-31 00:00:00 CrossSectional False 3.510461 3.510461 1.968590 1.968590 0.117631 0.117631
6 .ID2 2008-04-01 00:00:00 CrossSectional False 3.529941 3.529941 1.973914 1.973914 0.005323 0.005323
7 .ID2 2008-04-02 00:00:00 CrossSectional False 3.242366 3.242366 1.892425 1.892425 -0.081489 -0.081489
PS -- In your example data, xsValue
and curveValue
are identical. Was that intentional?