I have a table of data below: (the 1st column is date, the 2nd column is the daily return)
2020-01-02 0.022034
2020-01-03 -0.002666
2020-01-06 0.009716
2020-01-07 0.009838
2020-01-08 -0.011690
2020-01-09 0.025103
2020-01-10 0.009325
2020-01-13 0.028888
2020-01-14 -0.009183
2020-01-15 0.012292
2020-01-16 -0.005593
2020-01-17 0.020492
2020-01-20 -0.003878
2020-01-21 -0.032687
2020-01-22 0.034887
2020-01-23 -0.033485
2020-01-24 0.001934
2020-01-29 -0.026629
2020-01-30 -0.039513
2020-01-31 -0.001845
2020-02-03 0.021784
2020-02-04 0.033137
2020-02-05 0.000586
2020-02-06 0.016146
2020-02-07 0.000082
2020-02-10 -0.016997
2020-02-11 0.010172
2020-02-12 0.016836
2020-02-13 0.013530
...
2022-01-31 0.031707
2022-02-04 0.028683
2022-02-07 -0.015853
2022-02-08 -0.024170
2022-02-09 0.045076
2022-02-10 0.013623
2022-02-11 -0.012259
2022-02-14 -0.023093
2022-02-15 -0.008984
2022-02-16 0.023177
2022-02-17 0.003182
2022-02-18 -0.054995
2022-02-21 -0.033302
2022-02-22 -0.028148
2022-02-23 0.012332
2022-02-24 -0.048095
2022-02-25 -0.004944
2022-02-28 -0.002682
2022-03-01 0.006940
2022-03-02 0.002542
2022-03-03 -0.006318
2022-03-04 -0.048641
2022-03-07 -0.050231
2022-03-08 -0.015469
2022-03-09 0.011477
2022-03-10 -0.002236
2022-03-11 -0.038740
2022-03-14 -0.115421
2022-03-15 -0.089573
2022-03-16 0.243084
I want to build a frequency table like below:
I think this might involve several steps: (1) categorise daily return data into different ranges (2) use value_counts() on the ranges (3) calculate the percentage on the ranges
For the first step, I think I can try pd.cut
with a groupby
. However, my dataframe doesnt have a header, and I tried portret_df.columns = ['Dates','Daily Return']
but could not manage to add the header. May I ask how can I add a header there so that I can refer to the 1st and 2nd columns?
Much appreciated for your help.
CodePudding user response:
Let's use pd.Series.value_counts
with 'bins' parameter.
bins = [-np.inf,-.01,-.005,0,.005,.01,np.inf]
labels = ['ret < -1%',
'-1% < ret < -.5%',
'-.5% < ret < %0',
' %0 < ret < .5%',
'.5% < ret < 1%',
'reg > 1%']
df_counts = (df['ret'].value_counts(bins=bins, sort=False)
.rename('# of events').to_frame().set_axis(labels).T)
df_pcts = (df['ret'].value_counts(bins=bins, normalize=True, sort=False)
.rename('% of events').to_frame()
.set_axis(labels).T.mul(100).round(1))
pd.concat([df_counts, df_pcts])
Output:
ret < -1% -1% < ret < -.5% -.5% < ret < %0 %0 < ret < .5% .5% < ret < 1% reg > 1%
# of events 20.0 4.0 6.0 5.0 4.0 20.0
% of events 33.9 6.8 10.2 8.5 6.8 33.9
CodePudding user response:
I have no idea why you have problem to change headers - maybe you read it as single column.
I have no problem to set headers when I load
df = pd.read_csv(..., names=['Date','Daily Return'])
or later
df.columns = ['Date','Daily Return']
And later I can use cut
with bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val]
min_val = df['Daily Return'].min() - 1
max_val = df['Daily Return'].max() 1
regions = pd.cut(df['Daily Return'],
bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val],
labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
)
And calculate number of events
count = regions.value_counts(sort=False)
print(count)
ret < -1(%) 0
-1 < ret < -0.5(%) 0
-0.5 < ret < 0(%) 30
0 < ret < 0.5(%) 29
0.5 < ret < 1(%) 0
ret > 1(%) 0
And use it to calculate percentage
size = len(regions)
percentage = (count/size) * 100
print(percentage)
ret < -1(%) 0.000000
-1 < ret < -0.5(%) 0.000000
-0.5 < ret < 0(%) 50.847458
0 < ret < 0.5(%) 49.152542
0.5 < ret < 1(%) 0.000000
ret > 1(%) 0.000000
Now it would need only to put all in DataFrame to format table
results = pd.DataFrame({'# of event': count, '% of event': percentage})
print(results.T.to_string())
ret < -1(%) -1 < ret < -0.5(%) -0.5 < ret < 0(%) 0 < ret < 0.5(%) 0.5 < ret < 1(%) ret > 1(%)
# of event 0.0 0.0 30.000000 29.000000 0.0 0.0
% of event 0.0 0.0 50.847458 49.152542 0.0 0.0
Full working code with example data readed with io.StringIO
but you should use own method to get data
text = '''2020-01-02 0.022034
2020-01-03 -0.002666
2020-01-06 0.009716
2020-01-07 0.009838
2020-01-08 -0.011690
2020-01-09 0.025103
2020-01-10 0.009325
2020-01-13 0.028888
2020-01-14 -0.009183
2020-01-15 0.012292
2020-01-16 -0.005593
2020-01-17 0.020492
2020-01-20 -0.003878
2020-01-21 -0.032687
2020-01-22 0.034887
2020-01-23 -0.033485
2020-01-24 0.001934
2020-01-29 -0.026629
2020-01-30 -0.039513
2020-01-31 -0.001845
2020-02-03 0.021784
2020-02-04 0.033137
2020-02-05 0.000586
2020-02-06 0.016146
2020-02-07 0.000082
2020-02-10 -0.016997
2020-02-11 0.010172
2020-02-12 0.016836
2020-02-13 0.013530
2022-01-31 0.031707
2022-02-04 0.028683
2022-02-07 -0.015853
2022-02-08 -0.024170
2022-02-09 0.045076
2022-02-10 0.013623
2022-02-11 -0.012259
2022-02-14 -0.023093
2022-02-15 -0.008984
2022-02-16 0.023177
2022-02-17 0.003182
2022-02-18 -0.054995
2022-02-21 -0.033302
2022-02-22 -0.028148
2022-02-23 0.012332
2022-02-24 -0.048095
2022-02-25 -0.004944
2022-02-28 -0.002682
2022-03-01 0.006940
2022-03-02 0.002542
2022-03-03 -0.006318
2022-03-04 -0.048641
2022-03-07 -0.050231
2022-03-08 -0.015469
2022-03-09 0.011477
2022-03-10 -0.002236
2022-03-11 -0.038740
2022-03-14 -0.115421
2022-03-15 -0.089573
2022-03-16 0.243084
'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text), sep='\s ', names=['date', 'value'])
df.columns = ['Date','Daily Return']
#print(df)
min_val = df['Daily Return'].min() - 1
max_val = df['Daily Return'].max() 1
regions = pd.cut(df['Daily Return'],
bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val],
labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
)
count = regions.value_counts(sort=False)
print(count)
size = len(regions)
percentage = (count/size) * 100
print(percentage)
results = pd.DataFrame({'# of event': count, '% of event': percentage})
print(results.T.to_string())
EDIT:
As @tdy suggests in comment you can also use -np.inf
, np.inf
instead of min_val
, max_val
import numpy as np
regions = pd.cut(df['Daily Return'],
bins=[-np.inf, -1, -0.5, 0, 0.5, 1, np.inf],
labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
)