Home > Blockchain >  Python Matplotlib - frequency table
Python Matplotlib - frequency table

Time:03-25

I have a table of data below: (the 1st column is date, the 2nd column is the daily return)

2020-01-02    0.022034
2020-01-03   -0.002666
2020-01-06    0.009716
2020-01-07    0.009838
2020-01-08   -0.011690
2020-01-09    0.025103
2020-01-10    0.009325
2020-01-13    0.028888
2020-01-14   -0.009183
2020-01-15    0.012292
2020-01-16   -0.005593
2020-01-17    0.020492
2020-01-20   -0.003878
2020-01-21   -0.032687
2020-01-22    0.034887
2020-01-23   -0.033485
2020-01-24    0.001934
2020-01-29   -0.026629
2020-01-30   -0.039513
2020-01-31   -0.001845
2020-02-03    0.021784
2020-02-04    0.033137
2020-02-05    0.000586
2020-02-06    0.016146
2020-02-07    0.000082
2020-02-10   -0.016997
2020-02-11    0.010172
2020-02-12    0.016836
2020-02-13    0.013530
                  ...
2022-01-31    0.031707
2022-02-04    0.028683
2022-02-07   -0.015853
2022-02-08   -0.024170
2022-02-09    0.045076
2022-02-10    0.013623
2022-02-11   -0.012259
2022-02-14   -0.023093
2022-02-15   -0.008984
2022-02-16    0.023177
2022-02-17    0.003182
2022-02-18   -0.054995
2022-02-21   -0.033302
2022-02-22   -0.028148
2022-02-23    0.012332
2022-02-24   -0.048095
2022-02-25   -0.004944
2022-02-28   -0.002682
2022-03-01    0.006940
2022-03-02    0.002542
2022-03-03   -0.006318
2022-03-04   -0.048641
2022-03-07   -0.050231
2022-03-08   -0.015469
2022-03-09    0.011477
2022-03-10   -0.002236
2022-03-11   -0.038740
2022-03-14   -0.115421
2022-03-15   -0.089573
2022-03-16    0.243084

I want to build a frequency table like below: enter image description here

I think this might involve several steps: (1) categorise daily return data into different ranges (2) use value_counts() on the ranges (3) calculate the percentage on the ranges

For the first step, I think I can try pd.cut with a groupby. However, my dataframe doesnt have a header, and I tried portret_df.columns = ['Dates','Daily Return'] but could not manage to add the header. May I ask how can I add a header there so that I can refer to the 1st and 2nd columns? Much appreciated for your help.

CodePudding user response:

Let's use pd.Series.value_counts with 'bins' parameter.

bins = [-np.inf,-.01,-.005,0,.005,.01,np.inf]
labels = ['ret < -1%', 
          '-1% < ret < -.5%', 
          '-.5% < ret < %0',
          ' %0 < ret < .5%',
          '.5% < ret < 1%',
          'reg > 1%']
df_counts = (df['ret'].value_counts(bins=bins, sort=False)
                      .rename('# of events').to_frame().set_axis(labels).T)
df_pcts = (df['ret'].value_counts(bins=bins, normalize=True, sort=False)
                    .rename('% of events').to_frame()
                    .set_axis(labels).T.mul(100).round(1))

pd.concat([df_counts, df_pcts])

Output:

             ret < -1%  -1% < ret < -.5%  -.5% < ret < %0   %0 < ret < .5%  .5% < ret < 1%  reg > 1%
# of events       20.0               4.0              6.0              5.0             4.0      20.0
% of events       33.9               6.8             10.2              8.5             6.8      33.9

CodePudding user response:

I have no idea why you have problem to change headers - maybe you read it as single column.


I have no problem to set headers when I load

df = pd.read_csv(..., names=['Date','Daily Return'])

or later

df.columns = ['Date','Daily Return']

And later I can use cut with bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val]

min_val = df['Daily Return'].min() - 1
max_val = df['Daily Return'].max()   1

regions = pd.cut(df['Daily Return'],
                 bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val],
                 labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
                 )

And calculate number of events

count = regions.value_counts(sort=False)
print(count)
ret < -1(%)            0
-1 < ret < -0.5(%)     0
-0.5 < ret < 0(%)     30
0 < ret < 0.5(%)      29
0.5 < ret < 1(%)       0
ret > 1(%)             0

And use it to calculate percentage

size = len(regions)
percentage = (count/size) * 100
print(percentage)
ret < -1(%)            0.000000
-1 < ret < -0.5(%)     0.000000
-0.5 < ret < 0(%)     50.847458
0 < ret < 0.5(%)      49.152542
0.5 < ret < 1(%)       0.000000
ret > 1(%)             0.000000

Now it would need only to put all in DataFrame to format table

results = pd.DataFrame({'# of event': count, '% of event': percentage})

print(results.T.to_string())
            ret < -1(%)  -1 < ret < -0.5(%)  -0.5 < ret < 0(%)  0 < ret < 0.5(%)  0.5 < ret < 1(%)  ret > 1(%)
# of event          0.0                 0.0          30.000000         29.000000               0.0         0.0
% of event          0.0                 0.0          50.847458         49.152542               0.0         0.0

Full working code with example data readed with io.StringIO

but you should use own method to get data

text = '''2020-01-02    0.022034
2020-01-03   -0.002666
2020-01-06    0.009716
2020-01-07    0.009838
2020-01-08   -0.011690
2020-01-09    0.025103
2020-01-10    0.009325
2020-01-13    0.028888
2020-01-14   -0.009183
2020-01-15    0.012292
2020-01-16   -0.005593
2020-01-17    0.020492
2020-01-20   -0.003878
2020-01-21   -0.032687
2020-01-22    0.034887
2020-01-23   -0.033485
2020-01-24    0.001934
2020-01-29   -0.026629
2020-01-30   -0.039513
2020-01-31   -0.001845
2020-02-03    0.021784
2020-02-04    0.033137
2020-02-05    0.000586
2020-02-06    0.016146
2020-02-07    0.000082
2020-02-10   -0.016997
2020-02-11    0.010172
2020-02-12    0.016836
2020-02-13    0.013530
2022-01-31    0.031707
2022-02-04    0.028683
2022-02-07   -0.015853
2022-02-08   -0.024170
2022-02-09    0.045076
2022-02-10    0.013623
2022-02-11   -0.012259
2022-02-14   -0.023093
2022-02-15   -0.008984
2022-02-16    0.023177
2022-02-17    0.003182
2022-02-18   -0.054995
2022-02-21   -0.033302
2022-02-22   -0.028148
2022-02-23    0.012332
2022-02-24   -0.048095
2022-02-25   -0.004944
2022-02-28   -0.002682
2022-03-01    0.006940
2022-03-02    0.002542
2022-03-03   -0.006318
2022-03-04   -0.048641
2022-03-07   -0.050231
2022-03-08   -0.015469
2022-03-09    0.011477
2022-03-10   -0.002236
2022-03-11   -0.038740
2022-03-14   -0.115421
2022-03-15   -0.089573
2022-03-16    0.243084
'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text), sep='\s ', names=['date', 'value'])
df.columns = ['Date','Daily Return']
#print(df)

min_val = df['Daily Return'].min() - 1
max_val = df['Daily Return'].max()   1

regions = pd.cut(df['Daily Return'],
                 bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val],
                 labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
                 )

count = regions.value_counts(sort=False)
print(count)

size = len(regions)
percentage = (count/size) * 100
print(percentage)

results = pd.DataFrame({'# of event': count, '% of event': percentage})

print(results.T.to_string())

EDIT:

As @tdy suggests in comment you can also use -np.inf, np.inf instead of min_val, max_val

import numpy as np

regions = pd.cut(df['Daily Return'],
                 bins=[-np.inf, -1, -0.5, 0, 0.5, 1, np.inf],
                 labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
                 )
  • Related