Home > Software engineering >  With pandas cut function how to list the missing bin values
With pandas cut function how to list the missing bin values

Time:12-29

I have a requirement where I need to group report execution counts. The input data looks like below.

REPORT,TIME_SEC
Report1,1
Report1,5
Report3,4
Report2,158
Report2,20
Report3,131

I need to group the reports and show the count of execution for each of the following time ranges '0-10sec','10-30sec','30-60sec','1-2min','>2min'.

My code is like below.

import pandas as pd

infile = "/Users/user1/data1.csv"
time_column = "TIME_SEC"
range_column = "TIME_RANGE"
groupby_column = "REPORT"
df = pd.read_csv(infile)
bin1 = [0,10,30,60,120,7200]
label1 = ['0-10sec','10-30sec','30-60sec','1-2min','>2min']
# print(f"df=\n{df}")
df[range_column] = pd.cut(df[time_column], bins=bin1, labels=label1, include_lowest=True)
print(f"df=\n{df}")

df_final = pd.crosstab(df[groupby_column], df[range_column])
df_final.columns = df_final.columns.astype(str)
df_final.reset_index(inplace=True)
print(f"df_final=\n{df_final}")

The output is like below.

df=
    REPORT  TIME_SEC TIME_RANGE
0  Report1         1    0-10sec
1  Report1         5    0-10sec
2  Report3         4    0-10sec
3  Report2       158      >2min
4  Report2        20   10-30sec
5  Report3       131      >2min
df_final=
TIME_RANGE   REPORT  0-10sec  10-30sec  >2min
0           Report1        2         0      0
1           Report2        0         1      1
2           Report3        1         0      1
bins = [   0   10   30   60  120 7200]

Now while I do pd.cut it doesn't list the ranges for which there's no data. Hence for the above example it doesn't list the ranges '30-60sec','1-2min', as there's no value for these ranges.

My requirement is to print the ranges '30-60sec','1-2min' (for which no values are present) also in the final output. How can I get the '30-60sec','1-2min' as part of pd.cut OR any another way so that when I do pd.crosstab then value '0' is printed for the ranges '30-60sec','1-2min'.

I'm expecting the df_final to be like below.

TIME_RANGE   REPORT  0-10sec  10-30sec  30-60sec  1-2min  >2min
0           Report1        2         0         0       0      0
1           Report2        0         1         0       0      1
2           Report3        1         0         0       0      1

CodePudding user response:

You can reindex the dataframe as your last step (with your list label1):

df = df.reindex(label1, fill_value=0, axis=1)
print(df)

Prints:

TIME_RANGE  0-10sec  10-30sec  30-60sec  1-2min  >2min
0                 2         0         0       0      0
1                 0         1         0       0      1
2                 1         0         0       0      1
  • Related