Create new dataframe which summarizes the number of occurrences for different numeric ranges of some

Time:05-19

I have a dataframe df, as shown in the first photo. Green rows are the ones where column Condition1 is 'Yes'. Yellow rows are the ones where BOTH columns Condition1 and Condition2 are 'Yes'.

Question: The second photo shows the desired output, a new dataframe in which:

(1). Column 'Count_Condition1' contains the number of times a green row occurs, and column 'Count_Condition_1&2' contains the number of times a yellow row occurs.

(2). In addition to (1), I would like to split the column 'Vol' into different ranges (see second photo) and have the counts displayed per range.

For example, there are 3 instances where Condition1 = 'Yes' and Vol is between 0.2 and 0.2999.

[First photo: dataframe df]

[Second photo: desired output]

CodePudding user response:

Use pd.cut to bin the Vol column, create boolean helper columns that flag the 'Yes' values, then aggregate with sum to count the True values per bin:

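import numpy as np
import pandas as pd

# Bin edges; with right=False each bin includes its left edge only
bins = [-np.inf, 0.2, 0.3, 0.4, 0.5, np.inf]

# Build labels such as '<=0.1999', '0.2 to 0.2999', ..., '>=0.5'
labels = [f'{a} to {round(b - 0.0001, 4)}'.replace('-inf to ', '<=').replace(' to inf', '')
          for a, b in zip(bins, bins[1:])]
labels[-1] = '>=' + labels[-1]

# Boolean masks for the two conditions
s1 = df['Condition1'].eq('Yes')
s2 = df['Condition2'].eq('Yes')
g = pd.cut(df['Vol'], bins=bins, right=False, labels=labels)

# Summing booleans counts the True values in each bin;
# observed=False keeps bins that contain no rows
df1 = (df.assign(Count_Condition1=s1, Count_Condition_1_2=s1 & s2)
         .groupby(g, observed=False)[['Count_Condition1', 'Count_Condition_1_2']]
         .sum())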
print (df1)
               Count_Condition1  Count_Condition_1_2
Vol                                                 
<=0.1999                      2                    1
0.2 to 0.2999                 3                    1
0.3 to 0.3999                 1                    1
0.4 to 0.4999                 1                    0
>=0.5                         2                    2
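
Since the original df is only available as an image, here is a minimal self-contained sketch of the same idea on made-up data. The six sample rows are hypothetical and are not the asker's data; default interval labels are used instead of the custom ones so the bin boundaries stay visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (NOT the dataframe from the question's photo)
df = pd.DataFrame({
    "Vol": [0.05, 0.25, 0.2, 0.35, 0.6, 0.45],
    "Condition1": ["Yes", "Yes", "No", "Yes", "Yes", "No"],
    "Condition2": ["No", "Yes", "Yes", "No", "Yes", "No"],
})

bins = [-np.inf, 0.2, 0.3, 0.4, 0.5, np.inf]

# right=False closes each bin on the left: [0.2, 0.3), [0.3, 0.4), ...
# so a Vol of exactly 0.2 lands in the second bin, not the first
g = pd.cut(df["Vol"], bins=bins, right=False)

# Boolean masks; True is counted as 1 when summed
s1 = df["Condition1"].eq("Yes")
s2 = df["Condition2"].eq("Yes")

counts = (
    pd.DataFrame({"Count_Condition1": s1, "Count_Condition_1_2": s1 & s2})
    .groupby(g, observed=False)  # keep bins that have no rows
    .sum()
)
print(counts)
```

With this sample, the bin [0.4, 0.5) still appears with zero counts, because the grouper is categorical and observed=False keeps unused categories.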