Home > Enterprise >  Calculate statistics based on values from multiple columns
Calculate statistics based on values from multiple columns

Time:03-11

I have a dataframe like as shown below

ID_1,value_1,value_2,ID_2         
1,21,0,A1
1,31,5,A2
1,0,0,A3
1,21,0,A4
1,21,21,A5
2,202,0,B1
2,310,159,B2
2,0,0,B3
2,201,310,B4
2,210,214,B5
2,178,190,B6

tdf = pd.read_clipboard(sep=',')

I would like to compute the mean, median and mode for each of the ID_1

However, I would like to exclude rows which has zeroes for both value_1 AND value_2

For ex: We will ignore/filter out row index 2 and 7.

Once, we ignore rows with zero values, we compute mean, median and mode based on values from value_1 and value_2 for each ID_1. Meaning, we concatenate the values in both columns value_1 and value_2 and then compute the statistics

I was trying something like belowm but it is incorrect

tdf = tdf[~(tdf['value_1']==0) & (tdf['value_2']==0)]  #this is incorrect logic
t1_count = tdf.groupby('ID')['value_1'].size().reset_index()
t1_values = tdf.groupby('ID')['value_1'].agg(list)

I expect my output to be like as shown below

    ID_1    mean    mode    median
0   1   18.8    21  21.0
1   2   183.5   310 201.5

CodePudding user response:

If need remove rows with both 0 use:

tdf = tdf[~((tdf['value_1']==0) & (tdf['value_2']==0))]

Or:

tdf = tdf[(tdf['value_1']!=0) | (tdf['value_2']!=0)]

Or:

tdf = tdf[tdf[['value_1','value_1']].ne(0).all(axis=1)]

And then:

df = (tdf.melt(id_vars='ID_1', 
              value_vars=['value_1','value_2'])
          .groupby('ID_1')['value'].agg(['mean','median',lambda x: x.mode().iat[0]])
          .rename(columns={'<lambda_0>':'mode'})
          .reset_index())
print (df)
   ID_1   mean  median  mode
0     1   15.0    21.0    21
1     2  197.4   201.5   310
  • Related