I have a dataframe like as shown below
ID_1,value_1,value_2,ID_2
1,21,0,A1
1,31,5,A2
1,0,0,A3
1,21,0,A4
1,21,21,A5
2,202,0,B1
2,310,159,B2
2,0,0,B3
2,201,310,B4
2,210,214,B5
2,178,190,B6
tdf = pd.read_clipboard(sep=',')
I would like to compute the mean, median and mode for each of the ID_1
However, I would like to exclude rows which has zeroes for both value_1
AND value_2
For ex: We will ignore/filter out row index 2 and 7.
Once, we ignore rows with zero values, we compute mean
, median
and mode
based on values from value_1
and value_2
for each ID_1
. Meaning, we concatenate the values in both columns value_1
and value_2
and then compute the statistics
I was trying something like belowm but it is incorrect
tdf = tdf[~(tdf['value_1']==0) & (tdf['value_2']==0)] #this is incorrect logic
t1_count = tdf.groupby('ID')['value_1'].size().reset_index()
t1_values = tdf.groupby('ID')['value_1'].agg(list)
I expect my output to be like as shown below
ID_1 mean mode median
0 1 18.8 21 21.0
1 2 183.5 310 201.5
CodePudding user response:
If need remove rows with both 0
use:
tdf = tdf[~((tdf['value_1']==0) & (tdf['value_2']==0))]
Or:
tdf = tdf[(tdf['value_1']!=0) | (tdf['value_2']!=0)]
Or:
tdf = tdf[tdf[['value_1','value_1']].ne(0).all(axis=1)]
And then:
df = (tdf.melt(id_vars='ID_1',
value_vars=['value_1','value_2'])
.groupby('ID_1')['value'].agg(['mean','median',lambda x: x.mode().iat[0]])
.rename(columns={'<lambda_0>':'mode'})
.reset_index())
print (df)
ID_1 mean median mode
0 1 15.0 21.0 21
1 2 197.4 201.5 310