I'm trying to aggregate feedback scores by touch-point. For example, the code below creates a data frame of individuals and their feedback scores for questions that refer to particular yearly touch-points.
import pandas as pd
import numpy as np
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
dummydf = pd.DataFrame({'ID': [2,15,32,4,9,12,16,10,3,7],
                        '1-year feedback qs A': [3,2,3,4,3,2,1,3,4,5],
                        '1-year feedback qs B': [1,1,2,4,np.nan,3,3,3,2,5],
                        '2-year feedback qs A': [2,2,3,4,3,5,3,2,2,4],
                        '2-year feedback qs B': [2,3,3,3,4,5,3,np.nan,5,5],
                        'Gender': [0,0,0,1,0,1,1,0,0,1],
                        'Location': ['py','py','py','va','jk','ce','ce','va','jk','jk']})
print(dummydf)
For each ID I need to combine the 1-year question scores into a single mean score, do the same for the 2-year questions, and so on, while keeping the remaining variables intact. What is the best way to achieve this?
What I tried:
groups = dummydf.groupby(by=['ID'])
groups.apply(lambda g: g[g.filter(regex='1-') == g.filter(regex='1-').mean()])
This does not give me the desired result.
CodePudding user response:
Since each ID is unique, you don't need to group at all. You can take the row-wise mean of each year's columns:
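# range(1, 3) covers both the 1-year and 2-year touch-points
for i in range(1, 3):
    # match on the column prefix so unrelated columns containing the digit are not picked up
    dummydf['mean_year_' + str(i)] = dummydf[[x for x in dummydf.columns if x.startswith(str(i) + '-year')]].mean(axis=1)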
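If more touch-points are added later, the year numbers can be read from the column names instead of being hard-coded. Below is a minimal sketch of that idea, assuming every feedback column starts with a "<n>-year" prefix as in the question; the `year_cols` dictionary is a name introduced here for illustration only.

```python
import re

import numpy as np
import pandas as pd

dummydf = pd.DataFrame({'ID': [2, 15, 32, 4, 9, 12, 16, 10, 3, 7],
                        '1-year feedback qs A': [3, 2, 3, 4, 3, 2, 1, 3, 4, 5],
                        '1-year feedback qs B': [1, 1, 2, 4, np.nan, 3, 3, 3, 2, 5],
                        '2-year feedback qs A': [2, 2, 3, 4, 3, 5, 3, 2, 2, 4],
                        '2-year feedback qs B': [2, 3, 3, 3, 4, 5, 3, np.nan, 5, 5],
                        'Gender': [0, 0, 0, 1, 0, 1, 1, 0, 0, 1],
                        'Location': ['py', 'py', 'py', 'va', 'jk', 'ce', 'ce', 'va', 'jk', 'jk']})

# Map each touch-point year to its question columns (assumes a "<n>-year" prefix).
year_cols = {}
for col in dummydf.columns:
    match = re.match(r'(\d+)-year', col)
    if match:
        year_cols.setdefault(int(match.group(1)), []).append(col)

# Row-wise mean per touch-point; mean() skips NaN values by default.
for year, cols in sorted(year_cols.items()):
    dummydf[f'mean_year_{year}'] = dummydf[cols].mean(axis=1)

print(dummydf[['ID', 'mean_year_1', 'mean_year_2', 'Gender', 'Location']])
```

Because the means are added as new columns, ID, Gender and Location stay untouched. A row with one missing answer gets the mean of the answers that are present; pass skipna=False to mean() if such rows should come out as NaN instead.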