How to calculate average of all rows by excluding a few columns based on a condition-CodePudding

I have a tricky data cleaning requirement using pandas. I have a following dataframe with just binary value. Each row refers a customer's response to survey questions. Below df has two questions (q1 & q2) and their response is denoted by '_' sign. For example, customer, aasww, provided 'a' as response to q1 and 'b' as response to q2. There could be cases where a customer could skip a question which is denoted by 0 (for example, customer asdsd skipped q1)

Customer_id	q1_a	q1_b	q1_0	q2_a	q2_b	q2_c	q2_0
asdsd	0	0	1	0	0	1	0
aasww	1	0	0	0	1	0	0
aaswe	0	1	0	0	0	0	1
aaswt	0	1	0	1	0	0	0

Now, I want to calculate average of all customers for each column by excluding columns where it has '_0'. But the tricky part here is, I want to calculate average for only among customers who answered that SPECIFIC question. For example, in the above dataframe, customer asdsd skipped q1 so while calculating average that customer should be ignored for Q1. Here is the expected output,

Customer_id	q1_a	q1_b	q2_a	q2_b	q2_c
asdsd	0	0	0	0	1
aasww	1	0	0	1	0
aaswe	0	1	0	0	0
aaswt	0	1	1	0	0
AVERAGE	33.3%	66.7%	33.3%	33.3%	33.3%

Also, I have 100s of columns like this (too many questions in the survey) so it would be great if we can have a solution that kind of loops through it

CodePudding user response：

Here's one way using melt. The idea is, we filter the relevant part of the DataFrame and melt it; then filter out the customers with all 0 answers to a question (build a filter transforming all by groupby.transform); then for the remaining customers, find the mean response using groupby.mean. Finally, assign these means back to an out DataFrame:

out = df.loc[:, ~df.columns.str.endswith('_0')].copy()
df1 = out.melt('Customer_id')
df1 = df1.join(df1.pop('variable').str.split('_', expand=True))
percentages = (df1.loc[~df1['value'].eq(0).groupby([df1['Customer_id'], df1[0]]).transform('all')]
               .groupby([ 0, 1])['value'].mean() * 100)
percentages.index = percentages.index.map('_'.join)
out.loc[len(out)] = percentages
out.loc[len(out)-1, 'Customer_id'] = 'Average'

Output:

  Customer_id       q1_a       q1_b       q2_a       q2_b       q2_c
0       asdsd   0.000000   0.000000   0.000000   0.000000   1.000000
1       aasww   1.000000   0.000000   0.000000   1.000000   0.000000
2       aaswe   0.000000   1.000000   0.000000   0.000000   0.000000
3       aaswt   0.000000   1.000000   1.000000   0.000000   0.000000
4     Average  33.333333  66.666667  33.333333  33.333333  33.333333

CodePudding user response：

Get the number of customers that answered each question
Divide the sum of each column by the number that answered

#set customer_id as the index
df = df.set_index("Customer_id")

#filter columns that do not end with _0
answered = df.filter(regex="^(?!.*_0$)")

#map the column names to the number of customers that answered
counter = df.columns.str.split("_").str[0].map(answered.groupby(answered.columns.str.split("_").str[0],axis=1).sum().sum())

#compute averages
df.loc["Average"] = df.sum().div(counter)

#keep only answered columns in the output
output = df[answered.columns]

>>> df
                 q1_a      q1_b      q2_a      q2_b      q2_c
Customer_id                                                  
asdsd        0.000000  0.000000  0.000000  0.000000  1.000000
aasww        1.000000  0.000000  0.000000  1.000000  0.000000
aaswe        0.000000  1.000000  0.000000  0.000000  0.000000
aaswt        0.000000  1.000000  1.000000  0.000000  0.000000
Average      0.333333  0.666667  0.333333  0.333333  0.333333