I have a tricky data cleaning requirement using pandas. I have a following dataframe with just binary value. Each row refers a customer's response to survey questions. Below df has two questions (q1 & q2) and their response is denoted by '_' sign. For example, customer, aasww, provided 'a' as response to q1 and 'b' as response to q2. There could be cases where a customer could skip a question which is denoted by 0 (for example, customer asdsd skipped q1)
Customer_id | q1_a | q1_b | q1_0 | q2_a | q2_b | q2_c | q2_0 |
---|---|---|---|---|---|---|---|
asdsd | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
aasww | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
aaswe | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
aaswt | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
Now, I want to calculate average of all customers for each column by excluding columns where it has '_0'. But the tricky part here is, I want to calculate average for only among customers who answered that SPECIFIC question. For example, in the above dataframe, customer asdsd skipped q1 so while calculating average that customer should be ignored for Q1. Here is the expected output,
Customer_id | q1_a | q1_b | q2_a | q2_b | q2_c |
---|---|---|---|---|---|
asdsd | 0 | 0 | 0 | 0 | 1 |
aasww | 1 | 0 | 0 | 1 | 0 |
aaswe | 0 | 1 | 0 | 0 | 0 |
aaswt | 0 | 1 | 1 | 0 | 0 |
AVERAGE | 33.3% | 66.7% | 33.3% | 33.3% | 33.3% |
Also, I have 100s of columns like this (too many questions in the survey) so it would be great if we can have a solution that kind of loops through it
CodePudding user response:
Here's one way using melt
. The idea is, we filter the relevant part of the DataFrame and melt
it; then filter out the customers with all 0 answers to a question (build a filter transforming all
by groupby.transform
); then for the remaining customers, find the mean
response using groupby.mean
. Finally, assign these means back to an out
DataFrame:
out = df.loc[:, ~df.columns.str.endswith('_0')].copy()
df1 = out.melt('Customer_id')
df1 = df1.join(df1.pop('variable').str.split('_', expand=True))
percentages = (df1.loc[~df1['value'].eq(0).groupby([df1['Customer_id'], df1[0]]).transform('all')]
.groupby([ 0, 1])['value'].mean() * 100)
percentages.index = percentages.index.map('_'.join)
out.loc[len(out)] = percentages
out.loc[len(out)-1, 'Customer_id'] = 'Average'
Output:
Customer_id q1_a q1_b q2_a q2_b q2_c
0 asdsd 0.000000 0.000000 0.000000 0.000000 1.000000
1 aasww 1.000000 0.000000 0.000000 1.000000 0.000000
2 aaswe 0.000000 1.000000 0.000000 0.000000 0.000000
3 aaswt 0.000000 1.000000 1.000000 0.000000 0.000000
4 Average 33.333333 66.666667 33.333333 33.333333 33.333333
CodePudding user response:
- Get the number of customers that answered each question
- Divide the sum of each column by the number that answered
#set customer_id as the index
df = df.set_index("Customer_id")
#filter columns that do not end with _0
answered = df.filter(regex="^(?!.*_0$)")
#map the column names to the number of customers that answered
counter = df.columns.str.split("_").str[0].map(answered.groupby(answered.columns.str.split("_").str[0],axis=1).sum().sum())
#compute averages
df.loc["Average"] = df.sum().div(counter)
#keep only answered columns in the output
output = df[answered.columns]
>>> df
q1_a q1_b q2_a q2_b q2_c
Customer_id
asdsd 0.000000 0.000000 0.000000 0.000000 1.000000
aasww 1.000000 0.000000 0.000000 1.000000 0.000000
aaswe 0.000000 1.000000 0.000000 0.000000 0.000000
aaswt 0.000000 1.000000 1.000000 0.000000 0.000000
Average 0.333333 0.666667 0.333333 0.333333 0.333333