I have a dataset where in I have to identify if the Sale value of a Toy is greater than the average in the column and count in how many different sale areas, value is greater than the average.
For Ex: find the average of column "Sale B"
- 2.5, check in how many rows the value is greater than 2.5 and then perform the same exercise for "SaleA"
and "SaleC"
and then add all of it
input_data = pd.DataFrame({'Toy': ['A','B','C','D'],
'Color': ['Red','Green','Orange','Blue'],
'SaleA': [1,2,0,1],
'SaleB': [1,3,4,2],
'SaleC': [5,2,3,5]})
New column "Count_Sale_Average"
is created, For ex: for toy "A" only at one position the sale was greater than average.
output_data = pd.DataFrame({'Toy': ['A','B','C','D'],
'Color': ['Red','Green','Orange','Blue'],
'SaleA': [1,2,0,1],
'SaleB': [1,3,4,2],
'SaleC': [5,2,3,5],
'Count_Sale_Average':[1,2,1,1]})
My code is working and giving the desired output. Any suggestions on other ways of doing it, may be more efficient and in less number of lines.
list_var = ['SaleA','SaleB','SaleC']
df = input_data[list_var]
for i in range(0,len(list_var)):
var = list_var[i]
mean_var = df[var].mean()
df[var] = df[var].apply(lambda x: 1 if x > mean_var else 0)
df['Count_Sale_Average'] = df[list_var].sum(axis=1)
output_data = pd.concat([input_data, df[['Count_Sale_Average']]], axis=1)
output_data
CodePudding user response:
You could filter
, find mean
and sum
on axis:
filtered = input_data.filter(like='Sale')
input_data['Count_Sale_Average'] = filtered.gt(filtered.mean()).sum(axis=1)
Output:
Toy Color SaleA SaleB SaleC Count_Sale_Average
0 A Red 1 1 5 1
1 B Green 2 3 2 2
2 C Orange 0 4 3 1
3 D Blue 1 2 5 1