I have a large dataset of 481K rows x 184 columns. Below is an example table. I want to be able to have a sales total for each column based on a set of criteria.
Name (Criteria = All excl. Jim) = $275 sales
Position (Criteria = All exlc. Associates) = $250 sales
City (Criteria = All excl. Detroit) = $75 sales
Ideally the current 184 columns would return as a row with an associated sales value in the second column. I'm really just trying to figure out where to start with this one. Thank you for any helping or ideas you may have.
Name Position City Sales
0 Jon Owner Detroit 200.0
1 Joe Manager Philadelphia 50.0
2 Jim Associate Orlando 25.0
3 Jerry Associate Detroit 25.0
CodePudding user response:
Example:
df[df['Name']!='Jon']['Sales'].sum()
Generalized:
df[df[col_name]!=exclude_variable]['Sales'].sum()
CodePudding user response:
Try this:
df = pd.DataFrame({'Name': ['Jon', 'Joe', 'Jim', 'Jerry'],
'Position':['Owner', 'Manager', 'Associate', 'Associate'],
'City':['Detroit', 'Philadelphia', 'Orlando', 'Detroit'],
'Sales':[200.0, 50.0, 25.0, 25.0]})
def grab_total_sales(df, base_column='Name', exclude_list=['Jim']):
if base_column not in df.columns:
print(f'unknown column: {base_column} not in {list(df_columns)}')
return None
else:
df_ = df[~df[base_column].isin(exclude_list)]
return df_.Sales.sum()
return None
# print(grab_total_sales(df, base_column='Name', exclude_list=['Jim']))
# 275.0
# print(grab_total_sales(df, base_column='Position', exclude_list=['Associate']))
# 250.0
# print(grab_total_sales(df, base_column='City', exclude_list=['Detroit']))
# 75.0