I have a dataset created by merging two DataFrames on the "NAME" column, and now I have a larger dataset. To finish the DF, I want to apply some logic to clean it up.
Requirements: for each unique 'NAME' I want to keep the row with the highest Sales value; if every Sales value for that name is less than 10, fall back to the Calls column and keep the row with the highest Calls; and if all values in 'Calls' are also less than 10, fall back to the Target column and keep the row with the highest Target. No rows are summed.
Here's my DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 84 170 265
1 OFFICE 1 2222278 26 103 287
2 OFFICE 1 2222278 97 167 288
3 OFFICE 2 2222289 7 167 288
4 OFFICE 2 2222289 3 130 295
5 OFFICE 2 2222289 9 195 257
6 OFFICE 3 1111111 1 2 286
7 OFFICE 3 1111111 5 2 287
8 OFFICE 3 1111112 9 7 230
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
Here's what I want to show in the final DF:
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
0 OFFICE 1 2222277 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
I was thinking of solving this with df.iterrows().
Here's what I've tried:

for n, v in df.iterrows():
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target = df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]
However, I keep getting KeyError: False errors. Any thoughts on what I'm doing wrong?
CodePudding user response:
This is not optimized, but it should meet your needs. The snippet below sends each NAME
group to eval_group()
, which checks the index of the highest value in each column until the Sales, Calls, Target criterion is met.
If you wanted to optimize, you could apply vectorization or parallelism to eval_group
so it is evaluated against all groups at once instead of sequentially.
A couple of notes: this returns the first row when there is a tie (i.e. multiple records share the maximum during the idxmax()
call). Also, I believe the first row of your desired output should be OFFICE 1
at row 2, not row 0.
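The tie-breaking behavior follows directly from how idxmax() works, which a tiny example makes concrete:

```python
import pandas as pd

# idxmax() returns the index label of the *first* occurrence of the maximum,
# so ties are resolved in favor of the earlier row.
s = pd.Series([5, 9, 9, 3], index=[10, 11, 12, 13])
print(s.idxmax())  # 11, not 12
```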
import pandas as pd

df = pd.read_csv('./data.txt')

def eval_group(df, keys):
    # walk the columns in priority order; fall through when all values < 10
    for key in keys:
        row_id = df[key].idxmax()
        if df.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

keys = ['Sales', 'Calls', 'Target']
row_ids = []
for name in df['NAME'].unique():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]
df
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
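As a sketch of the optimization mentioned above, the same eval_group logic can be dispatched through groupby().apply() instead of an explicit Python loop over names (illustrated here with a small made-up subset of the data, since I don't have your data.txt):

```python
import pandas as pd

df = pd.DataFrame({
    'NAME':  ['OFFICE 1', 'OFFICE 1', 'OFFICE 2', 'OFFICE 2'],
    'Sales': [84, 97, 7, 9],
    'Calls': [170, 167, 167, 195],
    'Target': [265, 288, 288, 257],
})

keys = ['Sales', 'Calls', 'Target']

def eval_group(group):
    # same priority walk as before: Sales, then Calls, then Target
    for key in keys:
        row_id = group[key].idxmax()
        if group.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

# one call per group via apply, instead of filtering df once per name
row_ids = df.groupby('NAME', sort=False).apply(eval_group)
result = df.loc[row_ids]
```

This is still sequential under the hood, but it removes the repeated boolean-mask filtering per name.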
CodePudding user response:
This takes a couple of steps: build intermediate dataframes, set up the conditions, and filter based on the result of those conditions:
import numpy as np

# per NAME, the index of the row holding each column's maximum
temp = (df
        .drop(columns='CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort=False)
        .idxmax()
        )
# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)
# groupby for each NAME: True where *all* rows in that column are less than 10
bools = bools.groupby(df.NAME, sort=False).all()
# conditions buildup
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# you might have to figure out what to use for default
indices = np.select(condlist, choicelist, default=temp.Sales)
# get matching rows
df.loc[indices]
NAME CUSTOMER_SUPPLIER_NUMBER Sales Calls Target
2 OFFICE 1 2222278 97 167 288
5 OFFICE 2 2222289 9 195 257
7 OFFICE 3 1111111 5 2 287
9 OFFICE 4 1111171 95 193 299
10 OFFICE 5 1111191 9 193 298
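The Sales → Calls → Target priority is encoded by np.select, which evaluates condlist in order and picks the choice for the first condition that is True at each position; a minimal illustration:

```python
import numpy as np

cond = [np.array([True, False, False]),
        np.array([True, True, False])]
choice = [np.array([10, 11, 12]),
          np.array([20, 21, 22])]
# first True condition per position wins; default fills where none match
out = np.select(cond, choice, default=-1)
print(out)  # [10 21 -1]
```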