Compare data series columns based on a column that contains duplicates


I have a dataset that I created by merging two DataFrames on the "NAME" column, so I now have a larger dataset. To finish the DF, I want to apply some logic to clean it up.

Requirements: I want to select one row per unique 'NAME', keeping the row with the highest Sales. If, after going through the Sales column, all of a name's Sales values are less than 10, then move to the Calls column and keep the row with the highest Calls; if all of its Calls are also less than 10, then move to the Target column and keep the row with the highest Target. No rows are summed. For example, for OFFICE 2 all Sales values (7, 3, 9) are below 10, so the row with the highest Calls (row 5, with 195) is kept.

Here's my DF:

        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
0   OFFICE 1                   2222277     84    170     265
1   OFFICE 1                   2222278     26    103     287
2   OFFICE 1                   2222278     97    167     288
3   OFFICE 2                   2222289      7    167     288
4   OFFICE 2                   2222289      3    130     295
5   OFFICE 2                   2222289      9    195     257
6   OFFICE 3                   1111111      1      2     286
7   OFFICE 3                   1111111      5      2     287
8   OFFICE 3                   1111112      9      7     230
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298
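
For reference, the frame can be rebuilt directly so the snippets below are reproducible (a minimal sketch, with the values transcribed from the table above):

import pandas as pd

# Rebuild the posted frame; values transcribed from the table above.
df = pd.DataFrame({
    'NAME': ['OFFICE 1'] * 3 + ['OFFICE 2'] * 3 + ['OFFICE 3'] * 3
            + ['OFFICE 4', 'OFFICE 5'],
    'CUSTOMER_SUPPLIER_NUMBER': [2222277, 2222278, 2222278, 2222289, 2222289,
                                 2222289, 1111111, 1111111, 1111112, 1111171,
                                 1111191],
    'Sales': [84, 26, 97, 7, 3, 9, 1, 5, 9, 95, 9],
    'Calls': [170, 103, 167, 167, 130, 195, 2, 2, 7, 193, 193],
    'Target': [265, 287, 288, 288, 295, 257, 286, 287, 230, 299, 298],
})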

Here's what I want to show in the final DF:

        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
0   OFFICE 1                   2222277     97    167     288
5   OFFICE 2                   2222289      9    195     257
7   OFFICE 3                   1111111      5      2     287
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298

I was thinking of solving this by using df.iterrows().

Here's what I've tried:

for n, v in df.iterrows():
    if int(v['Sales']) > 10:
        calls = df.loc[(v['NAME'] == v) & (int(v['Calls'].max()))]
        if int(calls['Calls']) > 10:
            target = df.loc[(v['NAME'] == v) & (int(v['Target'].max()))]
        else:
            print("No match found")
    else:
        sales = df.loc[(v['NAME'] == v) & (int(v['Sales'].max()))]

However, I keep getting a KeyError: False message. Any thoughts on what I'm doing wrong?

CodePudding user response:

This is not optimized, but it should meet your needs. The code snippet sends each NAME group to eval_group(), which checks the row holding each column's maximum, in priority order, until the Sales/Calls/Target criterion is met.

If you were to optimize, you could apply vectorization or parallelism principles to eval_group so that it is evaluated against all groups at once instead of sequentially; see the groupby-based sketch after the output below.

A couple of notes: this returns the first row when there is a tie (i.e., multiple records share the maximum value during the idxmax() call). Also, I believe that in your question, the first row of the desired answer should have OFFICE 1 as row 2, not row 0.
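
For instance, on a tiny Series (made-up data, just to show the tie behaviour):

import pandas as pd

s = pd.Series([5, 9, 9])
s.idxmax()  # -> 1: the first index holding the maximum; the tie at index 2 is ignored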

import pandas as pd

df = pd.read_csv('./data.txt')  # load the posted data

def eval_group(group, keys):
    # Walk the columns in priority order and return the index of the row
    # holding the first column maximum that is at least 10; if no column
    # qualifies, fall back to the last column's maximum.
    for key in keys:
        row_id = group[key].idxmax()
        if group.loc[row_id, key] >= 10 or key == keys[-1]:
            return row_id

row_ids = []
keys = ['Sales', 'Calls', 'Target']
for name in df['NAME'].unique().tolist():
    condition = df['NAME'] == name
    row_ids.append(eval_group(df[condition], keys))

df = df[df.index.isin(row_ids)]

df
        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
2   OFFICE 1                   2222278     97    167     288
5   OFFICE 2                   2222289      9    195     257
7   OFFICE 3                   1111111      5      2     287
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298
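
As a sketch of the groupby-based variant mentioned above (run it against the original, unfiltered df; GroupBy.apply still invokes the Python function once per group, so this is tidier rather than truly vectorized, and recent pandas versions may warn about the grouping column being passed to the function):

# Reuses eval_group() and keys from the snippet above.
row_ids = df.groupby('NAME', sort=False).apply(lambda g: eval_group(g, keys))
df.loc[row_ids]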

CodePudding user response:

This takes a couple of steps: build intermediate dataframes, apply the conditions, and filter based on their result:

import numpy as np
import pandas as pd

# index of the row holding each column's maximum, per NAME
temp = (df
        .drop(columns='CUSTOMER_SUPPLIER_NUMBER')
        .groupby('NAME', sort=False)
        .idxmax()
        )

# get the booleans for rows less than 10
bools = df.loc(axis=1)['Sales':'Target'].lt(10)

# True where all of a NAME's values in a column are less than 10
bools = bools.groupby(df.NAME, sort=False).all()

# conditions buildup: use Sales unless all Sales < 10, then Calls, then Target
condlist = [~bools.Sales, ~bools.Calls, ~bools.Target]
choicelist = [temp.Sales, temp.Calls, temp.Target]
# Target is the stated final fallback, so it also serves as the default
indices = np.select(condlist, choicelist, default=temp.Target)

# get matching rows
df.loc[indices]
        NAME  CUSTOMER_SUPPLIER_NUMBER  Sales  Calls  Target
2   OFFICE 1                   2222278     97    167     288
5   OFFICE 2                   2222289      9    195     257
7   OFFICE 3                   1111111      5      2     287
9   OFFICE 4                   1111171     95    193     299
10  OFFICE 5                   1111191      9    193     298
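
If np.select is unfamiliar: for each position it takes the value from the first choice whose condition is True, and the default where none is (a made-up mini example, unrelated to the office data):

import numpy as np
import pandas as pd

cond = [pd.Series([True, False]), pd.Series([True, True])]
choice = [pd.Series([10, 11]), pd.Series([20, 21])]
np.select(cond, choice, default=-1)  # -> array([10, 21])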