In Python I need to do an iterative groupby that access the previous "grouped value" to es-CodePudding

I have the following dataset that you can replicate with this code:

number_order = [2,2,3,3,5,5,5,6]
number_fakecouriers = [1,2,1,2,1,2,3,3]
dictio = {"number_order":number_order, "number_fakecouriers":number_fakecouriers}
actual_table = pd.DataFrame(dictio)

What I need is to write a code that through a for loop or a groupby generates the following result:

The code should perform a groupby on the column "number_orders" and then take the minimum of the column "number_fakeorders", but each time it should iteratively exclude the minimum values of the column "number_fakeorders" that have been already selected. Then in case there are no more values available it should input a "None".

This is the explanation row by row:

1) "number_orders" = 2 : here the value of "number_fakeorders" is "1", and it is simply the minimum value of "number_fakeorders", where ["number_orders" = 2], because it is the first value that appears

2) "number_orders" = 3 : here the value of "number_fakeorders" is "2" because "1" has been already selected for ["number_orders" = 2], so excluding "1", where ["number_orders" = 3] the minimum value is "2"

3) "number_orders" = 5 : here the value of "number_fakeorders" is "3" because "1" and "2" have been already selected

4) "number_orders" = 6 : here the value of "number_fakeorders" is "None" because the only value of "number_fakeorders" where ["number_orders" = 6] is "3", and "3" has already been selected

CodePudding user response：

Try:

def fn(x, seen):
    for v in x:
        if v in seen:
            continue
        seen.add(v)
        return v


out = (
    actual_table.groupby("number_order")["number_fakecouriers"]
    .apply(fn, seen=set())
    .reset_index()
)
print(out)

Prints:

   number_order  number_fakecouriers
0             2                  1.0
1             3                  2.0
2             5                  3.0
3             6                  NaN

Note: You can sort dataframe before processing (if not sorted already):

actual_table = actual_table.sort_values(
    by=["number_order", "number_fakecouriers"]
)

...

CodePudding user response：

Loop the groupby object and record previous min value in each group

res, prev_min = [], set()

for name, group in actual_table.groupby('number_order'):
    diff = set(group['number_fakecouriers']).difference(prev_min)
    if len(diff):
        m = min(diff)
        prev_min.add(m)
    else:
        m = np.nan

    res.append([name, m])

out = pd.DataFrame(res, columns=actual_table.columns)

print(out)

   number_order  number_fakecouriers
0             2                  1.0
1             3                  2.0
2             5                  3.0
3             6                  NaN