Pass function into pandas groupby

Time:06-29

I have a df that looks like this:

import pandas as pd

# 13 monthly rows for each of the two types, A and B
d = {'type': ['A'] * 13 + ['B'] * 13,
     'Date': ['Jun-21', 'Jul-21', 'Aug-21', 'Sep-21', 'Oct-21', 'Nov-21', 'Dec-21',
              'Jan-22', 'Feb-22', 'Mar-22', 'Apr-22', 'May-22', 'Jun-22'] * 2,
     'Units': [0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0,
               0, 0, 0, 0, 7, 4, 0, 4, 9, 5, 8, 3, 11]}
df = pd.DataFrame(data=d)

type    Date    Units
A       Jun-21  0
A       Jul-21  0
A       Aug-21  0
A       Sep-21  0
A       Oct-21  10
A       Nov-21  0
A       Dec-21  20
A       Jan-22  0
A       Feb-22  0
A       Mar-22  7
A       Apr-22  12
A       May-22  35
A       Jun-22  0
B       Jun-21  0
B       Jul-21  0
B       Aug-21  0
B       Sep-21  0
B       Oct-21  7
B       Nov-21  4
B       Dec-21  0
B       Jan-22  4
B       Feb-22  9
B       Mar-22  5
B       Apr-22  8
B       May-22  3
B       Jun-22  11

After some searching and an earlier Stack Overflow question, I have a function that calculates the value I need:

def my_function(df):
    df['Expected'] = 0
    for i in range(1, len(df)):
        if df['Units'][i] == 0:
            df['Expected'][i] = df['Expected'][i-1]
        if df['Units'][i] > 0:
            df['Expected'][i] = ((df['Units'][i]-2*df['Expected'][i-1])//5).clip(0) + df['Expected'][i-1].cumsum()

However, how can I adapt this function so that it works across the different groups in my data? Currently the function is only accurate when I pass it one group at a time. I've tried a few approaches, including "for _, group_key in groups:", but I can't get the function to work for every group in my data.

My expected output should be this:

type    Date    Units  Expected
A       Jun-21  0      0
A       Jul-21  0      0
A       Aug-21  0      0
A       Sep-21  0      0
A       Oct-21  10     2
A       Nov-21  0      2
A       Dec-21  20     5
A       Jan-22  0      5
A       Feb-22  0      5
A       Mar-22  7      5
A       Apr-22  12     5
A       May-22  35     10
A       Jun-22  0      10
B       Jun-21  0      0
B       Jul-21  0      0
B       Aug-21  0      0
B       Sep-21  0      0
B       Oct-21  7      1
B       Nov-21  4      1
B       Dec-21  0      1
B       Jan-22  4      1
B       Feb-22  9      2
B       Mar-22  5      2
B       Apr-22  8      2
B       May-22  3      2
B       Jun-22  11     3 

It works fine when I pass each group to the function separately, but I'm wondering what the best way is to make it work for every group in one go. Any help is greatly appreciated!

Answer:

Try grouping by type and applying my_function() to each group. One change is needed to make it work: iterate over the index instead of range(len(df)), because each group keeps the row labels it had in the original frame, so only the first group starts at 0.
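To see why range(len(df)) breaks on later groups, here is a minimal sketch on a shortened, made-up frame. The six-row data and the names small and labels are illustrative assumptions, not the asker's data.

```python
import pandas as pd

# Shortened, illustrative frame (not the full data from the question).
small = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B", "B"],
    "Units": [0, 10, 0, 0, 7, 4],
})

# Each sub-frame yielded by groupby keeps the row labels of the original frame.
labels = {key: group.index.tolist() for key, group in small.groupby("type")}
print(labels)  # → {'A': [0, 1, 2], 'B': [3, 4, 5]}
```

Group B carries the labels 3, 4, 5, so a label-based lookup such as group['Units'][1] finds nothing there, whereas iterating over group.index visits only labels that exist.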

def my_function(d):
    # initialize with 0
    d['Expected'] = 0
    # iterate over the index, skipping the first row
    # (i-1 assumes consecutive integer labels within each group)
    for i in d.index[1:]:
        if d.loc[i, 'Units'] == 0:
            d.loc[i, 'Expected'] = d.loc[i-1, 'Expected']
        elif d.loc[i, 'Units'] > 0:
            d.loc[i, 'Expected'] = d.loc[i-1, 'Expected'] + (d.loc[i, 'Units'] - 2 * d.loc[i-1, 'Expected']).clip(0)//5
    return d

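# apply the function to each type separately
# (group_keys=False keeps the original row index instead of adding the group key as an extra index level)
df = df.groupby('type', group_keys=False).apply(my_function)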
df
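An alternative sketch, assuming only the Units column is needed for the calculation: run the same recurrence in plain Python for each group and attach the result with groupby(...).transform. The helper name expected_from_units is hypothetical, and the frame below reproduces just the type and Units columns from the question.

```python
import pandas as pd

def expected_from_units(units):
    """Hypothetical helper: apply the recurrence to one group's Units values."""
    out = []
    prev = 0  # the first row of every group is always 0
    for pos, u in enumerate(units.tolist()):
        if pos > 0 and u > 0:
            # clip the difference at 0, floor-divide by 5, add to the previous value
            prev = prev + max(u - 2 * prev, 0) // 5
        out.append(prev)
    # keep the group's own index so pandas can align the result
    return pd.Series(out, index=units.index)

# type and Units columns as given in the question
df = pd.DataFrame({
    "type": ["A"] * 13 + ["B"] * 13,
    "Units": [0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0,
              0, 0, 0, 0, 7, 4, 0, 4, 9, 5, 8, 3, 11],
})

df["Expected"] = df.groupby("type")["Units"].transform(expected_from_units)
print(df)
```

Because the helper works on positions within each group rather than on index labels, it does not depend on the labels being consecutive integers.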


I also cleaned up my_function() a little: I used .loc instead of chained [][] indexing, removed the unnecessary cumsum() (df['Expected'][i-1] is a single value, so cumsum() is redundant), and moved the floor division (//) after clip(0), which removes one layer of brackets. Negative values end up as 0 in either order, so the result does not change.
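The claim that the order of clip(0) and // 5 does not matter can be checked with a short plain-Python sketch. The function names floor_then_clip and clip_then_floor are made up for this illustration, and max(x, 0) stands in for clip(0) on a single integer.

```python
def floor_then_clip(x):
    # order used in the question: floor-divide first, then clip at 0
    return max(x // 5, 0)

def clip_then_floor(x):
    # order used in the answer: clip at 0 first, then floor-divide
    return max(x, 0) // 5

# compare both orders over a range of negative and positive integers
mismatches = [x for x in range(-1000, 1001)
              if floor_then_clip(x) != clip_then_floor(x)]
print(mismatches)  # → []
```

For negative x, x // 5 is at most -1 and gets clipped to 0, while clipping first gives 0 // 5 = 0, so both orders agree.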
