Home > other >  cumsum based on taking first value from another column the creating a new calculation
cumsum based on taking first value from another column the creating a new calculation

Time:06-26

I was hoping to get some help with a calculation I'm struggling a bit with. I'm working with some data (copied below) and I need create a calculation that takes the first value > 0 from another column and computes a new series based on that value, and then aggregates the numbers giving a cumulative sum. My raw data looks like this:

d = {'Final Account': ['A', 'A', 'A' ,'A' , 'A', 'A', 'A','A' ,'A' ,'A', 'A', 'A', 'A'], 
     'Date': ['Jun-21','Jul-21','Aug-21','Sep-21','Oct-21','Nov-21','Dec-21','Jan-22','Feb-22','Mar-22','Apr-22','May-22','Jun-22'],
     'Units':[0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0]}
df = pd.DataFrame(data=d)

Account Date    Units
A       Jun-21  0
A       Jul-21  0
A       Aug-21  0
A       Sep-21  0
A       Oct-21  10
A       Nov-21  0
A       Dec-21  20
A       Jan-22  0
A       Feb-22  0
A       Mar-22  7
A       Apr-22  12
A       May-22  35
A       Jun-22  0

To the table I do an initial conversion for my data which is:

df['Conv'] = df['Units'].apply(x/5)

This adds a new column to my table like this:

   Account  Date    Units   Conv
   A        Jun-21  0       0
   A        Jul-21  0       0
   A        Aug-21  0       0
   A        Sep-21  0       0
   A        Oct-21  10      2
   A        Nov-21  0       0 
   A        Dec-21  20      4 
   A        Jan-22  0       0
   A        Feb-22  0       0
   A        Mar-22  7       1
   A        Apr-22  12      2
   A        May-22  35      7
   A        Jun-22  0       0

The steps after this I begin to run into issues. I need to calculate new field which takes the first value of the conv field > 0, at the same index position and begins a new calculation based on the previous rows cumsum and then adds it back into the cumsum following the calculation. Outside of python this is done by creating two columns. One to calculate new units by:

(Units - (previous row cumsum of existing units * 2))/5

Then existing units which is just the cumsum of the values that have been figured out to be new units. The desired output should look something like this:

   Account  Date    Units   Conv  New Units  Existing Units (cumsum of new units)
   A        Jun-21  0       0     0          0
   A        Jul-21  0       0     0          0
   A        Aug-21  0       0     0          0
   A        Sep-21  0       0     0          0 
   A        Oct-21  10      2     2          2 
   A        Nov-21  0       0     0          2
   A        Dec-21  20      4     3          5
   A        Jan-22  0       0     0          5
   A        Feb-22  0       0     0          5
   A        Mar-22  7       1     0          5
   A        Apr-22  12      2     0          5
   A        May-22  35      7     5          10
   A        Jun-22  0       0     0          10

The main issue I'm struggling with is grabbing the first value >0 from the "Conv" column and being able to create a new cumsum based on that initial value that can be applied to the "New Units" calculation. Any guidance is much appreciated, and despite reading a lot around I've hit a bit of a brick wall! If you need me to explain better please do ask! :)

Much appreciated in advance!

CodePudding user response:

Try using a for loop that performs the sample calculation you provided

# initialize new rows to zero
df['new u'] = 0
df['ext u'] = 0

# set first row cumsum
df['ext u'][0] = df['units'][0]//5

# loop through the data frame to perform the calculations
for i in range(1, len(df)):
    # calculate new units
    df['new u'][i] = (df['units'][i]-2*df['ext u'][i-1])//5
    # calculate existing units
    df['ext u'][i] = df['ext u'][i-1]   df['new u']

I'm not certain that those are the exact expressions you are looking for, but hopefully this gets you on your way to a solution. Worth noting that this does not take care of the whole "first value > 0" thing because (feel free to correct me but) it seems like before that you will just be adding up zeros, which won't affect anything. Hope this helps!

CodePudding user response:

I'm not sure that I completely understand what you are trying to achieve. Nevertheless, here's an attempt to reproduce your expected results. For your example frame this

groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.loc[i, 'New Units'] = new_unit

does result in

   Final Account    Date  Units  New Units
0              A  Jun-21      0          0
1              A  Jul-21      0          0
2              A  Aug-21      0          0
3              A  Sep-21      0          0
4              A  Oct-21     10          2
5              A  Nov-21      0          0
6              A  Dec-21     20          3
7              A  Jan-22      0          0
8              A  Feb-22      0          0
9              A  Mar-22      7          0
10             A  Apr-22     12          0
11             A  May-22     35          5
12             A  Jun-22      0          0

The first step identifies the blocks in column Units whose last item is relevant for building the new units: Successive zeros, followed by non-zeros, until the first zero. This

groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()

results in

0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     3
8     3
9     3
10    3
11    3
12    4

Then group column Units along these blocks, grab the last item if each block if it is non-zero (zero can only happen in the last block), build the new unit (according to the given formula) and store it in the new column New Units.

(If you actually need the column Existing Units then just use .cumsum() on the column New Units.)

  • Related