I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep | open_date(MM/DD/YYYY) | close_date | amount |
---|---|---|---|
Jim | 1/01/2021 | 2/05/2021 | 3 |
Jim | 1/15/2021 | 4/06/2021 | 26 |
Jim | 2/01/2021 | 2/06/2021 | 7 |
Jim | 2/15/2021 | 3/14/2021 | 12 |
Jim | 3/01/2021 | 4/22/2021 | 13 |
Jim | 3/15/2021 | 3/29/2021 | 5 |
Jim | 4/01/2021 | 4/20/2021 | 17 |
Bob | 1/01/2021 | 1/12/2021 | 23 |
Bob | 1/15/2021 | 2/16/2021 | 12 |
Bob | 2/01/2021 | 3/04/2021 | 4 |
Bob | 2/15/2021 | 4/05/2021 | 23 |
Bob | 3/01/2021 | 3/24/2021 | 12 |
Bob | 3/15/2021 | 4/15/2021 | 7 |
Bob | 4/01/2021 | 5/01/2021 | 20 |
I want to create a column that tells me the open amount. So if we take the second row we can see that the opp was closed on 04/06/2021. I want to know how many open opps there were before that date. So I would look to see if the open date for row 5 was before the close date of 4/06/2021 and that the close date for row 5 is also after 04/06/2021. In this case it is so I would add that to the sum. I also want to current row value to be included in the sum. This should be done for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep | open_date(MM/DD/YYYY) | close_date | amount | open_amount_sum |
---|---|---|---|---|
Jim | 1/01/2021 | 2/05/2021 | 3 | 36 (I got this by adding 3, 26, and 7 because those are the only two values that fit the condition and the 3 because it is the value for that row.) |
Jim | 1/15/2021 | 4/06/2021 | 26 | 56 |
Jim | 2/01/2021 | 2/06/2021 | 7 | 33 |
Jim | 2/15/2021 | 3/14/2021 | 12 | 51 |
Jim | 3/01/2021 | 4/22/2021 | 13 | 13 |
Jim | 3/15/2021 | 3/29/2021 | 5 | 44 |
Jim | 4/01/2021 | 4/20/2021 | 17 | 30 |
Bob | 1/01/2021 | 1/12/2021 | 23 | 23 |
Bob | 1/15/2021 | 2/16/2021 | 12 | 39 |
Bob | 2/01/2021 | 3/04/2021 | 4 | 39 |
Bob | 2/15/2021 | 4/05/2021 | 23 | 50 |
Bob | 3/01/2021 | 3/24/2021 | 12 | 42 |
Bob | 3/15/2021 | 4/15/2021 | 7 | 27 |
Bob | 4/01/2021 | 5/01/2021 | 20 | 20 |
CodePudding user response:
Edit @RJ's solution from the comments is better. here it is formatted slightly differently
df['open_amount_sum'] = df.apply(
lambda x: df[
df['sales_rep'].eq(x['sales_rep']) &
df['open_date'].le(x['close_date']) &
df['close_date'].ge(x['close_date'])
]['amount'].sum(),
axis=1,
)
Here is a solution, but it is slow and kind of ugly. can definitely be improved
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
))
sum_df = df.groupby('sales_rep').apply(
lambda g:
g['close_date'].apply(
lambda close:
g.loc[
g['open_date'].le(close) & g['close_date'].ge(close),
'amount'
].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
CodePudding user response:
Merge the dataframe unto itself, then filter, before grouping:
(df
.merge(df, on='sales_rep')
.query('open_date_y <= close_date_x<=close_date_y')
.loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
.rename(columns=lambda col: col.removesuffix('_x'))
.rename(columns = {'amount_y' : 'open_sum_amount'})
.groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
sort = False,
as_index = False)
.sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20