I have a data frame like the one below.
Event_ID ticket Revenue Expences expect Signed_Date start_Date end_Date
G-00001 671 6720 793 50 June 2021 2021-06-13 2021-08-13
G-00002 6 56 18 100 May 2021 2021-06-13 2021-07-13
G-00003 5 78 38 100 May 2021 2021-06-14 2021-09-14
G-00004 23 34 23 NaN June 2021 2021-06-13 2021-09-13
G-00005 4 89 43 40 June 2021 2021-06-14 2021-09-14
G-00006 60 73 20 60 April 2021 2021-06-15 2021-09-15
G-00007 60 345 110 60 June 2021 2021-06-15 2021-09-15
G-00008 89 890 NaN NaN June 2021 2021-06-13 2021-09-13
G-00009 0 0 0 50 May 2021 2021-06-16 2021-09-16
G-00010 6 45 16 60 June 2021 2021-06-13 2021-09-13
G-00011 3 39 23 30 June 2021 2021-06-13 2021-09-13
G-00012 2 34 72 20 June 2021 2021-06-13 2021-09-13
G-00013 4 89 48 40 June 2021 2021-06-16 2021-09-16
G-00014 32 127 35 10 April 2021 2021-05-23 2021-08-23
G-00015 3 84 28 120 April 2021 2021-05-13 2021-08-13
G-00016 1 100 25 140 March 2021 2021-03-26 2021-08-26
G-00017 23 525 39 90 May 2021 2021-05-13 2021-10-13
I want to create a duplicate Event_ID based on the "Signed_Date" field. If the "Signed_Date" field is not in the same month and year as "start_Date" I want to duplicate Event_ID and remove Expenses from the original Event_ID field and put it into the new duplicate Event_ID line. Also, I want to create the Event_date field based on the same above concept it "Signed_Date" field is not in the same month and year in "start_Date", Event_date should be the first day of the "Signed_Date" month else "start_Date"
for example, if we think the first two indexes in the above DF
Event_ID ticket Revenue Expences expect Signed_Date start_Date end_Date
G-00001 671 6720 793 50 June 2021 2021-06-13 2021-08-13
G-00002 6 56 18 100 May 2021 2021-06-13 2021-07-13
The above should look like below
Event_ID ticket Revenue Expences expect Signed_Date start_Date end_Date Event_Date
G-00001 671 6720 793 50 June 2021 2021-06-13 2021-08-13 2021-06-13
G-00002 6 56 0 100 May 2021 2021-06-13 2021-07-13 2021-06-13
G-00002 0 0 18 0 May 2021 2021-06-13 2021-07-13 2021-05-01
I'm trying to do something like that
df['StartDateMonth'] = df['start_Date'].dt.month
df['SignedDateMonth'] = df['Signed_Date'].dt.month
#if those are not equal then I'm going to make a copy of that line
index_to_copy = 0
number_of_extra_copies = 1
pd.concat([df,
pd.DataFrame(np.repeat(df.iloc[[index_to_copy]].values,
number_of_extra_copies,
axis=0),
columns=df.columns)]).sort_values(by='index').drop(columns='index').reset_index(drop=True)
But doesn't work
prfered way
Event_ID ticket Revenue Expences expect Signed_Date start_Date end_Date Event_Date
G-00001 223.6 2240 264.33 16.66 June 2021 2021-06-13 2021-08-13 2021-06-13
G-00001 223.6 2240 264.33 16.66 June 2021 2021-06-13 2021-08-13 2021-07-01
G-00001 223.6 2240 264.33 16.66 June 2021 2021-06-13 2021-08-13 2021-08-01
G-00002 3 28 0 50 May 2021 2021-06-13 2021-07-13 2021-06-13
G-00002 3 28 0 50 May 2021 2021-06-13 2021-07-13 2021-07-01
G-00002 0 0 18 0 May 2021 2021-06-13 2021-07-13 2021-05-01
Thanks in advance
CodePudding user response:
Create a boolean mask to identify rows based on a month period. After that you can split your 2 dataframes and update each one separately:
# Prepare some data
df['Event_Date'] = df['start_Date']
signed_date = pd.to_datetime(df['Signed_Date'])
# Boolean mask based on period
m = signed_date.dt.to_period('M') != df['start_Date'].dt.to_period('M')
# Create the duplicate dataframe
df1 = df[m].assign(ticket=0, Revenue=0, expect=0, Event_Date=signed_date)
# Reset Expences values
df.loc[m, 'Expences'] = 0
# Merge the 2 dataframes
out = pd.concat([df, df1]).sort_index(ignore_index=True)
Output:
>>> out
Event_ID ticket Revenue Expences expect Signed_Date start_Date end_Date Event_Date
0 G-00001 671 6720 793.0 50.0 June 2021 2021-06-13 2021-06-13 2021-06-13
1 G-00002 6 56 0.0 100.0 May 2021 2021-06-13 2021-06-13 2021-06-13
2 G-00002 0 0 18.0 0.0 May 2021 2021-06-13 2021-06-13 2021-05-01
3 G-00003 5 78 0.0 100.0 May 2021 2021-06-14 2021-06-14 2021-06-14
4 G-00003 0 0 38.0 0.0 May 2021 2021-06-14 2021-06-14 2021-05-01
5 G-00004 23 34 23.0 NaN June 2021 2021-06-13 2021-06-13 2021-06-13
6 G-00005 4 89 43.0 40.0 June 2021 2021-06-14 2021-06-14 2021-06-14
7 G-00006 60 73 0.0 60.0 April 2021 2021-06-15 2021-06-15 2021-06-15
8 G-00006 0 0 20.0 0.0 April 2021 2021-06-15 2021-06-15 2021-04-01
9 G-00007 60 345 110.0 60.0 June 2021 2021-06-15 2021-06-15 2021-06-15
10 G-00008 89 890 NaN NaN June 2021 2021-06-13 2021-06-13 2021-06-13
11 G-00009 0 0 0.0 50.0 May 2021 2021-06-16 2021-06-16 2021-06-16
12 G-00009 0 0 0.0 0.0 May 2021 2021-06-16 2021-06-16 2021-05-01
13 G-00010 6 45 16.0 60.0 June 2021 2021-06-13 2021-06-13 2021-06-13
14 G-00011 3 39 23.0 30.0 June 2021 2021-06-13 2021-06-13 2021-06-13
15 G-00012 2 34 72.0 20.0 June 2021 2021-06-13 2021-06-13 2021-06-13
16 G-00013 4 89 48.0 40.0 June 2021 2021-06-16 2021-06-16 2021-06-16
17 G-00014 32 127 0.0 10.0 April 2021 2021-05-23 2021-05-23 2021-05-23
18 G-00014 0 0 35.0 0.0 April 2021 2021-05-23 2021-05-23 2021-04-01
19 G-00015 3 84 0.0 120.0 April 2021 2021-05-13 2021-05-13 2021-05-13
20 G-00015 0 0 28.0 0.0 April 2021 2021-05-13 2021-05-13 2021-04-01
21 G-00016 1 100 25.0 140.0 March 2021 2021-03-26 2021-03-26 2021-03-26
22 G-00017 23 525 39.0 90.0 May 2021 2021-05-13 2021-05-13 2021-05-13