I have a data-frame formatted like so:
Contract | Agreement_Date | Date |
---|---|---|
A | 2017-02-10 | 2020-02-03 |
A | 2017-02-10 | 2020-02-04 |
A | 2017-02-11 | 2020-02-09 |
A | 2017-02-11 | 2020-02-10 |
A | 2017-02-11 | 2020-04-21 |
B | 2017-02-14 | 2020-08-01 |
B | 2017-02-15 | 2020-08-11 |
B | 2017-02-17 | 2020-10-14 |
C | 2017-02-11 | 2020-12-12 |
C | 2017-02-11 | 2020-12-16 |
In this data-frame I have multiple observations for each contract. For some of the contracts, the Agreement_Date changes as new amendments occur. For example, Contract A had its Agreement_Date change from 2017-02-10 to 2017-02-11, and Contract B had its Agreement_Date change twice. Contract C had no change to Agreement_Date.
What I would like is an output that looks like so:
Contract | Date | Number_of_Changes |
---|---|---|
A | 2020-02-09 | 1 |
B | 2020-08-11 | 2 |
B | 2020-10-14 | 2 |
Here the Date column shows when the change to Agreement_Date occurred (e.g. for Contract A, the Agreement_Date first went from 2017-02-10 to 2017-02-11 on 2020-02-09). I then want a Number_of_Changes column which simply shows how many times the Agreement_Date changed for that contract.
I have been working on this for a few hours to no avail, so any help would be appreciated.
Thanks :)
CodePudding user response:
I posted a previous answer, but realised it wasn't what you expected. Would this one work?
# Create a 'progressive' number-of-changes column per Contract
df['Changes'] = df.groupby('Contract')['Agreement_Date'].transform(
    lambda x: (x != x.shift()).cumsum()
) - 1

# Build a new df: filter for changes and drop duplicates,
# assuming the frame is already sorted by 'Date'
newdf = df[df['Changes'] > 0].drop_duplicates(
    subset=['Contract', 'Changes']
)[['Contract', 'Date', 'Changes']]

# Reassign 'Changes' to the max change count per Contract
newdf['Changes'] = newdf.groupby('Contract')['Changes'].transform('max')
newdf
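For a copy-paste check, here is the same logic made self-contained by rebuilding the sample frame from the question first (a sketch to illustrate the expected result, not part of the original answer):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Contract': ['A'] * 5 + ['B'] * 3 + ['C'] * 2,
    'Agreement_Date': ['2017-02-10', '2017-02-10', '2017-02-11',
                       '2017-02-11', '2017-02-11', '2017-02-14',
                       '2017-02-15', '2017-02-17', '2017-02-11',
                       '2017-02-11'],
    'Date': ['2020-02-03', '2020-02-04', '2020-02-09', '2020-02-10',
             '2020-04-21', '2020-08-01', '2020-08-11', '2020-10-14',
             '2020-12-12', '2020-12-16'],
})

# Progressive change counter per Contract, then filter and dedupe
df['Changes'] = df.groupby('Contract')['Agreement_Date'].transform(
    lambda x: (x != x.shift()).cumsum()) - 1
newdf = df[df['Changes'] > 0].drop_duplicates(
    subset=['Contract', 'Changes'])[['Contract', 'Date', 'Changes']]
newdf['Changes'] = newdf.groupby('Contract')['Changes'].transform('max')
print(newdf)
#   Contract        Date  Changes
# 2        A  2020-02-09        1
# 6        B  2020-08-11        2
# 7        B  2020-10-14        2
```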
CodePudding user response:
This problem revolves around setting up some pieces for later computational use. You'll need multiple passes to
- shift the dates & retrieve the records where the changes occur
- calculate the number of changes that occurred
We can do this by working with the groupby object in 2 steps.
contract_grouped = df.groupby('Contract')['Agreement_Date']
# subset data based on whether or not a date change occurred
shifted_dates = contract_grouped.shift()
changed_df = df.loc[
shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull()
].copy()
# calculate counts and assign back to df
changed_df['count'] = changed_df['Contract'].map(contract_grouped.nunique() - 1)
del changed_df['Date'] # unneeded column
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2
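To make the snippet above runnable on its own, here it is with the question's sample frame reconstructed (a sketch; the original answer assumes df already exists):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Contract': ['A'] * 5 + ['B'] * 3 + ['C'] * 2,
    'Agreement_Date': ['2017-02-10', '2017-02-10', '2017-02-11',
                       '2017-02-11', '2017-02-11', '2017-02-14',
                       '2017-02-15', '2017-02-17', '2017-02-11',
                       '2017-02-11'],
    'Date': ['2020-02-03', '2020-02-04', '2020-02-09', '2020-02-10',
             '2020-04-21', '2020-08-01', '2020-08-11', '2020-10-14',
             '2020-12-12', '2020-12-16'],
})

contract_grouped = df.groupby('Contract')['Agreement_Date']

# Keep rows where Agreement_Date differs from the previous row of the
# same contract; the notnull check excludes each contract's first row
shifted_dates = contract_grouped.shift()
changed_df = df.loc[
    shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull()
].copy()

# Number of changes = number of distinct Agreement_Dates minus 1
changed_df['count'] = changed_df['Contract'].map(
    contract_grouped.nunique() - 1)
changed_df = changed_df.drop(columns='Date')
```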
Here is the same approach written with method chaining and assignment-expression (walrus) syntax. If the version above is more readable to you, please use that; I put this here mainly because I enjoy writing my pandas answers both ways.
changed_df = (
df.groupby('Contract')['Agreement_Date']
.pipe(lambda grouper:
df.loc[
(shifted := grouper.shift()).ne(df['Agreement_Date'])
& shifted.notnull()
]
.assign(count=lambda d: d['Contract'].map(grouper.nunique().sub(1)))
.drop(columns='Date')
)
)
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2