I have a data-frame formatted like so:
Contract | Agreement_Date | Date |
---|---|---|
A | 2017-02-10 | 2020-02-03 |
A | 2017-02-10 | 2020-02-04 |
A | 2017-02-11 | 2020-02-09 |
A | 2017-02-11 | 2020-02-10 |
A | 2017-02-11 | 2020-04-21 |
B | 2017-02-14 | 2020-08-01 |
B | 2017-02-15 | 2020-08-11 |
B | 2017-02-17 | 2020-10-14 |
C | 2017-02-11 | 2020-12-12 |
C | 2017-02-11 | 2020-12-16 |
In this data-frame I have multiple observations for each contract. For some of the contracts, the Agreement_Date changes as new amendments occur. For example, Contract A had its Agreement_Date change from 2017-02-10 to 2017-02-11, and Contract B had its Agreement_Date change twice. Contract C had no change to Agreement_Date.
What I would like is an output that looks like so:
Contract | Date | Number_of_Changes |
---|---|---|
A | 2020-02-09 | 1 |
B | 2020-08-11 | 2 |
B | 2020-10-14 | 2 |
Here the Date column shows when the change to Agreement_Date occurred (e.g. for Contract A, the Agreement_Date first went from 2017-02-10 to 2017-02-11 on 2020-02-09). I then want a Number_of_Changes column which simply shows how many times the Agreement_Date changed for that contract.
I have been working on this for a few hours to no avail, so any help would be appreciated.
Thanks :)
CodePudding user response:
I posted a previous answer, but realised it wasn't what you expected. Would this one work?
# Create a 'progressive' number-of-changes column per Contract
df['Changes'] = df.groupby('Contract')['Agreement_Date'].transform(
    lambda x: (x != x.shift()).cumsum()
) - 1

# Build a new df: filter for changes and drop duplicates,
# assuming the frame is already sorted by 'Date'
newdf = df[df['Changes'] > 0].drop_duplicates(
    subset=['Contract', 'Changes']
)[['Contract', 'Date', 'Changes']]

# Reassign 'Changes' to the max change count per Contract
newdf['Changes'] = newdf.groupby('Contract')['Changes'].transform('max')
newdf
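For a copy-paste check, here is the same logic made self-contained by rebuilding the sample frame from the question first (a sketch to illustrate the expected result, not part of the original answer):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Contract': ['A'] * 5 + ['B'] * 3 + ['C'] * 2,
    'Agreement_Date': ['2017-02-10', '2017-02-10', '2017-02-11',
                       '2017-02-11', '2017-02-11', '2017-02-14',
                       '2017-02-15', '2017-02-17', '2017-02-11',
                       '2017-02-11'],
    'Date': ['2020-02-03', '2020-02-04', '2020-02-09', '2020-02-10',
             '2020-04-21', '2020-08-01', '2020-08-11', '2020-10-14',
             '2020-12-12', '2020-12-16'],
})

# Progressive change counter per Contract, then filter and dedupe
df['Changes'] = df.groupby('Contract')['Agreement_Date'].transform(
    lambda x: (x != x.shift()).cumsum()) - 1
newdf = df[df['Changes'] > 0].drop_duplicates(
    subset=['Contract', 'Changes'])[['Contract', 'Date', 'Changes']]
newdf['Changes'] = newdf.groupby('Contract')['Changes'].transform('max')
print(newdf)
#   Contract        Date  Changes
# 2        A  2020-02-09        1
# 6        B  2020-08-11        2
# 7        B  2020-10-14        2
```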
CodePudding user response:
This problem revolves around setting up some pieces for later computational use. You'll need multiple passes to
- shift the dates & retrieve the records where the changes occur
- calculate the number of changes that occurred
We can do this by working with the groupby object in 2 steps.
contract_grouped = df.groupby('Contract')['Agreement_Date']
# subset data based on whether or not a date change occurred
shifted_dates = contract_grouped.shift()
changed_df = df.loc[
shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull()
].copy()
# calculate counts and assign back to df
changed_df['count'] = changed_df['Contract'].map(contract_grouped.nunique() - 1)
del changed_df['Date'] # unneeded column
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2
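To make the snippet above runnable on its own, here it is with the question's sample frame reconstructed (a sketch; the original answer assumes df already exists):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Contract': ['A'] * 5 + ['B'] * 3 + ['C'] * 2,
    'Agreement_Date': ['2017-02-10', '2017-02-10', '2017-02-11',
                       '2017-02-11', '2017-02-11', '2017-02-14',
                       '2017-02-15', '2017-02-17', '2017-02-11',
                       '2017-02-11'],
    'Date': ['2020-02-03', '2020-02-04', '2020-02-09', '2020-02-10',
             '2020-04-21', '2020-08-01', '2020-08-11', '2020-10-14',
             '2020-12-12', '2020-12-16'],
})

contract_grouped = df.groupby('Contract')['Agreement_Date']

# Keep rows where Agreement_Date differs from the previous row of the
# same contract; the notnull check excludes each contract's first row
shifted_dates = contract_grouped.shift()
changed_df = df.loc[
    shifted_dates.ne(df['Agreement_Date']) & shifted_dates.notnull()
].copy()

# Number of changes = number of distinct Agreement_Dates minus 1
changed_df['count'] = changed_df['Contract'].map(
    contract_grouped.nunique() - 1)
changed_df = changed_df.drop(columns='Date')
```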
Here is the same approach written with method chaining and assignment-expression (walrus) syntax. If the version above is more readable to you, please use that; I put this here mainly because I enjoy writing my pandas answers both ways.
changed_df = (
df.groupby('Contract')['Agreement_Date']
.pipe(lambda grouper:
df.loc[
(shifted := grouper.shift()).ne(df['Agreement_Date'])
& shifted.notnull()
]
.assign(count=lambda d: d['Contract'].map(grouper.nunique().sub(1)))
.drop(columns='Date')
)
)
print(changed_df)
Contract Agreement_Date count
2 A 2017-02-11 1
6 B 2017-02-15 2
7 B 2017-02-17 2