How can I make sure that a change in one dataframe is reflected by a similar change in another dataf-CodePudding

I'm working with two dataframes similar to this:

import pandas as pd
records_A = pd.DataFrame({'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006], 
'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
                     'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1']})

records_B = pd.DataFrame({'id': [1001,1001,1002,1003,1004,1005], 
'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
                     'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31']})

I need to perform a check to ensure that a location change in records_A corresponds with a plan change in records_B but I can't quite envision how to accomplish this pythonically.

For example, id 1001 moved from NJ to OH on 4/1, as shown in records_A. This corresponds to a plan start date on 4/1 in records_B so this is valid.

Compare with id 1003. This person moved from NJ to NC on 6/1. But in records_B, there is no plan change corresponding to this location move. I need to flag this.

I've started to approach this problem by analyzing records_A for location changes. After some help on here, I tried creating a location change column

records_A['location_change?']=np.where(records_A.groupby('id').location.apply(lambda x: x!=x.iloc[0]),'Changed','Unchanged')

This helps identify changes in records_A but I can't figure out how to then compare these results to records_B. In addition, in the case of people with more than 2 records, it's comparing the location to the original, even though the 3rd row's location is unchanged (1001 did not move on 6/1)

Any ideas?

Thanks!

CodePudding user response：

Not sure I understand exactly what you want, but you can do an outer join of both dataframes on columns id and date for records_A, and id and plan start for records_B. This way, changes not present in records_B will be automatically "flagged" with a NaN value:

import pandas as pd

records_A = pd.DataFrame({
    'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006], 
    'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
    'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1'],
})
records_B = pd.DataFrame({
    'id': [1001,1001,1002,1003,1004,1005], 
    'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
    'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31'],
})

merged_df = records_A.merge(
    records_B,
    left_on=['id', 'date'],
    right_on=['id', 'plan start'],
    how='outer',
)
print(merged_df)

# output:
#       id location date plan start plan end
# 0   1001       NJ  1/1        1/1     3/31
# 1   1001       OH  4/1        4/1    12/31
# 2   1001       OH  6/1        NaN      NaN
# 3   1002       CA  1/1        1/1    12/31
# 4   1002       CA  4/1        NaN      NaN
# 5   1002       CA  6/1        NaN      NaN
# 6   1003       NJ  1/1        1/1    12/31
# 7   1003       NJ  4/1        NaN      NaN
# 8   1003       NC  6/1        NaN      NaN
# 9   1004       PA  1/1        1/1    12/31
# 10  1005       UT  1/1        1/1    12/31
# 11  1006       AZ  1/1        NaN      NaN

CodePudding user response：

You don't need to use apply for flagging the location changes. Here is a full working solution.

import pandas as pd

#flag location changed in records_A
records_A['location_changed'] = records_A.groupby('id')['location'].\
                                shift().\
                                bfill().\
                                ne(records_A['location']).astype(int)

#merge records
merged_records = records_A.merge(records_B, \
                                 left_on=['id', 'date'], \
                                 right_on=['id', 'plan_start'], \
                                 how='outer')

#flag potential errors in records_B
merged_records['flag'] = 0
merged_records.loc[(merged_records['location_changed'] == 1) & (merged_records['plan_start'].isnull()), 'flag'] = 1

This will output the following where the flag column represents errors.

    id  location date location_changed plan_start plan_end flag
0   1001    NJ    1/1         0           1/1       3/31    0
1   1001    OH    4/1         1           4/1       12/31   0
2   1001    OH    6/1         0           NaN       NaN     0
3   1002    CA    1/1         0           1/1       12/31   0
4   1002    CA    4/1         0           NaN       NaN     0
5   1002    CA    6/1         0           NaN       NaN     0
6   1003    NJ    1/1         0           1/1       12/31   0
7   1003    NJ    4/1         0           NaN       NaN     0
8   1003    NC    6/1         1           NaN       NaN     1
9   1004    PA    1/1         1           1/1       12/31   0
10  1005    UT    1/1         1           1/1       12/31   0
11  1006    AZ    1/1         1           NaN       NaN     1