I'm working with two dataframes similar to this:
import pandas as pd
records_A = pd.DataFrame({'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006],
'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1']})
records_B = pd.DataFrame({'id': [1001,1001,1002,1003,1004,1005],
'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31']})
I need to perform a check to ensure that a location change in records_A corresponds with a plan change in records_B but I can't quite envision how to accomplish this pythonically.
For example, id 1001 moved from NJ to OH on 4/1, as shown in records_A. This corresponds to a plan start date on 4/1 in records_B so this is valid.
Compare with id 1003. This person moved from NJ to NC on 6/1. But in records_B, there is no plan change corresponding to this location move. I need to flag this.
I've started to approach this problem by analyzing records_A for location changes. After some help on here, I tried creating a location change column
records_A['location_change?']=np.where(records_A.groupby('id').location.apply(lambda x: x!=x.iloc[0]),'Changed','Unchanged')
This helps identify changes in records_A but I can't figure out how to then compare these results to records_B. In addition, in the case of people with more than 2 records, it's comparing the location to the original, even though the 3rd row's location is unchanged (1001 did not move on 6/1)
Any ideas?
Thanks!
CodePudding user response:
Not sure I understand exactly what you want, but you can do an outer join of both dataframes on columns id
and date
for records_A
, and id
and plan start
for records_B
. This way, changes not present in records_B
will be automatically "flagged" with a NaN
value:
import pandas as pd
records_A = pd.DataFrame({
'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006],
'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1'],
})
records_B = pd.DataFrame({
'id': [1001,1001,1002,1003,1004,1005],
'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31'],
})
merged_df = records_A.merge(
records_B,
left_on=['id', 'date'],
right_on=['id', 'plan start'],
how='outer',
)
print(merged_df)
# output:
# id location date plan start plan end
# 0 1001 NJ 1/1 1/1 3/31
# 1 1001 OH 4/1 4/1 12/31
# 2 1001 OH 6/1 NaN NaN
# 3 1002 CA 1/1 1/1 12/31
# 4 1002 CA 4/1 NaN NaN
# 5 1002 CA 6/1 NaN NaN
# 6 1003 NJ 1/1 1/1 12/31
# 7 1003 NJ 4/1 NaN NaN
# 8 1003 NC 6/1 NaN NaN
# 9 1004 PA 1/1 1/1 12/31
# 10 1005 UT 1/1 1/1 12/31
# 11 1006 AZ 1/1 NaN NaN
CodePudding user response:
You don't need to use apply for flagging the location changes. Here is a full working solution.
import pandas as pd
#flag location changed in records_A
records_A['location_changed'] = records_A.groupby('id')['location'].\
shift().\
bfill().\
ne(records_A['location']).astype(int)
#merge records
merged_records = records_A.merge(records_B, \
left_on=['id', 'date'], \
right_on=['id', 'plan_start'], \
how='outer')
#flag potential errors in records_B
merged_records['flag'] = 0
merged_records.loc[(merged_records['location_changed'] == 1) & (merged_records['plan_start'].isnull()), 'flag'] = 1
This will output the following where the flag column represents errors.
id location date location_changed plan_start plan_end flag
0 1001 NJ 1/1 0 1/1 3/31 0
1 1001 OH 4/1 1 4/1 12/31 0
2 1001 OH 6/1 0 NaN NaN 0
3 1002 CA 1/1 0 1/1 12/31 0
4 1002 CA 4/1 0 NaN NaN 0
5 1002 CA 6/1 0 NaN NaN 0
6 1003 NJ 1/1 0 1/1 12/31 0
7 1003 NJ 4/1 0 NaN NaN 0
8 1003 NC 6/1 1 NaN NaN 1
9 1004 PA 1/1 1 1/1 12/31 0
10 1005 UT 1/1 1 1/1 12/31 0
11 1006 AZ 1/1 1 NaN NaN 1