I'm cross-checking two dataframes to assign a value to a flag. If a specific key is present in both dataframes with a different value, the flag is set to "change" for that row. If the value is the same, the flag is set to "no change". However, if a specific key is present more than once in only one of the two dataframes, the extra rows should get the flag "additional change". Let me give an example to make it clearer:
df 1:
key | value | key value present in df 2 |
---|---|---|
abcd | 1 | False |
wxyz | 5 | True |
df 2:
key | value | key value present in df 1 |
---|---|---|
abcd | 2 | False |
wxyz | 5 | True |
Then the result for dataframe 1 will be:
df 1:
key | value | key value present in df 2 | xcheck_flag |
---|---|---|---|
abcd | 1 | False | change |
wxyz | 5 | True | no change |
To get this result I use the following logic:
def changeType(df1):
    def condition_check(row):
        if (row['key value present in df 2'] == False):
            return 'change'
        else:
            return 'no change'
    df1['xcheck_flag'] = df1.apply(condition_check, axis=1)
Now this is rather straightforward, right? Well I have a complication which I haven't been able to solve, yet.
Imagine the following use case:
df 1:
key | value | key value present in df 2 |
---|---|---|
abcd | 1 | False |
wxyz | 5 | True |
abcd | 3 | False |
df 2:
key | value | key value present in df 1 |
---|---|---|
abcd | 2 | False |
wxyz | 5 | True |
In this case, the key abcd appears twice in df 1 and only once in df 2. If this happens, I need to apply the following logic when doing the cross-dataframe check: the first time the key is matched against dataframe 2, the flag is set to "change" as in the previous case; every further time the same key is matched, the flag is set to "additional change". It doesn't matter which row from df 1 gets "change" and which gets "additional change". The only condition is that, in such a case, exactly one key-value pair is assigned "change" and all the other occurrences are assigned "additional change".
This gives us:
df 1:
key | value | key value present in df 2 | xcheck_flag |
---|---|---|---|
abcd | 1 | False | change |
wxyz | 5 | True | no change |
abcd | 3 | False | additional change |
I've been trying to adapt my initial function to include this behaviour but without success.
If you have any hint, it would be greatly welcomed!
CodePudding user response:
I would probably do something like this:
import pandas as pd
df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index(['key']), rsuffix='_2', on=['key'])
which gives you a dataframe which I think contains all the columns you need to calculate the flags you're interested in:
    key  value  key_duplicated  value_2
0  abcd      1           False        2
1  wxyz      5           False        5
2  abcd      3            True        2
Note that if the key is not present in df2, value_2 will be NaN.
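From there, the flags could be derived from `key_duplicated` and a comparison of `value` with `value_2`. This is only a minimal sketch: it assumes NumPy is available, that keys in df2 are unique (otherwise the join produces extra rows), and that a key missing from df2 should be flagged "change", which the question does not specify. The column name `xcheck_flag` is taken from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# Mark every occurrence of a key after the first one
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index('key'), rsuffix='_2', on='key')

# Conditions are checked in order: repeated keys win over the value comparison
df3['xcheck_flag'] = np.select(
    [df3['key_duplicated'], df3['value'] == df3['value_2']],
    ['additional change', 'no change'],
    default='change',  # assumption: also covers keys missing from df2 (NaN)
)
print(df3)
```

Because the duplicate check comes first in the condition list, a repeated key is labelled "additional change" regardless of whether its value happens to match df2.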
CodePudding user response:
One solution could be to use a dictionary to track which keys have already been seen:
def check(key, value, df2):
    flag = ''
    if seen[key] > 0:
        # This key was already handled on an earlier row
        flag = 'additional change'
    else:
        # First occurrence: compare with the value stored in df2
        # (this raises an IndexError if the key is missing from df2)
        if value == df2[df2['key']==key]['value'].tolist()[0]:
            flag = 'no change'
        else:
            flag = 'change'
    seen[key] = 1
    return flag
seen = {k: 0 for k in df1['key'].tolist()}
df1['flag'] = df1.apply(lambda row: check(row['key'], row['value'], df2), axis=1)
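The same idea might also be written without the global `seen` dictionary, by numbering the occurrences of each key with `groupby(...).cumcount()`. This is a sketch under assumptions rather than a drop-in replacement: it assumes keys in df2 are unique (mapping through a Series with a duplicated index fails), and it labels keys missing from df2 as "change", which the question leaves open.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# 0 for the first occurrence of a key in df1, 1 for the second, and so on
occurrence = df1.groupby('key').cumcount()

# Value stored in df2 for each key of df1 (NaN when the key is missing)
df2_value = df1['key'].map(df2.set_index('key')['value'])

# Start from 'change', then overwrite the more specific cases
flag = pd.Series('change', index=df1.index)
flag.loc[df1['value'] == df2_value] = 'no change'
flag.loc[occurrence > 0] = 'additional change'
df1['flag'] = flag
print(df1)
```

The last assignment is applied after the value comparison, so repeated keys end up as "additional change" even if their value matches df2, mirroring the order of checks in `check` above.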