iterate on rows of dataframe with conditional flag value in python

Time:07-30

I'm cross-checking two dataframes to assign a value to a flag. If a specific key is present in both dataframes with a different value, the flag is set to "change" for that row. If the value is the same, the flag is set to "no change". However, if a specific key is present more than once in only one of the two dataframes, the extra rows get the flag "additional change". Let me give an example to make it clearer:

df 1:

key   value  key value present in df 2
abcd  1      False
wxyz  5      True

df 2:

key   value  key value present in df 1
abcd  2      False
wxyz  5      True

Then the result will be for dataframe 1:

df 1:

key   value  key value present in df 2  xcheck_flag
abcd  1      False                      change
wxyz  5      True                       no change

To get this result I use the following logic:

def changeType(df1):
    # Flag each row by whether its key/value pair was found in df 2.
    def condition_check(row):
        if not row['key value present in df 2']:
            return 'change'
        return 'no change'

    # Adds the flag column to df1 in place.
    df1['xcheck_flag'] = df1.apply(condition_check, axis=1)

Now this is rather straightforward, right? Well, I have a complication which I haven't been able to solve yet.

Imagine the following use case:

df 1:

key   value  key value present in df 2
abcd  1      False
wxyz  5      True
abcd  3      False

df 2:

key   value  key value present in df 1
abcd  2      False
wxyz  5      True

In this case, the key abcd appears twice in df 1 and only once in df 2. If this happens, I need to apply the following logic when doing the cross-dataframe check: the first time the key is matched against dataframe 2, the flag is set to "change" as in the previous case; the second time the key is matched, the flag is set to "additional change". It doesn't matter which row from df 1 gets "change" and which gets "additional change". The only condition is that, in such a case, exactly one key-value pair is flagged "change" and all other occurrences of that key are flagged "additional change".

This gives us:

df 1:

key   value  key value present in df 2  xcheck_flag
abcd  1      False                      change
wxyz  5      True                       no change
abcd  3      False                      additional change

I've been trying to adapt my initial function to include this behaviour but without success.

If you have any hints, they would be very welcome!

CodePudding user response:

I would probably do something like this:

import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index(['key']), rsuffix='_2', on=['key'])

This gives you a dataframe that I think contains all the columns you need to calculate the flags you're interested in:

    key    value  key_duplicated  value_2
0  abcd        1           False        2
1  wxyz        5           False        5
2  abcd        3            True        2

Note that if a key is not present in df2, value_2 will be NaN for that row.
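From df3, the flag itself could be derived in a vectorized way. The following is only a sketch, not part of the original answer: it assumes keys are unique in df2, that the first occurrence of a key in df1 (in row order) is the one compared against df2, and it uses numpy.select to pick the first matching condition. The column name xcheck_flag is taken from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# Same preparation as above: mark repeated keys, then pull in df2's value.
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index('key'), rsuffix='_2', on='key')

# Conditions are evaluated in order; the first one that is True wins.
conditions = [
    df3['key_duplicated'],            # second and later occurrences of a key
    df3['value'] == df3['value_2'],   # first occurrence, same value in df2
]
choices = ['additional change', 'no change']

# Anything else (first occurrence with a different value) is a plain change.
df3['xcheck_flag'] = np.select(conditions, choices, default='change')
print(df3[['key', 'value', 'xcheck_flag']])
```

In this sketch, a key that is missing from df2 has value_2 equal to NaN, so the comparison is False and its first occurrence falls through to 'change'. If such rows need a different flag, an extra condition on df3['value_2'].isna() could be placed before the others.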

CodePudding user response:

One solution could be using a dictionary to store the number of occurrences of each key:

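def check(key, value, df2):
    # Any occurrence after the first one for this key is an extra row.
    if seen[key] > 0:
        flag = 'additional change'
    # First occurrence: compare with the value stored for this key in df2
    # (assumes the key exists in df2, otherwise the [0] lookup raises IndexError).
    elif value == df2[df2['key'] == key]['value'].tolist()[0]:
        flag = 'no change'
    else:
        flag = 'change'
    seen[key] += 1
    return flag

# Occurrence counter per key, starting at 0.
seen = {k: 0 for k in df1['key'].tolist()}
df1['flag'] = df1.apply(lambda row: check(row['key'], row['value'], df2), axis=1)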
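The same counting idea can probably be expressed without a global dictionary or a per-row filter on df2. This is a sketch under assumptions rather than the answer's own method: groupby(...).cumcount() numbers the occurrences of each key, and Series.map looks up the df2 value, which requires keys to be unique in df2.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# Position of each row within its key group: 0 for the first occurrence.
occurrence = df1.groupby('key').cumcount()

# Value df2 holds for each key; NaN when the key is absent from df2.
# Assumes keys are unique in df2 (map fails on a duplicated index).
expected = df1['key'].map(df2.set_index('key')['value'])

df1['flag'] = np.where(
    occurrence > 0,
    'additional change',
    np.where(df1['value'] == expected, 'no change', 'change'),
)
print(df1)
```

Unlike the dictionary version, a key missing from df2 does not raise here; its first occurrence is simply flagged 'change', which may or may not be the desired behaviour.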