Function Looping Over Multiple Dataframes to Create New Categorical Data Column-CodePudding

I'm trying to create a new category column from two different Pandas dataframes that contain the same columns using Pandas.

The new column looks at both Df1 and Df2's 'Loan Code/Number' column, determines if they're the same then determines if the 'Del_Cat' category changed from Df1 to Df2, and if it changed categories, it will return a new result.

DF1:

Loan Code/Number	Days Delinquent	Del_Cat
1147623994	-25	Current
1501719058	-5	Current
1501719696	77	61-90
1502624989	87	61-90
1502625152	16	Current

DF2:

Loan Code/Number	Days Delinquent	Del_Cat
1147623994	-22	Current
1801719152	37	31-60
1501719696	84	61-90
1602624414	-6	Current
1502625152	55	31-60

I've tried creating a function that loops over the values in the tables and determines if the loan number is the same, and if it is then to check the values in the 'Del_Cat' column, bringing back a new value that states if the value has changed:

def del_cat_migration(df1, df2): 
    if [df1['Loan Code/Number'] == df2['Loan Code/Number']]:
        if (df1['Del_Cat']) == 'Current' & (df2['Del_Cat']) == 'Current':
            return 'Same - Current'
        elif (df1['Del_Cat']) == 'Current' & (df2['Del_Cat']) == '31-60':
            return 'Current to 31-60'
        elif (df1['Del_Cat']) == '31-60' & (df2['Del_Cat']) == '31-60':
            return 'Same - 31-60'
        elif (df1['Del_Cat']) == '31-60' & (df2['Del_Cat']) == '61-90':
            return '31-60 to 61-90'
        elif (df1['Del_Cat']) == '61-90' & (df2['Del_Cat']) == '61-90':
            return 'Same - 61-90'
        elif (df1['Del_Cat']) == '61-90' & (df2['Del_Cat']) == 'Current':
            return '61-90 to Current'
        else:
            return 'Other'

I then try to apply this new data to a new column in df2:

df2['Del_Migration_Cat'] = df2.apply(del_cat_migration(df1, df2), axis = 1)

I then get the following error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

CodePudding user response：

There's ways to do such things in pandas:

def del_cat_migration(row):
    if row[0] == row[1]:
        return f"Same - {row[0]}"
    
    return f"{row[0]} to {row[1]}"

df3 = pd.concat([df1["Del_Cat"], df2["Del_Cat"]], axis=1)

df3 looks like

   Del_Cat  Del_Cat
0  Current  Current
1  Current    31-60
2    61-90    61-90
3    61-90  Current
4  Current    31-60

Now apply the function, row-wise (axis=1):

>>> df3.apply(del_cat_migration, axis=1)
0      Same - Current
1    Current to 31-60
2        Same - 61-90
3    61-90 to Current
4    Current to 31-60

CodePudding user response：

The cause: you cannot do the comparison in if statement between two columns of a dataframe because the result of this comparison is a boolean series [True, False, ....] which is invalid condition, therefore, you should use a.empty, a.bool(), a.item(), a.any() or a.all() as indicated by the error log. The suggested option is invalid because you need only to compare between each entry in df1 and the corresponding entry in df2. I fixed this issue as in the solution below and also correct some syntax errors in the function.

But again, your function will have a problem with df2.apply() will iterate through each row and call the function in each row with the entire dataframes which is not valid operation, as well.

def del_cat_migration(df1, df2):
    bool_index = (df1['Loan Code/Number'] == df2['Loan Code/Number'])
    
    for i in df1.index[bool_index]:
        if (df1.loc[i,'Del_Cat'] == 'Current') & (df2.loc[i,'Del_Cat'] == 'Current'):
        df2.loc[i,'Del_Migration_Cat'] = 'Same - Current'
    elif (df1.loc[i,'Del_Cat'] == 'Current') & (df2.loc[i,'Del_Cat'] == '31-60'):
        df2.loc[i,'Del_Migration_Cat'] = 'Current to 31-60'
    elif (df1.loc[i,'Del_Cat'] == '31-60') & (df2.loc[i,'Del_Cat']  == '31-60'):
        df2.loc[i,'Del_Migration_Cat'] = 'Same - 31-60'
    elif (df1.loc[i,'Del_Cat'] == '31-60') & (df2.loc[i,'Del_Cat'] == '61-90'):
        df2.loc[i,'Del_Migration_Cat'] = '31-60 to 61-90'
    elif (df1.loc[i,'Del_Cat'] == '61-90') & (df2.loc[i,'Del_Cat'] == '61-90'):
        df2.loc[i,'Del_Migration_Cat'] = 'Same - 61-90'
    elif (df1.loc[i,'Del_Cat'] == '61-90') & (df2.loc[i,'Del_Cat']  == 'Current'):
        df2.loc[i,'Del_Migration_Cat'] = '61-90 to Current'
    else:
        df2.loc[i,'Del_Migration_Cat'] = 'Other'

del_cat_migration(df1, df2)

But I prefer to follow the answer of @fsimonjetz, however, you need for an extra dataframe df3.

CodePudding user response：

first, let's merge df's on 'Loan Code/Number' column, then we can apply function:

df3 = pd.merge(df1, df2[['Loan Code/Number','Del_Cat']], 
               on='Loan Code/Number',
               how='left',
               suffixes=(None, '_2'))
print(df3)
'''
   Loan Code/Number  Days Delinquent  Del_Cat Del_Cat_2
0        1147623994              -25  Current   Current
1        1501719058               -5  Current       NaN
2        1501719696               77    61-90     61-90
3        1502624989               87    61-90       NaN
4        1502625152               16  Current     31-60
'''
df3 = (df3.assign(Del_Cat=df3.apply(lambda x: f"same {x['Del_Cat']}" 
                                    if x['Del_Cat']==x['Del_Cat_2'] 
                                    else 'Other' if x['Del_Cat_2']!=x['Del_Cat_2']
                                    else f"{x['Del_Cat']} to {x['Del_Cat_2']}", axis=1)).
       drop(columns='Del_Cat_2'))
print(df3)
'''
   Loan Code/Number  Days Delinquent           Del_Cat
0        1147623994              -25      same Current
1        1501719058               -5             Other
2        1501719696               77        same 61-90
3        1502624989               87             Other
4        1502625152               16  Current to 31-60