How to map categorical columns from two dataframes?-CodePudding

I have two dataframes and in the column 'column_to_compare value from df1' I have "sub-categorical" data which I want to match to the "names" to its "categories".

I listed the matching categories below in the matching_cat variable. I don't know if it is the right way to associate my variable to a category.

I want to create a new dataframe which compares these two dataframes with a common id and on the columns categories and names.

    import pandas as pd
data1 = {'key_col': ['12563','12564','12565','12566'],
    'categories': ['bird', 'dog', 'insect','insect'],'column3': ['some','other','data','there']
}
df1 = pd.DataFrame(data1)


df1

data2 = { 'key_col': ['12563','12564','12565','12566','12567'],
          'names': ['falcon', 'golden retriever', 'doberman','spider','spider'],
'column_randomn': ['some','data','here','here','here'] }

df2 = pd.DataFrame(data2)

df2


matching_cat = {'bird':  ['falcon', 'toucan', 'eagle'],
            'dog': ['golden retriever', 'doberman'],
              'insect':['spider','mosquito'] }

So, here are my two dataframes:

And I want to be able to "map" the values with the categories and output this:

CodePudding user response：

Ok using your example, here is what I came up with:

import pandas as pd

data1 = {'key_col': ['12563','12564','12565','12566'],
    'categories': ['bird', 'dog', 'insect','insect'],'column3': ['some','other','data','there']
}
df1 = pd.DataFrame(data1)

data2 = { 'key_col': ['12563','12564','12565','12566','12567'],
          'names': ['falcon', 'golden retriever', 'doberman','spider','spider'],
'column_randomn': ['some','data','here','here','here'] }

df2 = pd.DataFrame(data2)

matching_category = {'bird': ['falcon', 'toucan', 'eagle'],
                      'dog': ['golden retriever', 'doberman'],
                   'insect': ['spider','mosquito'] }


# Function to compare rows against matching_category
def test(row):
    try:
        if row['names'] in matching_category[row['categories']]:
            val = True
        else:
            val = False
    except:
        val=False

    return val

# Merge the two dataframes based on 'key_col'
df3 = df1.merge(df2, how='outer', on='key_col')

# Call the test function on each row in the new dataframe (df3)
df3['new_col'] = df3.apply(test, axis=1)

# Drop unwanted columns
df3 = df3.drop(['column3', 'column_randomn'], axis=1)


# Create new dataframes based on whether output matches or not
output_matches = df3[df3['new_col']==True]
output_mismatches = df3[df3['new_col']==False]


# Display the dataframes 
print('OUTPUT MATCHES:')
print('================================================')
print(output_matches)

print("")

print('OUTPUT MIS-MATCHES:')
print('================================================')
print(output_mismatches)

OUTPUT:

OUTPUT MATCHES:
================================================
  key_col categories             names  new_col
0   12563       bird            falcon     True
1   12564        dog  golden retriever     True
3   12566     insect            spider     True

OUTPUT MIS-MATCHES:
================================================
  key_col categories     names  new_col
2   12565     insect  doberman    False
4   12567        NaN    spider    False