Is there any way in a Python dataframe to see if two columns are the same but with renamed values?

For example, suppose I have a large dataframe of all the individuals in a zoo, and two of its columns are Animal_Common_Name and Animal_Scientific_Name. I suspect one of them is redundant, because each one is completely determined by the other. They are basically the same characteristic with renamed values.

Is there any function that, given two columns, tells you whether this is the case?

CodePudding user response：

df['Animal_Common_Name'].equals(df['Animal_Scientific_Name'])

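This returns True if the two columns hold the same values row for row, and False otherwise. Note that it tests literal equality, so two columns whose values were merely renamed will compare as False.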

CodePudding user response：

You can use the pandas.Series.equals() method.

For example:

import pandas as pd

data = {
    'Column1': [1, 2, 3, 4],
    'Column2': [1, 2, 3, 4],
    'Column3': [5, 6, 7, 8]
}

df = pd.DataFrame(data)

# True
print(df['Column1'].equals(df['Column2']))

# False
print(df['Column1'].equals(df['Column3']))

Found via GeeksForGeeks
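Two details of equals() may be worth knowing before relying on it. The sketch below, using made-up Series, shows that missing values in the same position count as equal (unlike an element-wise == comparison) and that the dtypes must also match.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# equals() treats NaN in the same position as equal.
nan_equal = bool(a.equals(b))

# An element-wise comparison does not, because NaN == NaN is False.
elementwise_equal = bool((a == b).all())

# equals() also compares dtypes, so an int column and a float column differ.
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.0, 3.0])
dtype_equal = bool(ints.equals(floats))

print(nan_equal, elementwise_equal, dtype_equal)
```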

CodePudding user response：

You can use pandas' vectorized operations to quickly flag the rows where two columns agree. Here's an example:

import pandas as pd

# create a sample dataframe from some data
d = {'name1': ['Zebra', 'Lion', 'Seagull', 'Spider'],
     'name2': ['Zebra', 'Lion', 'Bird', 'Insect']}
df = pd.DataFrame(data=d)

# create a new, initially empty column for the test result:
df['is_redundant'] = ''

# mark the rows where both columns hold the same value
# (.loc avoids chained assignment, which may not update df):
df.loc[df['name1'] == df['name2'], 'is_redundant'] = 1

print(df)


     name1   name2 is_redundant
0    Zebra   Zebra            1
1     Lion    Lion            1
2  Seagull    Bird
3   Spider  Insect

You can then replace the empties with 0 or leave as is depending on your application.
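If a plain 0/1 flag is all that is needed, the comparison can be stored directly, which skips the empty-string step. This is a sketch of one possible variant on the same sample data, not the only way to do it; the all_match name is made up for illustration.

```python
import pandas as pd

d = {'name1': ['Zebra', 'Lion', 'Seagull', 'Spider'],
     'name2': ['Zebra', 'Lion', 'Bird', 'Insect']}
df = pd.DataFrame(data=d)

# The comparison yields a boolean Series; casting to int gives 1 for a match and 0 otherwise.
df['is_redundant'] = (df['name1'] == df['name2']).astype(int)

# True only if every row matches, i.e. the two columns are exact duplicates.
all_match = bool(df['is_redundant'].all())

print(df)
print(all_match)
```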