For example, suppose I have a large dataframe of all the individuals in a zoo, and two of its columns are Animal_Common_Name and Animal_Scientific_Name. I suspect one of them is redundant, since each is completely determined by the other. Basically, they are the same characteristic under a different name.
Is there a function that, given two columns, tells you whether this is the case?
CodePudding user response:
df['Animal_Common_Name'].equals(df['Animal_Scientific_Name'])
This returns True if the two columns hold identical values row by row, and False otherwise.
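Note that equals only detects columns whose values are literally identical. If the columns use different labels for the same thing, as with the common and scientific names in the question, one possible approach is to check that the values map one-to-one. The sketch below is only an illustration: the helper name is_one_to_one and the sample data are made up for this example, and missing values are not treated specially.

```python
import pandas as pd

def is_one_to_one(df, col_a, col_b):
    """Hypothetical helper: True if each value in col_a pairs with exactly
    one value in col_b, and vice versa."""
    # Keep each distinct (col_a, col_b) pair once
    pairs = df[[col_a, col_b]].drop_duplicates()
    # In a one-to-one mapping, no value repeats on either side of the pairs
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)

# Made-up sample data for illustration
zoo = pd.DataFrame({
    'Animal_Common_Name': ['Lion', 'Zebra', 'Lion', 'Tiger'],
    'Animal_Scientific_Name': ['Panthera leo', 'Equus quagga',
                               'Panthera leo', 'Panthera tigris'],
})

print(is_one_to_one(zoo, 'Animal_Common_Name', 'Animal_Scientific_Name'))
```

If this prints True, either column can be dropped without losing information, at least for the rows currently in the dataframe.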
CodePudding user response:
You can use the pandas.Series.equals() method.
For example:
import pandas as pd
data = {
    'Column1': [1, 2, 3, 4],
    'Column2': [1, 2, 3, 4],
    'Column3': [5, 6, 7, 8]
}
df = pd.DataFrame(data)
# True
print(df['Column1'].equals(df['Column2']))
# False
print(df['Column1'].equals(df['Column3']))
Found via GeeksForGeeks
CodePudding user response:
You can use the vectorized operations of pandas to quickly determine your redundancies. Here's an example:
import pandas as pd
# create a sample dataframe from some data
d = {'name1': ['Zebra', 'Lion', 'Seagull', 'Spider'],
     'name2': ['Zebra', 'Lion', 'Bird', 'Insect']}
df = pd.DataFrame(data=d)
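# create a new column for your test (object dtype, so it can hold both '' and 1):
df['is_redundant'] = pd.Series('', index=df.index, dtype=object)
# mark the rows where the two columns match; .loc avoids chained assignment,
# which triggers SettingWithCopyWarning and may not modify df at all:
df.loc[df['name1'] == df['name2'], 'is_redundant'] = 1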
print(df)
     name1   name2 is_redundant
0    Zebra   Zebra            1
1     Lion    Lion            1
2  Seagull    Bird
3   Spider  Insect
You can then replace the empty strings with 0, or leave them as they are, depending on your application.
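If you want 0 and 1 from the start, a shorter variant is to convert the boolean comparison directly. This is a sketch reusing the same sample data, not the only way to do it:

```python
import pandas as pd

df = pd.DataFrame({'name1': ['Zebra', 'Lion', 'Seagull', 'Spider'],
                   'name2': ['Zebra', 'Lion', 'Bird', 'Insect']})

# The elementwise comparison yields a boolean Series;
# astype(int) converts True/False to 1/0
df['is_redundant'] = (df['name1'] == df['name2']).astype(int)
print(df['is_redundant'].tolist())  # [1, 1, 0, 0]

# The columns are fully identical only if every row is flagged
print(bool(df['is_redundant'].all()))  # False
```

The final all() check gives a single yes/no answer for the whole column pair, which is what you need when deciding whether a column can be dropped.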