pandas: compare columns row-wise and remove duplicates compared to the first column

Time:12-13

I have a dataframe as follows:

import pandas as pd
data = {'name': ['the weather is good', ' we need fresh air','today is sunny', 'we are lucky'],
        'name_1': ['we are lucky','the weather is good', ' we need fresh air','today is sunny'],
        'name_2': ['the weather is good', 'today is sunny', 'we are lucky',' we need fresh air'],
        'name_3': [ 'today is sunny','the weather is good',' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)

I would like to compare the columns row-wise (that is, compare values that share the same index) and replace a value with the word 'same' if it equals the value in the first column of that row. My desired output is:

                  name               name_1               name_2  \
0  the weather is good         we are lucky               same   
1    we need fresh air  the weather is good       today is sunny   
2       today is sunny    we need fresh air         we are lucky   
3         we are lucky       today is sunny    we need fresh air   

                name_3  
0       today is sunny  
1  the weather is good  
2    we need fresh air  
3           same

To find these values I tried the following:

import numpy as np
np.where(df['name'].eq(df['name_1'])|df['name'].eq(df['name_2'])|df['name'].eq(df['name_3']))

But I do not know how to formulate the (condition, x, y) arguments of np.where() to replace them. The following returns 'same' for every row of columns 'name' and 'name_3':

np.where(df['name'].eq(df['name_1'])|df['name'].eq(df['name_2'])|df['name'].eq(df['name_3']),'same',df)

CodePudding user response:

IIUC, you want to check which values in the columns 'name_1', 'name_2' and 'name_3' equal the value in column 'name' on the same row, and if so replace them with 'same', otherwise leave them as they are. You are right to use numpy.where, but apply it one column at a time, so that the condition and the values have the same shape:

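import numpy as np

# columns to compare against the first column, 'name'
cols = ['name_1', 'name_2', 'name_3']
for c in cols:
    # keep the original value unless it equals 'name' on the same row
    df[c] = np.where(df['name'].eq(df[c]), 'same', df[c])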

Which gives you:

                  name               name_1              name_2  \
0  the weather is good         we are lucky                same   
1    we need fresh air  the weather is good      today is sunny   
2       today is sunny    we need fresh air        we are lucky   
3         we are lucky       today is sunny   we need fresh air   

                name_3  
0       today is sunny  
1  the weather is good  
2    we need fresh air  
3                 same  
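
As a side note, here is a rough sketch of why the attempt in the question marked whole columns, plus a loop-free variant. It assumes the same df as in the question; the names row_has_dup, broadcast, same_as_first and result are only illustrative. The condition in the question has one boolean per row (shape (4,)), and np.where lines a 1-D condition up with the last axis of the 4x4 frame, so each boolean ends up selecting a column rather than a row. Building a 2-D condition with DataFrame.eq(..., axis=0) and passing it to DataFrame.mask avoids that:

```python
import numpy as np
import pandas as pd

data = {'name': ['the weather is good', ' we need fresh air', 'today is sunny', 'we are lucky'],
        'name_1': ['we are lucky', 'the weather is good', ' we need fresh air', 'today is sunny'],
        'name_2': ['the weather is good', 'today is sunny', 'we are lucky', ' we need fresh air'],
        'name_3': ['today is sunny', 'the weather is good', ' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)
cols = ['name_1', 'name_2', 'name_3']

# Same condition as in the question: one boolean per ROW, shape (4,).
row_has_dup = (df['name'].eq(df['name_1'])
               | df['name'].eq(df['name_2'])
               | df['name'].eq(df['name_3'])).to_numpy()

# Against the (4, 4) frame, the 1-D condition is broadcast along the last
# axis, so row_has_dup[j] decides the fate of COLUMN j, not row j.
broadcast = np.where(row_has_dup, 'same', df)

# 2-D condition: compare every column in cols with 'name', aligned on the index.
same_as_first = df[cols].eq(df['name'], axis=0)

# mask() replaces values where the condition is True and keeps the rest.
result = df.copy()
result[cols] = df[cols].mask(same_as_first, 'same')
print(result)
```

For this data, row_has_dup is True for rows 0 and 3, which is why columns 0 ('name') and 3 ('name_3') were overwritten in the original attempt. The mask version should produce the same frame as the loop above; which one reads better is mostly a matter of taste.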