pandas: removing duplicate values in rows with same index in two columns

Time:12-07

I have a dataframe as follows:

import numpy as np
import pandas as pd
df = pd.DataFrame({'text':['she is good', 'she is bad'], 'label':['she is good', 'she is good']})

I would like to compare the two columns row by row and, wherever the 'text' and 'label' values in the same row are identical, replace the value in the 'label' column with the word 'same'.

Desired output:

          text        label
0  she is good         same
1   she is bad  she is good

So far, I have tried the following:

df['label'] =np.where(df.query("text == label"), df['label']== ' ',df['label']==df['label'] )

but it raises an error:

ValueError: Length of values (1) does not match length of index (2)

CodePudding user response:

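Your call to np.where is malformed; have a look at the documentation of numpy.where. It takes a boolean condition, followed by the value to use where the condition is True and the value to use where it is False. In your attempt, the first argument is a filtered DataFrame rather than a boolean mask, and the other two arguments are comparisons rather than replacement values. Instead, check for equality between your two columns and replace the matching values in your label column: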

import numpy as np

# Where 'text' equals 'label' in the same row, write 'same'; otherwise keep the existing label.
df['label'] = np.where(df['text'].eq(df['label']), 'same', df['label'])

prints:

          text        label
0  she is good         same
1   she is bad  she is good
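
For reference, here is a self-contained sketch that rebuilds the example dataframe from the question and applies the same idea end to end. The variable names same_rows, via_where, and via_mask are only illustrative. The second option uses Series.mask, which is an alternative to the np.where approach above and not part of the original answer; it should give the same result while staying inside pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'text': ['she is good', 'she is bad'],
    'label': ['she is good', 'she is good'],
})

# Boolean mask: True for rows where both columns hold the same value.
same_rows = df['text'].eq(df['label'])

# Option 1: np.where(condition, value_if_true, value_if_false).
via_where = pd.Series(np.where(same_rows, 'same', df['label']), index=df.index)

# Option 2 (alternative): Series.mask replaces values where the condition is True.
via_mask = df['label'].mask(same_rows, 'same')

df['label'] = via_mask
print(df)
```

Either option can be assigned back to df['label']. Building the boolean mask once keeps the comparison readable and makes it easy to reuse.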