Compare two columns and remove duplicates on row

I am trying to remove duplicates on each row in a column, comparing with the correspondent row in the second column (same df) For example, if column initial_data contains value present in additional_data, then remove that value from initial_data row and create a new column with that data.

my df

initial_data                                          additional_data
HJC, casco helmets integral de moto                   HJC Helmets
Silverstone Technology  Concentrador de Datos         SilverStone Technology
Urban Leather UR-52 Chaqueta de Cuero                 URBAN 5884

expected output

new_data
casco integral de moto
Concentrador de Datos 
Leather UR-52 Chaqueta de Cuero


df = [['HJC, casco helmets integral de moto', 'HJC Helmets'], ['Silverstone Technology  Concentrador de Datos ', 'SilverStone Technology'], ['Urban Leather UR-52 Chaqueta de Cuero', 'URBAN 5884']]
df = pd.DataFrame(df, columns = ['initial_data', 'additional_data'])

A few mentions, I need to keep the same order, same casing, basically, just to remove matching words, without changing anything else.

Thank you so much for any suggestion you may share. I have tried multiple versions before publishing the question but nothing worked (zip, lists, etc).

CodePudding user response：

not much of a comprehensive answer, but enough syntax to get you started:

a = df['initial_data'].str.lower().str.split().explode().reset_index()
b = df['additional_data'].str.lower().str.split().explode().reset_index()
'''
[df] a:
 ---- --------- ---------------- 
|    |   index | initial_data   |
|---- --------- ----------------|
|  0 |       0 | hjc            |
|  1 |       0 | casco          |
|  2 |       0 | helmets        |
|  3 |       0 | integral       |
|  4 |       0 | de             |
|  5 |       0 | moto           |
|  6 |       1 | silverstone    |
|  7 |       1 | technology     |
|  8 |       1 | concentrador   |
|  9 |       1 | de             |
| 10 |       1 | datos          |
| 11 |       2 | urban          |
| 12 |       2 | leather        |
| 13 |       2 | ur-52          |
| 14 |       2 | chaqueta       |
| 15 |       2 | de             |
| 16 |       2 | cuero          |
 ---- --------- ---------------- 
'''
a.columns=['index', 'new_data']
b.columns=['index', 'new_data']

b = b.loc[b['new_data'].isin(a['new_data'])]
'''
[df] b:
 ---- --------- ------------- 
|    |   index | new_data    |
|---- --------- -------------|
|  0 |       0 | hjc         |
|  1 |       0 | helmets     |
|  2 |       1 | silverstone |
|  3 |       1 | technology  |
|  4 |       2 | urban       |
 ---- --------- -------------     
'''
c = pd.concat([a, b], axis=0).drop_duplicates(keep=False) # << KEY IDEA/SYNTAX
'''
[df] c:
 ---- --------- -------------- 
|    |   index | new_data     |
|---- --------- --------------|
|  1 |       0 | casco        |
|  3 |       0 | integral     |
|  4 |       0 | de           |
|  5 |       0 | moto         |
|  8 |       1 | concentrador |
|  9 |       1 | de           |
| 10 |       1 | datos        |
| 12 |       2 | leather      |
| 13 |       2 | ur-52        |
| 14 |       2 | chaqueta     |
| 15 |       2 | de           |
| 16 |       2 | cuero        |
 ---- --------- --------------    
''' 
c.groupby('index')['new_data'].agg(lambda x: ' '.join(x))

output:

index
0             casco integral de moto
1              concentrador de datos
2    leather ur-52 chaqueta de cuero
Name: new_data, dtype: object

CodePudding user response：

This problem is a bit more complex than it might first appear. Generally, I think you need two puzzle pieces to solve it. First, you need a way to iterate through rows in a pandas DataFrame and generate a column value from the contents to two columns, which is where the apply() function is handy. Use the parameter axis=1 to specify that you're going row by row, not column by column.

Second, you need to break the strings into tokens so you can compare lists. This is sometimes called text alignment. In my example, I'm performing a one-way alignment, checking to see if any terms from the left column are absent from the right but, in theory, there could also be terms in the right column that aren't in the left column. To make sure the words will match, my function compares lowercase versions of all words and omits any punctuation, such as the comma in your example (though maybe you want to keep that?).

import string

mydict = {'initial_data':['HJC, casco helmets integral de moto', 'Silverstone Technology  Concentrador de Datos', 'Urban Leather UR-52 Chaqueta de Cuero'], 
          'additional_data':['HJC Helmets', 'SilverStone Technology', 'URBAN 5884'] }
df = pd.DataFrame(mydict)

def align_columns(row):
    left = row['initial_data'].split()
    right = row['additional_data'].split()
    
    unmatched = []
    for i in left:
        word = "".join([z for z in i.lower() if z not in string.punctuation])
        if word not in [r.lower() for r in right]:
            unmatched.append(i)
    return " ".join(unmatched)
df['new_data'] = df.apply(align_columns, axis=1)
df