Home > Net >  Join string columns and remove duplicates and others in Pandas
Join string columns and remove duplicates and others in Pandas

Time:10-18

I asked a similar question a while ago and the solution works until there's a case where the string has more than two words - "dragon fruit". Also there're some duplicates cannot be removed completely. How can I overcome this edge case?

data = {'fruit1':["organge, apple", "apple", "organge, dragon fruit", "organge, others", "others"],
        'fruit2':["apple, organge", "others", "dragon fruit, organge", "watermelon", "others"]}
df = pd.DataFrame(data)

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
                   .apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
                   .replace('', 'others')
                 )

                  fruit1                 fruit2
0         organge, apple         apple, organge
1                  apple                 others
2  organge, dragon fruit  dragon fruit, organge
3        organge, others             watermelon
4                 others                 others

Expected result:

                  fruit1                 fruit2                              together
0         organge, apple         apple, organge                        apple, organge
1                  apple                 others                                 apple
2  organge, dragon fruit  dragon fruit, organge                 organge, dragon fruit
3        organge, others             watermelon                   organge, watermelon
4                 others                 others                                others

CodePudding user response:

You need regex=True to really strip the others:

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan, regex=True)
                   .apply(lambda x: ', '.join(pd.unique(sum(x.str.split(', ').dropna(), []))), axis=1)
                   .replace('', 'others')
                 )

Output:

>>> df
                  fruit1                 fruit2               together
0         organge, apple         apple, organge         organge, apple
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  organge, dragon fruit
3        organge, others             watermelon             watermelon
4                 others                 others                 others
>>> 

CodePudding user response:

You could use:

df['together'] = (df.apply(lambda c: c.str.split(', '))
                    .apply(lambda r: ', '.join(sorted(set(r['fruit1']).union(r['fruit2']).difference(['others']))), axis=1)
                    .replace('', 'others')
                  )

or:

df['together'] = (df.apply(lambda r: ', '.join(sorted(set(r['fruit1'].split(', ')).union(r['fruit2'].split(', ')).difference(['others']))), axis=1)
                    .replace('', 'others')
                  )

output:

                  fruit1                 fruit2               together
0         organge, apple         apple, organge         apple, organge
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  dragon fruit, organge
3        organge, others             watermelon    organge, watermelon
4                 others                 others                 others

CodePudding user response:

You can split each column with comma, concat the list of entries of all columns, then join text after removing duplicates and excluding others. Finally, replace empty string with a single others, as follows:

df["together"] = (
    df[['fruit1', 'fruit2']]
        .apply(lambda x: x.str.split(',\s*'))
        .sum(axis=1).map(np.array)
        .apply(lambda x: ', '.join(pd.unique(x[x != 'others'])))        
        .replace('', 'others')
    )

Result:

print(df)

                  fruit1                 fruit2               together
0         organge, apple         apple, organge         organge, apple
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  organge, dragon fruit
3        organge, others             watermelon    organge, watermelon
4                 others                 others                 others

If you want the final entries be sorted in ascending order, you can replace pd.unique by np.unique, as follows:

df["together"] = (
    df[['fruit1', 'fruit2']]
        .apply(lambda x: x.str.split(',\s*'))
        .sum(axis=1).map(np.array)
        .apply(lambda x: ', '.join(np.unique(x[x != 'others'])))        
        .replace('', 'others')
    )

Result:

print(df)

                  fruit1                 fruit2               together
0         organge, apple         apple, organge         apple, organge
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  dragon fruit, organge
3        organge, others             watermelon    organge, watermelon
4                 others                 others                 others
  • Related