Join string columns and remove duplicates and others in Pandas-CodePudding

I asked a similar question a while ago and the solution works until there's a case where the string has more than two words - "dragon fruit". Also there're some duplicates cannot be removed completely. How can I overcome this edge case?

data = {'fruit1':["organge, apple", "apple", "organge, dragon fruit", "organge, others", "others"],
        'fruit2':["apple, organge", "others", "dragon fruit, organge", "watermelon", "others"]}
df = pd.DataFrame(data)

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
                   .apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
                   .replace('', 'others')
                 )

                  fruit1                 fruit2
0         organge, apple         apple, organge
1                  apple                 others
2  organge, dragon fruit  dragon fruit, organge
3        organge, others             watermelon
4                 others                 others

Expected result:

                  fruit1                 fruit2                              together
0         organge, apple         apple, organge                        apple, organge
1                  apple                 others                                 apple
2  organge, dragon fruit  dragon fruit, organge                 organge, dragon fruit
3        organge, others             watermelon                   organge, watermelon
4                 others                 others                                others

CodePudding user response：

You need regex=True to really strip the others:

df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan, regex=True)
                   .apply(lambda x: ', '.join(pd.unique(sum(x.str.split(', ').dropna(), []))), axis=1)
                   .replace('', 'others')
                 )

Output:

>>> df
                  fruit1                 fruit2               together
0         organge, apple         apple, organge         organge, apple
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  organge, dragon fruit
3        organge, others             watermelon             watermelon
4                 others                 others                 others
>>>

CodePudding user response：

You could use:

df['together'] = (df.apply(lambda c: c.str.split(', '))
                    .apply(lambda r: ', '.join(sorted(set(r['fruit1']).union(r['fruit2']).difference(['others']))), axis=1)
                    .replace('', 'others')
                  )

or:

df['together'] = (df.apply(lambda r: ', '.join(sorted(set(r['fruit1'].split(', ')).union(r['fruit2'].split(', ')).difference(['others']))), axis=1)
                    .replace('', 'others')
                  )

output:

                  fruit1                 fruit2               together
0         organge, apple         apple, organge         apple, organge
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  dragon fruit, organge
3        organge, others             watermelon    organge, watermelon
4                 others                 others                 others

CodePudding user response：

You can split each column with comma, concat the list of entries of all columns, then join text after removing duplicates and excluding others. Finally, replace empty string with a single others, as follows:

df["together"] = (
    df[['fruit1', 'fruit2']]
        .apply(lambda x: x.str.split(',\s*'))
        .sum(axis=1).map(np.array)
        .apply(lambda x: ', '.join(pd.unique(x[x != 'others'])))        
        .replace('', 'others')
    )

Result:

print(df)

                  fruit1                 fruit2               together
0         organge, apple         apple, organge         organge, apple
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  organge, dragon fruit
3        organge, others             watermelon    organge, watermelon
4                 others                 others                 others

If you want the final entries be sorted in ascending order, you can replace pd.unique by np.unique, as follows:

df["together"] = (
    df[['fruit1', 'fruit2']]
        .apply(lambda x: x.str.split(',\s*'))
        .sum(axis=1).map(np.array)
        .apply(lambda x: ', '.join(np.unique(x[x != 'others'])))        
        .replace('', 'others')
    )

Result:

print(df)

                  fruit1                 fruit2               together
0         organge, apple         apple, organge         apple, organge
1                  apple                 others                  apple
2  organge, dragon fruit  dragon fruit, organge  dragon fruit, organge
3        organge, others             watermelon    organge, watermelon
4                 others                 others                 others