I asked a similar question a while ago and the solution works until there's a case where the string has more than two words - "dragon fruit". Also there're some duplicates cannot be removed completely. How can I overcome this edge case?
data = {'fruit1':["organge, apple", "apple", "organge, dragon fruit", "organge, others", "others"],
'fruit2':["apple, organge", "others", "dragon fruit, organge", "watermelon", "others"]}
df = pd.DataFrame(data)
df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan)
.apply(lambda x: ' '.join(pd.unique(x.dropna())), axis=1)
.replace('', 'others')
)
fruit1 fruit2
0 organge, apple apple, organge
1 apple others
2 organge, dragon fruit dragon fruit, organge
3 organge, others watermelon
4 others others
Expected result:
fruit1 fruit2 together
0 organge, apple apple, organge apple, organge
1 apple others apple
2 organge, dragon fruit dragon fruit, organge organge, dragon fruit
3 organge, others watermelon organge, watermelon
4 others others others
CodePudding user response:
You need regex=True
to really strip the others
:
df["together"] = (df[['fruit1', 'fruit2']].replace('others', np.nan, regex=True)
.apply(lambda x: ', '.join(pd.unique(sum(x.str.split(', ').dropna(), []))), axis=1)
.replace('', 'others')
)
Output:
>>> df
fruit1 fruit2 together
0 organge, apple apple, organge organge, apple
1 apple others apple
2 organge, dragon fruit dragon fruit, organge organge, dragon fruit
3 organge, others watermelon watermelon
4 others others others
>>>
CodePudding user response:
You could use:
df['together'] = (df.apply(lambda c: c.str.split(', '))
.apply(lambda r: ', '.join(sorted(set(r['fruit1']).union(r['fruit2']).difference(['others']))), axis=1)
.replace('', 'others')
)
or:
df['together'] = (df.apply(lambda r: ', '.join(sorted(set(r['fruit1'].split(', ')).union(r['fruit2'].split(', ')).difference(['others']))), axis=1)
.replace('', 'others')
)
output:
fruit1 fruit2 together
0 organge, apple apple, organge apple, organge
1 apple others apple
2 organge, dragon fruit dragon fruit, organge dragon fruit, organge
3 organge, others watermelon organge, watermelon
4 others others others
CodePudding user response:
You can split each column with comma, concat the list of entries of all columns, then join text after removing duplicates and excluding others
. Finally, replace empty string with a single others
, as follows:
df["together"] = (
df[['fruit1', 'fruit2']]
.apply(lambda x: x.str.split(',\s*'))
.sum(axis=1).map(np.array)
.apply(lambda x: ', '.join(pd.unique(x[x != 'others'])))
.replace('', 'others')
)
Result:
print(df)
fruit1 fruit2 together
0 organge, apple apple, organge organge, apple
1 apple others apple
2 organge, dragon fruit dragon fruit, organge organge, dragon fruit
3 organge, others watermelon organge, watermelon
4 others others others
If you want the final entries be sorted in ascending order, you can replace pd.unique
by np.unique
, as follows:
df["together"] = (
df[['fruit1', 'fruit2']]
.apply(lambda x: x.str.split(',\s*'))
.sum(axis=1).map(np.array)
.apply(lambda x: ', '.join(np.unique(x[x != 'others'])))
.replace('', 'others')
)
Result:
print(df)
fruit1 fruit2 together
0 organge, apple apple, organge apple, organge
1 apple others apple
2 organge, dragon fruit dragon fruit, organge dragon fruit, organge
3 organge, others watermelon organge, watermelon
4 others others others