How to join all unique strings from multiple columns of each group in pandas-CodePudding

Consider the sample df below:

import pandas as pd

d = {'id': ["A123", "A123", "A123"],
     'text1': ["this is a sample", "this is a sample", "this is a sample"],
     'text2': ["sing with me", "one two three", "sing with me"]}
df = pd.DataFrame(data=d)

I'm trying to take the id column id and concat the unique values of each of the text columns, so that the sample df:

id    text1              text2
A123  this is a sample   sing with me
A123  this is a sample   one two three
A123  this is a sample   sing with me

Will look like this:

id    combined_text
A123  this is a sample | sing with me | one two three

I tried all sort of combination of " | ".join(x) and agg and more... I can take d['id','text1'].unique() and d['id','text2'].unique() and later merge, but there must be a more efficient way.

CodePudding user response：

melt and groupby the id, then join the set of values:

(df.melt('id').groupby('id')['value']
   .agg(lambda x: ' | '.join(set(x)))
   .reset_index(name='combined_text'))

	id	combined_text
0	A123	one two three \| sing with me \| this is a sample

Or if order is important, change set(x) to x.unique():

(df.melt('id').groupby('id')['value']
   .agg(lambda x: ' | '.join(x.unique()))
   .reset_index(name='combined_text'))

	id	combined_text
0	A123	this is a sample \| sing with me \| one two three

CodePudding user response：

I'd suggest using stack() and unique() here. I also broke this up by group, using the id column.

import pandas as pd

d = {'id': ["A123", "A123", "A123"],
     'text1': ["this is a sample", "this is a sample", "this is a sample"],
     'text 2': ["sing with me", "one two three", "sing with me"]}
df = pd.DataFrame(data=d)

df = pd.DataFrame(
    df.groupby('id')
      .apply(
          lambda df: ' | '.join(df[['text1', 'text 2']].stack().unique())
      ),
columns=['combined_text'])