Consider the sample df
below:
import pandas as pd
d = {'id': ["A123", "A123", "A123"],
'text1': ["this is a sample", "this is a sample", "this is a sample"],
'text2': ["sing with me", "one two three", "sing with me"]}
df = pd.DataFrame(data=d)
I'm trying to take the id column id
and concat the unique
values of each of the text columns, so that the sample df:
id text1 text2
A123 this is a sample sing with me
A123 this is a sample one two three
A123 this is a sample sing with me
Will look like this:
id combined_text
A123 this is a sample | sing with me | one two three
I tried all sort of combination of " | ".join(x)
and agg
and more... I can take d['id','text1'].unique()
and d['id','text2'].unique()
and later merge, but there must be a more efficient way.
CodePudding user response:
melt
and groupby
the id
, then join the set
of values:
(df.melt('id').groupby('id')['value']
.agg(lambda x: ' | '.join(set(x)))
.reset_index(name='combined_text'))
id | combined_text | |
---|---|---|
0 | A123 | one two three | sing with me | this is a sample |
Or if order is important, change set(x)
to x.unique()
:
(df.melt('id').groupby('id')['value']
.agg(lambda x: ' | '.join(x.unique()))
.reset_index(name='combined_text'))
id | combined_text | |
---|---|---|
0 | A123 | this is a sample | sing with me | one two three |
CodePudding user response:
I'd suggest using stack() and unique() here. I also broke this up by group, using the id column.
import pandas as pd
d = {'id': ["A123", "A123", "A123"],
'text1': ["this is a sample", "this is a sample", "this is a sample"],
'text 2': ["sing with me", "one two three", "sing with me"]}
df = pd.DataFrame(data=d)
df = pd.DataFrame(
df.groupby('id')
.apply(
lambda df: ' | '.join(df[['text1', 'text 2']].stack().unique())
),
columns=['combined_text'])