Home > Net >  Pandas group (unique) strings on different columns
Pandas group (unique) strings on different columns

Time:12-05

Consider the sample df below:

import pandas as pd

d = {'id': ["A123", "A123", "A123"],
     'text1': ["this is a sample", "this is a sample", "this is a sample"],
     'text2': ["sing with me", "one two three", "sing with me"]}
df = pd.DataFrame(data=d)

I'm trying to take the id column id and concat the unique values of each of the text columns, so that the sample df:

id    text1              text2
A123  this is a sample   sing with me
A123  this is a sample   one two three
A123  this is a sample   sing with me

Will look like this:

id    combined_text
A123  this is a sample | sing with me | one two three

I tried all sort of combination of " | ".join(x) and agg and more... I can take d['id','text1'].unique() and d['id','text2'].unique() and later merge, but there must be a more efficient way.

CodePudding user response:

I'd suggest using stack() and unique() here. I also broke this up by group, using the id column.

import pandas as pd

d = {'id': ["A123", "A123", "A123"],
     'text1': ["this is a sample", "this is a sample", "this is a sample"],
     'text 2': ["sing with me", "one two three", "sing with me"]}
df = pd.DataFrame(data=d)

df = pd.DataFrame(
    df.groupby('id')
      .apply(
          lambda df: ' | '.join(df[['text1', 'text 2']].stack().unique())
      ),
columns=['combined_text'])

CodePudding user response:

melt and groupby the id, then join the set of values:

(df.melt('id').groupby('id')['value']
   .agg(lambda x: ' | '.join(set(x)))
   .reset_index(name='combined_text'))
id combined_text
0 A123 one two three | sing with me | this is a sample
  • Related