Home > Net >  Aggregating df columns but not duplicates
Aggregating df columns but not duplicates

Time:01-10

Is there a neat way to aggregate columns into a new column without duplicating information?

For example, if I have a df:

     Description  Information
  0       text1     text1
  1       text2     text3
  2       text4     text5

And I want to create a new column called 'Combined', which aggregates 'Description' and 'Information' to get:

     Description  Information  Combined
  0       text1     text1        text1
  1       text2     text3      text2 text3
  2       text4     text5      text4 text5

So far I have been using np.where and [mask] to check for duplicates before aggregating with df['Combined'] = df[['Description', 'Information']].agg(' '.join, axis=1)

Although this works, it is not practical on a larger scale, grateful if anyone knows of a simpler way!

CodePudding user response:

Add all of the first column and add the second column only where they differ:

In [17]: df
Out[17]:
    a   b
0  t1  t1
1  t2  t3
2  t4  t5

In [18]: df['c'] = df['a']

In [19]: df
Out[19]:
    a   b   c
0  t1  t1  t1
1  t2  t3  t2
2  t4  t5  t4

In [20]: df.loc[df.a!=df.b, 'c'] = df['c']   ' '   df['b']

In [21]: df
Out[21]:
    a   b      c
0  t1  t1     t1
1  t2  t3  t2 t3
2  t4  t5  t4 t5

CodePudding user response:

You can use numpy.where:

import numpy as np

df['Combined'] = np.where(df['Description'] != df['Information'], 
                          df['Information']   ' '   df['Description'], 
                          df['Information'])
print(df)

# Output
  Description Information     Combined
0       text1       text1        text1
1       text2       text3  text3 text2
2       text4       text5  text5 text4

CodePudding user response:

You can first run unique:

df['Combined'] = (df[['Description', 'Information']]
                  .agg(lambda x: ' '.join(x.unique()), axis=1)
                 )

Output:

  Description Information     Combined
0       text1       text1        text1
1       text2       text3  text2 text3
2       text4       text5  text4 text5

CodePudding user response:

You can do something like this:

import pandas as pd


def combine(row):
    if row['Description'] == row['Information']:
        return row['Description']
    else:
        return row['Description']   ' '   row['Information']

df = pd.read_csv('test2.csv')
df['new'] = df.apply(combine, axis=1)
print(df)


  Description Information          new
0       text1       text1        text1
1       text2       text3  text2 text3
2       text4       text5  text4 text5

CodePudding user response:

Another possible solution, based on list comprehension:

df['Combined'] = [x[0]   ' '   x[1] if x[0] != x[1] else x[0]
                  for x in zip(df.Description, df.Information)]

Output:

  Description Information     Combined
0       text1       text1        text1
1       text2       text3  text2 text3
2       text4       text5  text4 text5
  • Related