Is there a neat way to aggregate columns into a new column without duplicating information?
For example, if I have a df:
Description Information
0 text1 text1
1 text2 text3
2 text4 text5
And I want to create a new column called 'Combined', which aggregates 'Description' and 'Information' to get:
Description Information Combined
0 text1 text1 text1
1 text2 text3 text2 text3
2 text4 text5 text4 text5
So far I have been using np.where and [mask] to check for duplicates before aggregating with df['Combined'] = df[['Description', 'Information']].agg(' '.join, axis=1)
Although this works, it is not practical on a larger scale, grateful if anyone knows of a simpler way!
CodePudding user response:
Add all of the first column and add the second column only where they differ:
In [17]: df
Out[17]:
a b
0 t1 t1
1 t2 t3
2 t4 t5
In [18]: df['c'] = df['a']
In [19]: df
Out[19]:
a b c
0 t1 t1 t1
1 t2 t3 t2
2 t4 t5 t4
In [20]: df.loc[df.a!=df.b, 'c'] = df['c'] ' ' df['b']
In [21]: df
Out[21]:
a b c
0 t1 t1 t1
1 t2 t3 t2 t3
2 t4 t5 t4 t5
CodePudding user response:
You can use numpy.where
:
import numpy as np
df['Combined'] = np.where(df['Description'] != df['Information'],
df['Information'] ' ' df['Description'],
df['Information'])
print(df)
# Output
Description Information Combined
0 text1 text1 text1
1 text2 text3 text3 text2
2 text4 text5 text5 text4
CodePudding user response:
You can first run unique
:
df['Combined'] = (df[['Description', 'Information']]
.agg(lambda x: ' '.join(x.unique()), axis=1)
)
Output:
Description Information Combined
0 text1 text1 text1
1 text2 text3 text2 text3
2 text4 text5 text4 text5
CodePudding user response:
You can do something like this:
import pandas as pd
def combine(row):
if row['Description'] == row['Information']:
return row['Description']
else:
return row['Description'] ' ' row['Information']
df = pd.read_csv('test2.csv')
df['new'] = df.apply(combine, axis=1)
print(df)
Description Information new
0 text1 text1 text1
1 text2 text3 text2 text3
2 text4 text5 text4 text5
CodePudding user response:
Another possible solution, based on list comprehension
:
df['Combined'] = [x[0] ' ' x[1] if x[0] != x[1] else x[0]
for x in zip(df.Description, df.Information)]
Output:
Description Information Combined
0 text1 text1 text1
1 text2 text3 text2 text3
2 text4 text5 text4 text5