When I concatenate column values across duplicate rows, all columns other than those specified in the groupby disappear.
My dataframe has the columns title, thumbnail, name, and created_at.
I want to find the rows that share the same title and thumbnail, concatenate the values of their name column into a new column, and keep the first row of each group.
However, with the following code, all columns other than those specified in the groupby disappear:
df.groupby(['title', 'thumbnail'])['name'].apply(lambda x: ' '.join(x)).reset_index()
CodePudding user response:
Suppose the following dataframe:
>>> df
title thumbnail name created_at
0 title1 thumb1 name1 today
1 title1 thumb1 name2 yesterday
2 title1 thumb2 name3 another day
The output of your code is:
>>> df.groupby(['title', 'thumbnail'], as_index=False)['name'] \
.apply(' '.join)
title thumbnail name
0 title1 thumb1 name1 name2
1 title1 thumb2 name3
If you don't want to lose columns or rows (i.e. keep the original shape), use transform:
df['name'] = df.groupby(['title', 'thumbnail'])['name'] \
.transform(' '.join)
print(df)
# Output:
title thumbnail name created_at
0 title1 thumb1 name1 name2 today
1 title1 thumb1 name1 name2 yesterday
2 title1 thumb2 name3 another day
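Since the question also asks to keep only the first row of each duplicated group, the transform step can be followed by drop_duplicates. A small runnable sketch (using the same toy data as above):

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['title1', 'title1', 'title1'],
    'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
    'name': ['name1', 'name2', 'name3'],
    'created_at': ['today', 'yesterday', 'another day'],
})

# Concatenate names within each (title, thumbnail) group,
# broadcasting the result back to every row of the group...
df['name'] = df.groupby(['title', 'thumbnail'])['name'].transform(' '.join)

# ...then keep only the first row of each group.
deduped = df.drop_duplicates(subset=['title', 'thumbnail'], keep='first')
print(deduped)
```

This keeps created_at from the first row of each group, matching the agg approach below.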
Otherwise you have to decide what to keep for each of the other columns. In this case, do you want to keep 'today' or 'yesterday' for created_at
? To do that, you can use agg
:
>>> df.groupby(['title', 'thumbnail']) \
.agg({'name': ' '.join, 'created_at': 'first'}) \
.reset_index()
title thumbnail name created_at
0 title1 thumb1 name1 name2 today
1 title1 thumb2 name3 another day
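As an aside (not part of the original answer), pandas 0.25+ also supports named aggregation, which expresses the same per-column choices with keyword syntax:

```python
import pandas as pd

df = pd.DataFrame({'title': ['title1', 'title1', 'title1'],
                   'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
                   'name': ['name1', 'name2', 'name3'],
                   'created_at': ['today', 'yesterday', 'another day']})

# Named aggregation: each keyword names an output column,
# and the tuple gives (source column, aggregation function).
out = (df.groupby(['title', 'thumbnail'], as_index=False)
         .agg(name=('name', ' '.join),
              created_at=('created_at', 'first')))
print(out)
```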
Setup:
data = {'title': ['title1', 'title1', 'title1'],
'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
'name': ['name1', 'name2', 'name3'],
'created_at': ['today', 'yesterday', 'another day']}
df = pd.DataFrame(data)
CodePudding user response:
Using a toy DataFrame
for illustration:
df = pd.DataFrame({
'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
'name': list('abcdef'),
'marks': [55, 99, 14, 28, 19, 88]
})
In any case we will need to group:
groups = df.groupby('title')
Here is a neat solution using a join:
groups.first().join(groups['name'].agg(' '.join), rsuffix='s')
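Putting that together into a runnable sketch (note the toy DataFrame has no thumbnail column, so the grouping here is on title alone):

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88],
})

groups = df.groupby('title')

# first() keeps one representative row per group; the joined Series
# of concatenated names collides with the existing 'name' column,
# so rsuffix='s' renames the aggregated one to 'names'.
result = groups.first().join(groups['name'].agg(' '.join), rsuffix='s')
print(result)
```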
A more efficient solution would get the name aggregation and the rest of the columns in a single pass:
def process(group):
    result = group.iloc[0]  # take the first row (a Series)
    # then add a concatenation of all names for this group
    result['names'] = ' '.join(group['name'])
    # return the first row with the new column added
    return result
This could also be done in a single line:
def process(group):
return group.iloc[[0]].assign(names=' '.join(group['name']))
Then simply apply the helper function to all the groups:
groups.apply(process)
The two methods get the same results:
       title name  marks  names
title
lewis  lewis    f     88      f
mark    mark    d     28    d e
tom      tom    a     55  a b c
CodePudding user response:
That's because you're selecting the name
column via ['name']
, so by definition, the only columns available are the grouping columns that make up the index (which are required) and the column you're selecting.
Instead of calling apply
on the ['name']
column of the groupby, call apply
directly on the groupby:
df.groupby(['title', 'thumbnail']).apply(lambda x: ' '.join(x['name'])).reset_index()
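One caveat worth noting (my addition, not from the answer): apply on the whole groupby returns an unnamed Series, so reset_index() would label the concatenated column 0. Series.reset_index accepts a name= argument to label it properly:

```python
import pandas as pd

df = pd.DataFrame({'title': ['title1', 'title1', 'title1'],
                   'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
                   'name': ['name1', 'name2', 'name3'],
                   'created_at': ['today', 'yesterday', 'another day']})

# apply over the whole groupby returns an unnamed Series;
# name= gives the resulting column a proper label.
out = (df.groupby(['title', 'thumbnail'])
         .apply(lambda x: ' '.join(x['name']))
         .reset_index(name='name'))
print(out)
```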