I have a dataset
id category description status
11 A Text_1 Finished
11 A Text_2 Pause
11 A Text_3 Started
22 A Text_1 Pause
33 B Text_1 Finished
33 B Text_2 Finished
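For anyone who wants to reproduce this, here is a minimal sketch that builds the table above as a DataFrame. The variable name `data` is taken from the snippet later in the question; the integer dtype for `id` is an assumption.

```python
import pandas as pd

# Sample rows copied from the table above.
# Assumption: `id` is an integer column; the name `data` matches the later snippet.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})
```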
I want to group the data by id and concatenate description, but only for rows with status = 'Finished'.
The desired output is:
id category description
11 A Text_1
22 A
33 B Text_1 Text_2
I can concat it using
data.groupby(['id', 'category'])['description'].apply(' '.join).reset_index()
But how can I use a condition inside this expression?
CodePudding user response:
You can filter before the groupby, then reindex to bring back the missing groups:
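import pandas as pd

out = (
    data.loc[data.status == 'Finished']
    .groupby(['id', 'category'])['description']
    .apply(' '.join)
    # restore the groups that have no 'Finished' rows, with an empty description
    .reindex(pd.MultiIndex.from_frame(data[['id', 'category']].drop_duplicates()), fill_value='')
    .reset_index()
)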
Out[70]:
id category description
0 11 A Text_1
1 22 A
2 33 B Text_1 Text_2
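The filter-then-reindex idea can also be written step by step. This is only a sketch on the sample data from the question; the intermediate names `keys`, `finished` and `all_groups` are made up for illustration, and an empty string is assumed as the fill value.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

keys = ["id", "category"]  # hypothetical helper name

# 1. Keep only the finished rows, then join the descriptions per group.
finished = (
    data.loc[data["status"] == "Finished"]
    .groupby(keys)["description"]
    .apply(" ".join)
)

# 2. Build an index of every (id, category) pair in the original data.
all_groups = pd.MultiIndex.from_frame(data[keys].drop_duplicates())

# 3. Reindex so groups without finished rows reappear with an empty string.
out = finished.reindex(all_groups, fill_value="").reset_index()
```

Group (22, A) has no finished rows, so it drops out in step 1 and is only restored by the reindex in step 3.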
CodePudding user response:
You can use groupby.apply with a conditional and a default value if the group is empty after filtering:
out = (df
   .groupby(['id', 'category'])
   .apply(lambda g: ' '.join(d['description'])
                    if len(d:=g[g['status'].eq('Finished')])
                    else '' )
   .reset_index(name='description')
)
Output:
id category description
0 11 A Text_1
1 22 A
2 33 B Text_1 Text_2
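A variant of the same groupby.apply idea, sketched with a named helper instead of the lambda and walrus operator. The helper name `join_finished` is hypothetical. Selecting the two needed columns before `apply` is a choice made here so that the helper never depends on the grouping columns being passed in.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})


def join_finished(group: pd.DataFrame) -> str:
    """Join the descriptions of the finished rows of one group."""
    finished = group.loc[group["status"].eq("Finished"), "description"]
    # Joining an empty selection already yields '', so no explicit default is needed.
    return " ".join(finished)


out = (
    data.groupby(["id", "category"])[["description", "status"]]
    .apply(join_finished)
    .reset_index(name="description")
)
```

Because `' '.join` of an empty sequence is already an empty string, the explicit `if ... else ''` branch is optional here.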
CodePudding user response:
Here's a way:
key = ['id', 'category']
df2 = data[key].drop_duplicates().join(
    data.query("status == 'Finished'").groupby(key).description.apply(' '.join),
    on=key).fillna('').reset_index(drop=True)
Explanation:
- use query() to filter on status for "Finished", use groupby() to group by the key [id, category], then use your str.join() logic on the description column values within each group
- use a de-duped version of the key columns and DataFrame.join() to expand the filtered results to contain all key values, and use fillna() to replace NaN with an empty string in the description column for keys that were filtered out.
Output:
id category description
0 11 A Text_1
1 22 A
2 33 B Text_1 Text_2
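The steps from the explanation can be spelled out one at a time. This is a sketch on the question's sample data; the intermediate names `joined` and `all_keys` are introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

key = ["id", "category"]

# Step 1: filter to finished rows and join descriptions within each group.
joined = (
    data.query("status == 'Finished'")
    .groupby(key)["description"]
    .apply(" ".join)
)

# Step 2: every distinct key pair, including those with no finished rows.
all_keys = data[key].drop_duplicates()

# Step 3: join the results onto the key pairs (a left join by default);
# keys that were filtered out get NaN, which fillna() turns into ''.
df2 = all_keys.join(joined, on=key).fillna("").reset_index(drop=True)
```

The final reset_index(drop=True) only tidies the row labels, which would otherwise keep the positions of the first row of each key in the original data.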