Pandas: groupby and concat strings with condition

Time:06-21

I have a dataset

id   category   description   status
11   A          Text_1        Finished
11   A          Text_2        Pause
11   A          Text_3        Started
22   A          Text_1        Pause
33   B          Text_1        Finished
33   B          Text_2        Finished

I want to group the data by id and concatenate description only for rows with status = 'Finished'.

So the desired output is

id    category   description
11    A          Text_1
22    A          
33    B          Text_1 Text_2

I can concat it using

data.groupby(['id', 'category'])['description'].apply(' '.join).reset_index()

But how can I apply a condition inside this expression?

CodePudding user response:

You can filter before the groupby, then reindex to restore the groups that the filter removed:

out = (data.loc[data.status == 'Finished']
 .groupby(['id', 'category'])['description']
 .apply(' '.join)
 .reindex(pd.MultiIndex.from_frame(data[['id', 'category']].drop_duplicates()),
          fill_value=' ')
 .reset_index()
 )

Output:

   id category    description
0  11        A         Text_1
1  22        A               
2  33        B  Text_1 Text_2
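
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same filter-then-reindex idea. The DataFrame is rebuilt from the table in the question; the names keys, finished and all_groups are only illustrative, and fill_value is set to an empty string here (an assumption, to match the blank cell in the desired output).

```python
import pandas as pd

# Sample data rebuilt from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

keys = ["id", "category"]  # illustrative name for the grouping columns

# Join descriptions of the Finished rows only; group (22, A) drops out here.
finished = (
    data.loc[data["status"] == "Finished"]
    .groupby(keys)["description"]
    .agg(" ".join)
)

# Every (id, category) pair that exists in the unfiltered data.
all_groups = pd.MultiIndex.from_frame(data[keys].drop_duplicates())

# Reindex to bring the missing groups back with an empty description.
out = finished.reindex(all_groups, fill_value="").reset_index()
print(out)
```

Splitting the chain into named steps makes it easier to inspect finished and see which groups were lost before the reindex.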

CodePudding user response:

You can use groupby.apply with a conditional expression that falls back to a default value when the group is empty after filtering (the := operator requires Python 3.8+):

out = (data
 .groupby(['id', 'category'])
 .apply(lambda g: ' '.join(d['description'])
                  if len(d:=g[g['status'].eq('Finished')])
                  else '' )
 .reset_index(name='description')
 )

Output:

   id category    description
0  11        A         Text_1
1  22        A               
2  33        B  Text_1 Text_2
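
A possible variation on the same idea, not the method from this answer: blank out the descriptions of non-Finished rows with Series.where first, so that no group is lost, and then join only the non-null values. This is a sketch that assumes the sample data from the question; it avoids the walrus operator and so also runs on older Python versions.

```python
import pandas as pd

# Sample data rebuilt from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

out = (
    data
    # Keep the description where status is Finished, otherwise set it to NaN.
    .assign(description=data["description"].where(data["status"].eq("Finished")))
    .groupby(["id", "category"])["description"]
    # Joining after dropna() gives '' for groups with no Finished rows.
    .agg(lambda s: " ".join(s.dropna()))
    .reset_index()
)
print(out)
```

Because the rows are masked rather than removed, every (id, category) group survives the groupby and no reindex or join step is needed afterwards.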

CodePudding user response:

Here's a way:

key = ['id', 'category']
df2 = data[key].drop_duplicates().join(
    data.query("status == 'Finished'").groupby(key).description.apply(' '.join), 
    on=key).fillna('').reset_index(drop=True)

Explanation:

  • use query() to filter on status for "Finished", use groupby() to group by the key [id, category], then use your str.join() logic on the description column values within each group
  • use a de-duped version of the key columns and DataFrame.join() to expand the filtered results so that they contain all key values, then use fillna() to replace NaN with an empty string in the description column for keys that were filtered out.

Output:

   id category    description
0  11        A         Text_1
1  22        A
2  33        B  Text_1 Text_2
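
To make the two steps in the explanation easier to follow, here is a sketch that runs them one at a time on the sample data from the question. The intermediate names joined and expanded are illustrative only, and fillna is restricted to the description column here as a small assumption about intent.

```python
import pandas as pd

# Sample data rebuilt from the question.
data = pd.DataFrame({
    "id": [11, 11, 11, 22, 33, 33],
    "category": ["A", "A", "A", "A", "B", "B"],
    "description": ["Text_1", "Text_2", "Text_3", "Text_1", "Text_1", "Text_2"],
    "status": ["Finished", "Pause", "Started", "Pause", "Finished", "Finished"],
})

key = ["id", "category"]

# Step 1: one joined string per key, built from Finished rows only.
joined = (
    data.query("status == 'Finished'")
    .groupby(key)["description"]
    .agg(" ".join)
)

# Step 2: the left side holds every distinct key; join() looks each one up
# in `joined`, leaving NaN where a key has no Finished rows.
expanded = data[key].drop_duplicates().join(joined, on=key)
print(expanded)

# Step 3: replace the NaN with an empty string and renumber the rows.
df2 = expanded.fillna({"description": ""}).reset_index(drop=True)
print(df2)
```

Printing expanded before the fillna shows the NaN for id 22, which is the row the filter would otherwise have dropped.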