How to agglomerate rows/lines in Python Pandas, based on one reference column?-CodePudding

Let's say we have a DataFrame like this in Pandas:

         0                                1                     2          3               4
0     Date                      Description                              ABC             DEF
1                                                     LOREM ISPUM                    1234.00
2    01/08          LOREM IPSUM LOREM IPSUM                            12.34                    
3                      ABCDEF ABCDEF ACBDEF                                                      
4    02/08                   DOLOR SIT AMET                            56.78                    
5               CONSECTETUR ADIPISCING ELIT                                                              
6                     SED DO EIUSMOD TEMPOR                                                              
7    07/08                      LOREM ISPUM                            90.12
8    08/08                   DOLOR SIT AMET                                            34.56

Rows can take 1 line (#7 or #8), but can also be multiline: for example for lines #2-3, or #4-5-6, you can see that the column 1 / Description is split on several lines.

How to merge rows such that they all take one line, based on a reference column (here 0 / Date ), with Pandas?

Expected output:

         0                                                                  1                     2          3               4
0     Date                                                        Description                              ABC             DEF
1                                                                                       LOREM ISPUM                    1234.00
2    01/08                       LOREM IPSUM LOREM IPSUM ABCDEF ABCDEF ACBDEF                            12.34                    
3    02/08   DOLOR SIT AMET CONSECTETUR ADIPISCING ELIT SED DO EIUSMOD TEMPOR                            56.78                                                                                  
4    07/08                                                        LOREM ISPUM                            90.12
5    08/08                                                     DOLOR SIT AMET                                            34.56

Note: if, after the header "Date, Description, , ABC, DEF", a row has an empty Date, don't agglomerate until one non-empty date comes: here line #1 is not agglomerated.

CodePudding user response：

You may use the groupby and agg functions as below:

import pandas as pd

data = {'date':['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
        'description':['A', 'B', 'C', 'D', 'E'],
        'value1':range(1,6),
        'value2':range(1,11,2)}

df = pd.DataFrame(data=data)

df2 = df.groupby('date').agg(
    description_agg=pd.NamedAgg(column='description', aggfunc=lambda x: x.str.cat(sep=', ')),
    value1_agg=pd.NamedAgg(column='value1', aggfunc='sum'),
    value2_agg=pd.NamedAgg(column='value2', aggfunc='mean')
)

There are some ways to aggregate data as you can see on the documentation (https://pandas.pydata.org/docs/user_guide/groupby.html). I use this way because you can control the result column name and function applied on each column.

CodePudding user response：

For the date rows i.e. rows from 2 to end (df.iloc[2:]) do ffill on '0' column and groupby and agg using ' '.join

Then concat the aggregated frame to the first two rows df.iloc[:2]

Code:

grouper = df.iloc[2:]['0'].str.replace(r'\s |^$',repl=lambda x:np.nan,regex=True).ffill()
out = (
    pd.concat([df.iloc[:2], 
               df.iloc[2:].groupby(grouper, as_index=False).agg(' '.join)]).
    reset_index(drop=True)
)

print(out):

         0                                                  1            2  \
0     Date                                        Description                
1                                                              LOREM ISPUM   
2   01/08        LOREM IPSUM LOREM IPSUM ABCDEF ABCDEF ACBDEF                
3  02/08    DOLOR SIT AMET CONSECTETUR ADIPISCING ELIT SED...                
4    07/08                                        LOREM ISPUM                
5    08/08                                     DOLOR SIT AMET                

         3        4  
0      ABC      DEF  
1           1234.00  
2   12.34            
3  56.78             
4    90.12           
5             34.56