Let's say we have a DataFrame like this in Pandas:
0 1 2 3 4
0 Date Description ABC DEF
1 LOREM ISPUM 1234.00
2 01/08 LOREM IPSUM LOREM IPSUM 12.34
3 ABCDEF ABCDEF ACBDEF
4 02/08 DOLOR SIT AMET 56.78
5 CONSECTETUR ADIPISCING ELIT
6 SED DO EIUSMOD TEMPOR
7 07/08 LOREM ISPUM 90.12
8 08/08 DOLOR SIT AMET 34.56
Rows can take 1 line (#7 or #8), but can also be multiline: for example for lines #2-3, or #4-5-6, you can see that the column 1 / Description
is split on several lines.
How to merge rows such that they all take one line, based on a reference column (here 0 / Date
), with Pandas?
Expected output:
0 1 2 3 4
0 Date Description ABC DEF
1 LOREM ISPUM 1234.00
2 01/08 LOREM IPSUM LOREM IPSUM ABCDEF ABCDEF ACBDEF 12.34
3 02/08 DOLOR SIT AMET CONSECTETUR ADIPISCING ELIT SED DO EIUSMOD TEMPOR 56.78
4 07/08 LOREM ISPUM 90.12
5 08/08 DOLOR SIT AMET 34.56
Note: if, after the header "Date, Description, , ABC, DEF", a row has an empty Date, don't agglomerate until one non-empty date comes: here line #1 is not agglomerated.
CodePudding user response:
You may use the groupby and agg functions as below:
import pandas as pd
data = {'date':['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
'description':['A', 'B', 'C', 'D', 'E'],
'value1':range(1,6),
'value2':range(1,11,2)}
df = pd.DataFrame(data=data)
df2 = df.groupby('date').agg(
description_agg=pd.NamedAgg(column='description', aggfunc=lambda x: x.str.cat(sep=', ')),
value1_agg=pd.NamedAgg(column='value1', aggfunc='sum'),
value2_agg=pd.NamedAgg(column='value2', aggfunc='mean')
)
There are some ways to aggregate data as you can see on the documentation (https://pandas.pydata.org/docs/user_guide/groupby.html). I use this way because you can control the result column name and function applied on each column.
CodePudding user response:
For the date rows i.e. rows from 2 to end (df.iloc[2:]
) do ffill
on '0' column and groupby
and agg
using ' '.join
Then concat the aggregated frame to the first two rows df.iloc[:2]
Code:
grouper = df.iloc[2:]['0'].str.replace(r'\s |^$',repl=lambda x:np.nan,regex=True).ffill()
out = (
pd.concat([df.iloc[:2],
df.iloc[2:].groupby(grouper, as_index=False).agg(' '.join)]).
reset_index(drop=True)
)
print(out):
0 1 2 \
0 Date Description
1 LOREM ISPUM
2 01/08 LOREM IPSUM LOREM IPSUM ABCDEF ABCDEF ACBDEF
3 02/08 DOLOR SIT AMET CONSECTETUR ADIPISCING ELIT SED...
4 07/08 LOREM ISPUM
5 08/08 DOLOR SIT AMET
3 4
0 ABC DEF
1 1234.00
2 12.34
3 56.78
4 90.12
5 34.56