Home > OS >  Pandas merge rows where a certain column data is split over multiple rows
Pandas merge rows where a certain column data is split over multiple rows

Time:02-21

Extracting a table from PDF resulted in the following dataframe:

          Date      Transaction Details  Withdrawals  Deposits   Balance
0   01-01-2020  Tx1-Description - Line1       1625.0       NaN  97994.82
1          NaN                   Line 2          NaN       NaN       NaN
2   01-01-2020  Tx2-Description - Line1          NaN  84994.82  90000.00
3          NaN                   Line 2          NaN       NaN       NaN
4          NaN                   Line 3          NaN       NaN       NaN
5   02-01-2020  Tx3-Description - Line1         71.0       NaN  84923.82
6          NaN                   Line 2          NaN       NaN       NaN
7   02-01-2020  Tx4-Description - Line1          NaN     80.00  90000.00
8          NaN                   Line 2          NaN       NaN       NaN
9          NaN                   Line 3          NaN       NaN       NaN
10  03-01-2020  Tx5-Description - Line1        100.0       NaN  85000.00

How can I merge Transaction Details column correctly?

Desired output:

          Date      Transaction Details              Withdrawals  Deposits  Balance
0   01-01-2020  Tx1-Description - Line1 Line 2         1625.0      NaN       97994.82
1   01-01-2020  Tx2-Description - Line1 Line 2 Line 3  NaN         84994.82  90000.00
2   02-01-2020  Tx3-Description - Line1 Line 2         71.0        NaN       84923.82
3   02-01-2020  Tx4-Description - Line1 Line 2 Line 3  NaN         80.00     90000.00
4   03-01-2020  Tx5-Description - Line1                100.0       NaN       85000.00

CodePudding user response:

IIUC, you can groupby using the "Date" to form groups, then aggregate:

(df.groupby(df['Date'].notna().cumsum(), as_index=False)
   .agg({'Date': 'first', 'Transaction Details': ' '.join,
         'Withdrawals': 'sum', 'Deposits': 'sum', 'Balance': 'sum'})
)

NB. Note that the NaNs became 0, but you can replace(0, float('nan')) if needed

output:

         Date                    Transaction Details  Withdrawals  Deposits   Balance
0  01-01-2020         Tx1-Description - Line1 Line 2       1625.0      0.00  97994.82
1  01-01-2020  Tx2-Description - Line1 Line 2 Line 3          0.0  84994.82  90000.00
2  02-01-2020         Tx3-Description - Line1 Line 2         71.0      0.00  84923.82
3  02-01-2020  Tx4-Description - Line1 Line 2 Line 3          0.0     80.00  90000.00
4  03-01-2020                Tx5-Description - Line1        100.0      0.00  85000.00
  • Related