Extracting a table from PDF resulted in the following dataframe:
Date Transaction Details Withdrawals Deposits Balance
0 01-01-2020 Tx1-Description - Line1 1625.0 NaN 97994.82
1 NaN Line 2 NaN NaN NaN
2 01-01-2020 Tx2-Description - Line1 NaN 84994.82 90000.00
3 NaN Line 2 NaN NaN NaN
4 NaN Line 3 NaN NaN NaN
5 02-01-2020 Tx3-Description - Line1 71.0 NaN 84923.82
6 NaN Line 2 NaN NaN NaN
7 02-01-2020 Tx4-Description - Line1 NaN 80.00 90000.00
8 NaN Line 2 NaN NaN NaN
9 NaN Line 3 NaN NaN NaN
10 03-01-2020 Tx5-Description - Line1 100.0 NaN 85000.00
How can I merge Transaction Details
column correctly?
Desired output:
Date Transaction Details Withdrawals Deposits Balance
0 01-01-2020 Tx1-Description - Line1 Line 2 1625.0 NaN 97994.82
1 01-01-2020 Tx2-Description - Line1 Line 2 Line 3 NaN 84994.82 90000.00
2 02-01-2020 Tx3-Description - Line1 Line 2 71.0 NaN 84923.82
3 02-01-2020 Tx4-Description - Line1 Line 2 Line 3 NaN 80.00 90000.00
4 03-01-2020 Tx5-Description - Line1 100.0 NaN 85000.00
CodePudding user response:
IIUC, you can groupby
using the "Date" to form groups, then aggregate:
(df.groupby(df['Date'].notna().cumsum(), as_index=False)
.agg({'Date': 'first', 'Transaction Details': ' '.join,
'Withdrawals': 'sum', 'Deposits': 'sum', 'Balance': 'sum'})
)
NB. Note that the NaNs became 0, but you can replace(0, float('nan'))
if needed
output:
Date Transaction Details Withdrawals Deposits Balance
0 01-01-2020 Tx1-Description - Line1 Line 2 1625.0 0.00 97994.82
1 01-01-2020 Tx2-Description - Line1 Line 2 Line 3 0.0 84994.82 90000.00
2 02-01-2020 Tx3-Description - Line1 Line 2 71.0 0.00 84923.82
3 02-01-2020 Tx4-Description - Line1 Line 2 Line 3 0.0 80.00 90000.00
4 03-01-2020 Tx5-Description - Line1 100.0 0.00 85000.00