DataFrame cleaning-CodePudding

I have an excel spreadsheet which, when imported looks similar to:

df = pd.DataFrame({
datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})

2021-08-01	2021-09-01	2021-10-01	2021-11-01	2021-12-01
120	NaN	NaN	80	NaN
NaN	NaN	40	NaN	20
NaN	50	NaN	50	NaN
NaN	NaN	100	NaN	NaN
300	NaN	NaN	NaN	NaN

I'm looking (via python) to transform it into something like this:

shouldbe = pd.DataFrame({
"PayDate1": 
[datetime(2021,8,1), datetime(2021,10,1), datetime(2021,9,1), datetime(2021,10,1), datetime(2021,8,1)],
"Amount1": [120, 40, 50, 100, 300],
"PayDate2":
[datetime(2021,11,1), datetime(2021,12,1), datetime(2021,11,1), '', ''],
"Amount2": [80, 20, 50, np.nan, np.nan]}))

PayDate1	Amount1	PayDate2	Amount2
2021-08-01	120	2021-11-01	80
2021-10-01	40	2021-12-01	20
2021-09-01	50	2021-11-01	50
2021-10-01	100	NaT	NaN
2021-08-01	300	NaT	NaN

I'm looking for some examples of how to achieve this kind of transformation, thanks in advance for any help.

CodePudding user response：

You can use melt, groupby and pivot to get the expected dataframe:

Reshape your dataframe with melt:

out = df.reset_index() \
        .melt(id_vars='index', var_name='PayDate', value_name='Amount') \
        .dropna()
print(out)

# Output
    index    PayDate  Amount
0       0 2021-08-01   120.0  # <- index 0, 1st occurrence
4       4 2021-08-01   300.0  # <- index 4, 1st occurrence
7       2 2021-09-01    50.0  # <- index 2, 1st occurrence
11      1 2021-10-01    40.0  # <- index 1, 1st occurrence
13      3 2021-10-01   100.0  # <- index 3, 1st occurrence
15      0 2021-11-01    80.0  # <- index 0, 2nd occurrence
17      2 2021-11-01    50.0  # <- index 2, 2nd occurrence
21      1 2021-12-01    20.0  # <- index 1, 2nd occurrence

Group by index and apply cumcount to create index of new columns ('1' & '2' as string for future concatenation):

out['col'] = out.groupby('index').cumcount().add(1).astype(str)
print(out)

# Output:
    index    PayDate  Amount  col
0       0 2021-08-01   120.0    1
4       4 2021-08-01   300.0    1
7       2 2021-09-01    50.0    1
11      1 2021-10-01    40.0    1
13      3 2021-10-01   100.0    1
15      0 2021-11-01    80.0    2
17      2 2021-11-01    50.0    2
21      1 2021-12-01    20.0    2

Pivot the dataframe

out = out.pivot(index='index', columns='col', values=['PayDate', 'Amount'])
print(out)

# Output
         PayDate            Amount      
col            1          2      1     2
index                                   
0     2021-08-01 2021-11-01  120.0  80.0
1     2021-10-01 2021-12-01   40.0  20.0
2     2021-09-01 2021-11-01   50.0  50.0
3     2021-10-01        NaT  100.0   NaN
4     2021-08-01        NaT  300.0   NaN

Get the final dataframe

cols = out.columns.get_level_values(1).argsort()
out.columns = out.columns.to_flat_index().map(''.join)
out.index.name = None

out = out[out.columns[cols]]
print(out)
    PayDate1 Amount1   PayDate2 Amount2
0 2021-08-01   120.0 2021-11-01    80.0
1 2021-10-01    40.0 2021-12-01    20.0
2 2021-09-01    50.0 2021-11-01    50.0
3 2021-10-01   100.0        NaT     NaN
4 2021-08-01   300.0        NaT     NaN

CodePudding user response：

Purely for completeness, here's the non-pandas way of doing it:

import math
df = pd.DataFrame({
datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})

dates = df.columns
out = {k: [] for k in dates}

for row in df.iterrows():
    for i, val in enumerate(row[1]):
        d = dates[i]
        if not math.isnan(val):
            out[d].append(val)

print(out)

This is not pandasy (indeed the final output here is not even a pandas dataframe, although it's trivial to transform it back into one), but I claim it's a good deal easier to read and thus More Pythonic (TM). More to the point it might suit some use-cases better.

2021-08-01	2021-09-01	2021-10-01	2021-11-01	2021-12-01
120	NaN	NaN	80	NaN
NaN	NaN	40	NaN	20
NaN	50	NaN	50	NaN
NaN	NaN	100	NaN	NaN
300	NaN	NaN	NaN	NaN

2021-08-01	2021-09-01	2021-10-01	2021-11-01	2021-12-01
120	NaN	NaN	80	NaN
NaN	NaN	40	NaN	20
NaN	50	NaN	50	NaN
NaN	NaN	100	NaN	NaN
300	NaN	NaN	NaN	NaN

2021-08-01	2021-09-01	2021-10-01	2021-11-01	2021-12-01
120	NaN	NaN	80	NaN
NaN	NaN	40	NaN	20
NaN	50	NaN	50	NaN
NaN	NaN	100	NaN	NaN
300	NaN	NaN	NaN	NaN