Home > Enterprise >  DataFrame cleaning
DataFrame cleaning

Time:09-30

I have an excel spreadsheet which, when imported looks similar to:

df = pd.DataFrame({
datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
2021-08-01 2021-09-01 2021-10-01 2021-11-01 2021-12-01
120 NaN NaN 80 NaN
NaN NaN 40 NaN 20
NaN 50 NaN 50 NaN
NaN NaN 100 NaN NaN
300 NaN NaN NaN NaN

I'm looking (via python) to transform it into something like this:

shouldbe = pd.DataFrame({
"PayDate1": 
[datetime(2021,8,1), datetime(2021,10,1), datetime(2021,9,1), datetime(2021,10,1), datetime(2021,8,1)],
"Amount1": [120, 40, 50, 100, 300],
"PayDate2":
[datetime(2021,11,1), datetime(2021,12,1), datetime(2021,11,1), '', ''],
"Amount2": [80, 20, 50, np.nan, np.nan]}))
PayDate1 Amount1 PayDate2 Amount2
2021-08-01 120 2021-11-01 80
2021-10-01 40 2021-12-01 20
2021-09-01 50 2021-11-01 50
2021-10-01 100 NaT NaN
2021-08-01 300 NaT NaN

I'm looking for some examples of how to achieve this kind of transformation, thanks in advance for any help.

CodePudding user response:

You can use melt, groupby and pivot to get the expected dataframe:

  1. Reshape your dataframe with melt:
out = df.reset_index() \
        .melt(id_vars='index', var_name='PayDate', value_name='Amount') \
        .dropna()
print(out)

# Output
    index    PayDate  Amount
0       0 2021-08-01   120.0  # <- index 0, 1st occurrence
4       4 2021-08-01   300.0  # <- index 4, 1st occurrence
7       2 2021-09-01    50.0  # <- index 2, 1st occurrence
11      1 2021-10-01    40.0  # <- index 1, 1st occurrence
13      3 2021-10-01   100.0  # <- index 3, 1st occurrence
15      0 2021-11-01    80.0  # <- index 0, 2nd occurrence
17      2 2021-11-01    50.0  # <- index 2, 2nd occurrence
21      1 2021-12-01    20.0  # <- index 1, 2nd occurrence
  1. Group by index and apply cumcount to create index of new columns ('1' & '2' as string for future concatenation):
out['col'] = out.groupby('index').cumcount().add(1).astype(str)
print(out)

# Output:
    index    PayDate  Amount  col
0       0 2021-08-01   120.0    1
4       4 2021-08-01   300.0    1
7       2 2021-09-01    50.0    1
11      1 2021-10-01    40.0    1
13      3 2021-10-01   100.0    1
15      0 2021-11-01    80.0    2
17      2 2021-11-01    50.0    2
21      1 2021-12-01    20.0    2
  1. Pivot the dataframe
out = out.pivot(index='index', columns='col', values=['PayDate', 'Amount'])
print(out)

# Output
         PayDate            Amount      
col            1          2      1     2
index                                   
0     2021-08-01 2021-11-01  120.0  80.0
1     2021-10-01 2021-12-01   40.0  20.0
2     2021-09-01 2021-11-01   50.0  50.0
3     2021-10-01        NaT  100.0   NaN
4     2021-08-01        NaT  300.0   NaN
  1. Get the final dataframe
cols = out.columns.get_level_values(1).argsort()
out.columns = out.columns.to_flat_index().map(''.join)
out.index.name = None

out = out[out.columns[cols]]
print(out)
    PayDate1 Amount1   PayDate2 Amount2
0 2021-08-01   120.0 2021-11-01    80.0
1 2021-10-01    40.0 2021-12-01    20.0
2 2021-09-01    50.0 2021-11-01    50.0
3 2021-10-01   100.0        NaT     NaN
4 2021-08-01   300.0        NaT     NaN

CodePudding user response:

Purely for completeness, here's the non-pandas way of doing it:

import math
df = pd.DataFrame({
datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})

dates = df.columns
out = {k: [] for k in dates}

for row in df.iterrows():
    for i, val in enumerate(row[1]):
        d = dates[i]
        if not math.isnan(val):
            out[d].append(val)

print(out)

This is not pandasy (indeed the final output here is not even a pandas dataframe, although it's trivial to transform it back into one), but I claim it's a good deal easier to read and thus More Pythonic (TM). More to the point it might suit some use-cases better.

  • Related