I have an Excel spreadsheet which, when imported, looks similar to:
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
2021-08-01 | 2021-09-01 | 2021-10-01 | 2021-11-01 | 2021-12-01 |
---|---|---|---|---|
120 | NaN | NaN | 80 | NaN |
NaN | NaN | 40 | NaN | 20 |
NaN | 50 | NaN | 50 | NaN |
NaN | NaN | 100 | NaN | NaN |
300 | NaN | NaN | NaN | NaN |
I'd like to transform it (using Python) into something like this:
shouldbe = pd.DataFrame({
"PayDate1":
[datetime(2021,8,1), datetime(2021,10,1), datetime(2021,9,1), datetime(2021,10,1), datetime(2021,8,1)],
"Amount1": [120, 40, 50, 100, 300],
"PayDate2":
[datetime(2021,11,1), datetime(2021,12,1), datetime(2021,11,1), pd.NaT, pd.NaT],
"Amount2": [80, 20, 50, np.nan, np.nan]})
PayDate1 | Amount1 | PayDate2 | Amount2 |
---|---|---|---|
2021-08-01 | 120 | 2021-11-01 | 80 |
2021-10-01 | 40 | 2021-12-01 | 20 |
2021-09-01 | 50 | 2021-11-01 | 50 |
2021-10-01 | 100 | NaT | NaN |
2021-08-01 | 300 | NaT | NaN |
I'm looking for some examples of how to achieve this kind of transformation. Thanks in advance for any help.
CodePudding user response:
You can use melt, groupby and pivot to get the expected dataframe:
- Reshape your dataframe with melt:
out = df.reset_index() \
.melt(id_vars='index', var_name='PayDate', value_name='Amount') \
.dropna()
print(out)
# Output
index PayDate Amount
0 0 2021-08-01 120.0 # <- index 0, 1st occurrence
4 4 2021-08-01 300.0 # <- index 4, 1st occurrence
7 2 2021-09-01 50.0 # <- index 2, 1st occurrence
11 1 2021-10-01 40.0 # <- index 1, 1st occurrence
13 3 2021-10-01 100.0 # <- index 3, 1st occurrence
15 0 2021-11-01 80.0 # <- index 0, 2nd occurrence
17 2 2021-11-01 50.0 # <- index 2, 2nd occurrence
21 1 2021-12-01 20.0 # <- index 1, 2nd occurrence
- Group by index and apply cumcount to number each payment within its row. These numbers become the suffix of the new columns ('1' and '2', kept as strings for the later concatenation):
out['col'] = out.groupby('index').cumcount().add(1).astype(str)
print(out)
# Output:
index PayDate Amount col
0 0 2021-08-01 120.0 1
4 4 2021-08-01 300.0 1
7 2 2021-09-01 50.0 1
11 1 2021-10-01 40.0 1
13 3 2021-10-01 100.0 1
15 0 2021-11-01 80.0 2
17 2 2021-11-01 50.0 2
21 1 2021-12-01 20.0 2
- Pivot the dataframe:
out = out.pivot(index='index', columns='col', values=['PayDate', 'Amount'])
print(out)
# Output
PayDate Amount
col 1 2 1 2
index
0 2021-08-01 2021-11-01 120.0 80.0
1 2021-10-01 2021-12-01 40.0 20.0
2 2021-09-01 2021-11-01 50.0 50.0
3 2021-10-01 NaT 100.0 NaN
4 2021-08-01 NaT 300.0 NaN
- Get the final dataframe by flattening the two-level columns and reordering them so each PayDate sits next to its Amount:
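# Positions that sort the columns by their number level ('1', '1', '2', '2')
cols = out.columns.get_level_values(1).argsort()
# Flatten ('PayDate', '1') into 'PayDate1'
out.columns = out.columns.to_flat_index().map(''.join)
out.index.name = None
out = out[out.columns[cols]]
print(out)
# Output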
PayDate1 Amount1 PayDate2 Amount2
0 2021-08-01 120.0 2021-11-01 80.0
1 2021-10-01 40.0 2021-12-01 20.0
2 2021-09-01 50.0 2021-11-01 50.0
3 2021-10-01 100.0 NaT NaN
4 2021-08-01 300.0 NaT NaN
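Putting the four steps above together, here is a minimal sketch of the same approach as a single function. The name wide_dates_to_pairs is made up for illustration. It assumes the date columns are already in chronological order and that at least one amount is present. Unlike the steps above, it numbers the payments with integers and builds the column order explicitly instead of using argsort on strings, so a row with ten or more payments would still be ordered numerically.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def wide_dates_to_pairs(df):
    """Hypothetical helper: one column per date -> PayDateN/AmountN pairs per row."""
    # Long format: one row per (original row, date), keeping only real amounts.
    long = (
        df.reset_index()
        .melt(id_vars="index", var_name="PayDate", value_name="Amount")
        .dropna(subset=["Amount"])
    )
    # Make sure the dates are datetime64, whatever dtype melt produced.
    long["PayDate"] = pd.to_datetime(long["PayDate"])
    # Number the payments within each original row: 1, 2, ...
    long["col"] = long.groupby("index").cumcount().add(1)
    # Back to wide: the columns become (field, number) pairs.
    wide = long.pivot(index="index", columns="col", values=["PayDate", "Amount"])
    # Flatten ('PayDate', 1) into 'PayDate1', then interleave the pairs by number.
    wide.columns = [f"{field}{num}" for field, num in wide.columns]
    n_max = int(long["col"].max())
    ordered = [
        f"{field}{i}"
        for i in range(1, n_max + 1)
        for field in ("PayDate", "Amount")
    ]
    wide = wide[ordered]
    wide.index.name = None
    return wide


# Same sample data as in the question.
df = pd.DataFrame({
    datetime(2021, 8, 1): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1): [np.nan, 20, np.nan, np.nan, np.nan],
})

result = wide_dates_to_pairs(df)
print(result)
```

Note that a row with no amounts at all is dropped by dropna and so would not appear in the result; reindex on the original index afterwards if you need to keep such rows.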
CodePudding user response:
Purely for completeness, here's the non-pandas way of doing it:
import math
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
dates = df.columns
out = {k: [] for k in dates}
# Walk every cell and collect the non-NaN amounts under their date
for _, row in df.iterrows():
    for i, val in enumerate(row):
        d = dates[i]
        if not math.isnan(val):
            out[d].append(val)
print(out)
This is not pandasy (indeed the final output here is not even a pandas dataframe but a dict mapping each date to its list of non-NaN amounts, although it's trivial to transform it back into one), but I claim it's a good deal easier to read and thus More Pythonic (TM). More to the point, it might suit some use-cases better.
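For the "transform it back" part, here is one possible sketch. The dict literal below is a hand-written stand-in for the out dict that the loop produces from the sample data, and the name back is just for illustration. Because the lists have different lengths, each one is wrapped in a Series so pandas can pad the shorter ones with NaN.

```python
import pandas as pd

# Stand-in for the `out` dict built by the loop above (sample data from the question).
out = {
    pd.Timestamp(2021, 8, 1): [120.0, 300.0],
    pd.Timestamp(2021, 9, 1): [50.0],
    pd.Timestamp(2021, 10, 1): [40.0, 100.0],
    pd.Timestamp(2021, 11, 1): [80.0, 50.0],
    pd.Timestamp(2021, 12, 1): [20.0],
}

# Wrap each list in a Series so columns of unequal length are padded with NaN.
back = pd.DataFrame(
    {date: pd.Series(vals, dtype="float64") for date, vals in out.items()}
)
print(back)
```

Keep in mind that the dict groups amounts by date only, so the rows of back no longer correspond to the rows of the original spreadsheet.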