I have a dataframe in the following form:
x_30d x_60d y_30d y_60d
127 1.0 1.0 0.0 1.0
223 1.0 0.0 1.0 NaN
1406 1.0 NaN 1.0 0.0
2144 1.0 0.0 1.0 1.0
2234 1.0 0.0 NaN NaN
I need to transform it into the following form (where each cell is the sum over each column above):
30d 60d
x 5 1
y 3 2
I've tried using dictionaries and splitting columns. melting the dataframe, along with transposing it, etc. but I cannot seem to get the correct pattern.
To make things slightly more complicated, here are some actual column names that have a mix of forms for date ranges: PASC_new_aches_30d_60d
, PASC_new_aches_60d_180d
, ... PASC_new_aches_360d
, ..., PASC_new_jt_pain_180d_360d
, ...
CodePudding user response:
In [131]: new = df.sum()
In [132]: new.index = pd.MultiIndex.from_frame(
new.index.str.extract(r"^(.*?)_(\d d.*)$"))
In [133]: new
Out[133]:
0 1
PASC_new_aches 30d_60d 5.0
60d_180d 1.0
x 30d 3.0
PASC_new_aches 360d 2.0
dtype: float64
In [134]: new.unstack()
Out[134]:
1 30d 30d_60d 360d 60d_180d
0
PASC_new_aches NaN 5.0 2.0 1.0
x 3.0 NaN NaN NaN
sum as usual per column
original's columns are now at the index; need to split them
- using a regex here:
^(.*?)_(\d d.*)$
^
: beginning(.*?)
anything, but greedily until..._(\d d.*)
...underscore followed by d pattern; also anything after it$
the end
- using a regex here:
while splitting we extracted before & after of an underscore with
(...)
smake them the new index (a multiindex now)
unstack the inner level to become new columns, i.e., the parts after "_"
noting that those "1" and "0" at the top left are the "name"s of the axes of the frame; 0 is that of df.index
, 1 is of df.columns
. They are there due to pd.MultiIndex.from_frame. Can remove by .rename_axis(index=None, columns=None)
.
CodePudding user response:
one option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
(df
.agg(['sum'])
.pivot_longer(
index = None,
names_to = ('other', '.value'),
names_sep='_')
)
other 30d 60d
0 x 5.0 1.0
1 y 3.0 2.0
The .value
determines which parts of the columns remain as column headers.
If your dataframe looks complicated (based on the columns you shared):
PASC_new_aches_30d_60d PASC_new_aches_60d_180d PASC_new_aches_360d PASC_new_jt_pain_180d_360d
127 1.0 1.0 0.0 1.0
223 1.0 0.0 1.0 NaN
1406 1.0 NaN 1.0 0.0
2144 1.0 0.0 1.0 1.0
2234 1.0 0.0 NaN NaN
then a regex, similar to @MustafaAydin works better:
(df
.agg(['sum'])
.pivot_longer(
index=None,
names_to = ('other', '.value'),
names_pattern=r"(\D )_(. )")
)
other 30d_60d 60d_180d 360d 180d_360d
0 PASC_new_aches 5.0 1.0 3.0 NaN
1 PASC_new_jt_pain NaN NaN NaN 2.0