Home > Software design >  Transpose dataframe with cells as sum over columns
Transpose dataframe with cells as sum over columns

Time:09-28

I have a dataframe in the following form:

          x_30d   x_60d   y_30d   y_60d 
127         1.0     1.0     0.0     1.0
223         1.0     0.0     1.0     NaN
1406        1.0     NaN     1.0     0.0
2144        1.0     0.0     1.0     1.0
2234        1.0     0.0     NaN     NaN

I need to transform it into the following form (where each cell is the sum over each column above):

        30d     60d
x         5       1 
y         3       2

I've tried using dictionaries and splitting columns. melting the dataframe, along with transposing it, etc. but I cannot seem to get the correct pattern.

To make things slightly more complicated, here are some actual column names that have a mix of forms for date ranges: PASC_new_aches_30d_60d, PASC_new_aches_60d_180d, ... PASC_new_aches_360d, ..., PASC_new_jt_pain_180d_360d, ...

CodePudding user response:

In [131]: new = df.sum()

In [132]: new.index = pd.MultiIndex.from_frame(
                            new.index.str.extract(r"^(.*?)_(\d d.*)$"))

In [133]: new
Out[133]:
0               1
PASC_new_aches  30d_60d     5.0
                60d_180d    1.0
x               30d         3.0
PASC_new_aches  360d        2.0
dtype: float64

In [134]: new.unstack()
Out[134]:
1               30d  30d_60d  360d  60d_180d
0
PASC_new_aches  NaN      5.0   2.0       1.0
x               3.0      NaN   NaN       NaN
  • sum as usual per column

  • original's columns are now at the index; need to split them

    • using a regex here: ^(.*?)_(\d d.*)$
      • ^: beginning
      • (.*?) anything, but greedily until...
      • _(\d d.*) ...underscore followed by d pattern; also anything after it
      • $ the end
  • while splitting we extracted before & after of an underscore with (...)s

  • make them the new index (a multiindex now)

  • unstack the inner level to become new columns, i.e., the parts after "_"

noting that those "1" and "0" at the top left are the "name"s of the axes of the frame; 0 is that of df.index, 1 is of df.columns. They are there due to pd.MultiIndex.from_frame. Can remove by .rename_axis(index=None, columns=None).

CodePudding user response:

one option is with pivot_longer from pyjanitor:

# pip install pyjanitor
import janitor
import pandas as pd

(df
.agg(['sum'])
.pivot_longer(
     index = None, 
     names_to = ('other', '.value'), 
     names_sep='_')
)
  other  30d  60d
0     x  5.0  1.0
1     y  3.0  2.0

The .value determines which parts of the columns remain as column headers.

If your dataframe looks complicated (based on the columns you shared):

      PASC_new_aches_30d_60d  PASC_new_aches_60d_180d  PASC_new_aches_360d  PASC_new_jt_pain_180d_360d
127                      1.0                      1.0                  0.0                         1.0
223                      1.0                      0.0                  1.0                         NaN
1406                     1.0                      NaN                  1.0                         0.0
2144                     1.0                      0.0                  1.0                         1.0
2234                     1.0                      0.0                  NaN                         NaN

then a regex, similar to @MustafaAydin works better:

(df
.agg(['sum'])
.pivot_longer(
     index=None, 
     names_to = ('other', '.value'), 
     names_pattern=r"(\D )_(. )")
)
              other  30d_60d  60d_180d  360d  180d_360d
0    PASC_new_aches      5.0       1.0   3.0        NaN
1  PASC_new_jt_pain      NaN       NaN   NaN        2.0
  • Related