Home > Enterprise >  Complex column and row reverse using Pandas
Complex column and row reverse using Pandas

Time:12-08

I have a sample dataframe, df, where I would like to maintain all columns in my original dataset and create a new pivoted column based on existing dataset. (grouped by id) Original dataframe has about 80 columns

Data

 s1    s2    id q122c q122a q122c2 q122a2 q222c q222a q222c2 q222a2 q322c q322a q322c2 q322a2
 50    1000  aa 40    10    900   100      50    0     1000   0     60    -10   1100  -100
 100   2000  bb 50    50    1500  500      75    25    1900   100    100    0    2000  0
                

                        
                                        

Desired

s1      s2      id  date    consumed1   avail1      consumed2   avail2  
50      1000    aa  q122    40          10          900         100             
50      1000    aa  q222    50          0           1000        0   
50      1000    aa  q322    60         -10          1100        -100
100     2000    bb  q122    50          50          1500        500             
100     2000    bb  q222    75          25          1900        100 
100     2000    bb  q322    100         0           2000        0


    

Doing

pd.wide_to_long(df.set_index(['stat1','stat2','id']).stack().groupby(level=[0,1,2,3]).agg(list).apply(pd.Series).unstack().stack(level=0).reset_index(),
                stubnames = ['q122','q222'], i = ['stat1','stat2','id','level_3'],j = 'date',suffix='\\w ').stack().unstack(level=[-3,-2])

A SO member has provided a script that will take the data from wide to long, however, the output I am getting is not capturing all the columns needed.

Any suggestion is appreciated.

CodePudding user response:

There are a couple of ways to go about this:

One option is to convert the columns to a MultiIndex, then reshape with stack:

index = ['s1', 's2', 'id']
temp = df.set_index(index)
temp.columns = temp.columns.str.split(r"([a|c]\d?)", expand = True).droplevel(-1)
temp.columns.names = ['date', None]
(temp.stack(level='date')
     .rename(columns={'c':'consumed1', 
                      'a':'avail1', 
                      'c2':'consumed2', 
                      'a2':'avail2'})
     .reset_index()
)

    s1    s2  id  date  avail1  avail2  consumed1  consumed2
0   50  1000  aa  q122      10     100         40        900
1   50  1000  aa  q222       0       0         50       1000
2   50  1000  aa  q322     -10    -100         60       1100
3  100  2000  bb  q122      50     500         50       1500
4  100  2000  bb  q222      25     100         75       1900
5  100  2000  bb  q322       0       0        100       2000

with pd.wide_to_long:

index = ['s1', 's2', 'id']
temp = df.set_index(index)
temp.columns = temp.columns.str.split(r"([a|c]\d?)").str[::-1].str[1:].str.join('_')

(pd.wide_to_long(temp.reset_index(), 
                 stubnames = ['c', 'a', 'c2', 'a2'], 
                 i = index, 
                 j = 'date', 
                 sep='_', 
                 suffix='. ')
  .rename(columns={'c':'consumed1', 
                   'a':'avail1', 
                   'c2':'consumed2', 
                   'a2':'avail2'})
  .reset_index()
)

    s1    s2  id  date  consumed1  avail1  consumed2  avail2
0   50  1000  aa  q122         40      10        900     100
1   50  1000  aa  q222         50       0       1000       0
2   50  1000  aa  q322         60     -10       1100    -100
3  100  2000  bb  q122         50      50       1500     500
4  100  2000  bb  q222         75      25       1900     100
5  100  2000  bb  q322        100       0       2000       0

Another option is with pivot_longer from pyjanitor, that offers a simpler way for such complex reshaping :

#pip install pyjanitor
import pandas as pd
import janitor
(df.pivot_longer(slice('s1', 'id'), 
                 names_to = ('date', '.value'), 
                 names_pattern = r"([a-z]\d )(. )", 
                 sort_by_appearance=True)
  .rename(columns={'c':'consumed1', 
                   'a':'avail1', 
                   'c2':'consumed2', 
                   'a2':'avail2'})
)
    s1    s2  id  date  consumed1  avail1  consumed2  avail2
0   50  1000  aa  q122         40      10        900     100
1   50  1000  aa  q222         50       0       1000       0
2   50  1000  aa  q322         60     -10       1100    -100
3  100  2000  bb  q122         50      50       1500     500
4  100  2000  bb  q222         75      25       1900     100
5  100  2000  bb  q322        100       0       2000       0

The names_pattern indicates how the columns will be reshaped. there are two values in names_to: ('date', '.value'). There are also two regex groups in names_pattern. The first regex group will be paired with date to form a new column, while the second regex group will be paired with .value to form new headers.

  • Related