Pivot and create running total from dataset using Python

Time:10-08

I have a dataset, df, where I would like to pivot its rows to column titles and create a running total:

Data

cons_power  cons_count  id  date    r_power r_count d_power d_count
500         20          aa  q122    50      2       20      1
0           0           aa  q222    20      1       0       0
1000        80          bb  q122    0       0       50      3
0           0           bb  q222    100     5       0       0

Desired

id  q122_power  q122_count  q222_power  q222_count
aa  530         21          550         22
bb  950         77          1050        82

Logic

We start with the cons_power value of 500 for id 'aa' at q122, add the r_power value of 50, and subtract the d_power value of 20, which gives 530.

We then carry 530 forward as the baseline for q222 and repeat the process of adding r_power and subtracting d_power: 530 + 20 - 0 = 550.

The same concept applies to the counts: start with a count of 20 for 'aa' at q122, add 2, and subtract 1, giving 21.

We then carry 21 forward as the baseline for q222, adding r_count and subtracting d_count: 21 + 1 - 0 = 22.
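The running-total logic above can be checked with a short sketch in plain Python; the values are taken from the question's example rows for id 'aa':

```python
# Per-quarter rows for id 'aa', as given in the question's data
rows = [
    {'cons_power': 500, 'r_power': 50, 'd_power': 20,
     'cons_count': 20, 'r_count': 2, 'd_count': 1},   # q122
    {'cons_power': 0, 'r_power': 20, 'd_power': 0,
     'cons_count': 0, 'r_count': 1, 'd_count': 0},    # q222
]

power = count = 0
totals = []
for row in rows:
    # each quarter adds cons + r and subtracts d, carried from the prior quarter
    power += row['cons_power'] + row['r_power'] - row['d_power']
    count += row['cons_count'] + row['r_count'] - row['d_count']
    totals.append((power, count))

# totals == [(530, 21), (550, 22)]
```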

Doing

df['q122_power'] = df['r_power'].add(df.groupby(['id', 'date'])['r_power'].cumsum()).sub(df.groupby(['id', 'date'])['d_power'].cumsum())

df['q222_power'] = df['r_power'].add(df.groupby(['id', 'date'])['r_power'].cumsum()).sub(df.groupby(['id', 'date'])['d_power'].cumsum())

df['q122_count'] = df['r_count'].add(df.groupby(['id', 'date'])['r_count'].cumsum()).sub(df.groupby(['id', 'date'])['d_count'].cumsum())

df['q222_count'] = df['r_count'].add(df.groupby(['id', 'date'])['r_count'].cumsum()).sub(df.groupby(['id', 'date'])['d_count'].cumsum())

I'm trying to find a more elegant way to produce the desired output. I have been able to solve the grouping and calculations but am unsure how to incorporate the pivot step into the script. Any suggestions are appreciated.

CodePudding user response:

Compute power and count before the groupby, then pivot the dataframe. Finally, flatten the column MultiIndex and reset the row index.

out = df[['id', 'date']].join(
        df.assign(power=df['cons_power'] + df['r_power'] - df['d_power'],
                  count=df['cons_count'] + df['r_count'] - df['d_count'])
          .groupby('id')[['power', 'count']].cumsum()
    ).pivot(index='id', columns='date', values=['power', 'count'])

out.columns = out.columns.to_flat_index().map('_'.join)

Output

>>> out.reset_index()
   id  power_q122  power_q222  count_q122  count_q222
0  aa         530         550          21          22
1  bb         950        1050          77          82
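The flattening step can be seen in isolation: `to_flat_index()` turns the pivoted MultiIndex into tuples, and `map('_'.join)` glues each tuple into a single name. A minimal sketch, assuming the power/count level comes first as in the pivot output above:

```python
import pandas as pd

# Hypothetical column index shaped like the pivot output above
cols = pd.MultiIndex.from_product([['power', 'count'], ['q122', 'q222']])

# to_flat_index() yields tuples such as ('power', 'q122');
# '_'.join turns each tuple into a single flat name
flat = cols.to_flat_index().map('_'.join)
# flat: ['power_q122', 'power_q222', 'count_q122', 'count_q222']
```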

CodePudding user response:

We can also create a column MultiIndex so that the computation can be done using the top-level index groups cons, r, and d:

# Columns Not To Split into MultiIndex
df = df.set_index(['id', 'date'])
# Split into MultiIndex
df.columns = df.columns.str.split('_', expand=True)
df = (
    # Do Computations on the 2D top-level indexes
    (df['cons'] + df['r'] - df['d'])
        .groupby(level='id').cumsum()  # Rolling Total within IDs
        .unstack(level='date')  # Convert date index level into Column Level
)
# Collapse MultiIndex
df.columns = df.columns.map('_'.join)
# Restore ID column
df = df.reset_index()

df:

   id  power_q122  power_q222  count_q122  count_q222
0  aa         530         550          21          22
1  bb         950        1050          77          82

How it works:

set_index then create MultiIndex with str.split:

# Columns Not To Split into MultiIndex
df = df.set_index(['id', 'date'])
# Split into MultiIndex
df.columns = df.columns.str.split('_', expand=True)


         cons           r           d      
        power count power count power count
id date                                    
aa q122   500    20    50     2    20     1
   q222     0     0    20     1     0     0
bb q122  1000    80     0     0    50     3
   q222     0     0   100     5     0     0

We can now access the top level indexes and do the computation (index alignment will ensure that the computations occur within the lower level indexes power and count):

df['cons'] + df['r'] - df['d']

         power  count
id date              
aa q122    530     21
   q222     20      1
bb q122    950     77
   q222    100      5

Then take the groupby cumsum to get a rolling total within each id:

(df['cons'] + df['r'] - df['d']).groupby(level='id').cumsum()

         power  count
id date              
aa q122    530     21
   q222    550     22
bb q122    950     77
   q222   1050     82

unstack the Date column to convert the index level to a column level:

(df['cons'] + df['r'] - df['d']).groupby(level='id').cumsum().unstack(level='date')

     power       count     
date  q122  q222  q122 q222
id                         
aa     530   550    21   22
bb     950  1050    77   82

Now the rest is just cleanup. Collapse the MultiIndex with Index.map:

df.columns = df.columns.map('_'.join)

    power_q122  power_q222  count_q122  count_q222
id                                                
aa         530         550          21          22
bb         950        1050          77          82

And reset_index to restore a default range index:

df = df.reset_index()

   id  power_q122  power_q222  count_q122  count_q222
0  aa         530         550          21          22
1  bb         950        1050          77          82

Setup and imports:

import pandas as pd

df = pd.DataFrame({
    'cons_power': [500, 0, 1000, 0], 'cons_count': [20, 0, 80, 0],
    'id': ['aa', 'aa', 'bb', 'bb'], 'date': ['q122', 'q222', 'q122', 'q222'],
    'r_power': [50, 20, 0, 100], 'r_count': [2, 1, 0, 5],
    'd_power': [20, 0, 50, 0], 'd_count': [1, 0, 3, 0]
})

CodePudding user response:

Modify df so that any column starting with d_ has a negative sign:

df_c = pd.DataFrame({key: value * -1 if key.startswith('d_') else value
                     for key, value in df.items()})


df_c = df_c.set_index(['id', 'date'])

Filter for power and count and aggregate:

power = df_c.filter(like='power').sum(axis=1).rename('power')

count = df_c.filter(like='count').sum(axis=1).rename('count')

df_c = pd.concat([power, count], axis=1)

Compute the cumulative sum within each id and reshape:

df_c = (df_c.groupby('id')
            .cumsum()
            .unstack('date')
            .swaplevel(axis=1)
            .rename_axis(columns=[None, None])
         )

df_c.columns = df_c.columns.map('_'.join)

df_c.reset_index()
 
   id  q122_power  q222_power  q122_count  q222_count
0  aa         530         550          21          22
1  bb         950        1050          77          82
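Note that `swaplevel(axis=1)` is what puts the date first in the flattened names (q122_power rather than power_q122), matching the desired output. A minimal sketch of that reordering on a standalone MultiIndex shaped like the `unstack('date')` result:

```python
import pandas as pd

# Hypothetical (measure, date) column index as produced by unstack('date')
cols = pd.MultiIndex.from_tuples([('power', 'q122'), ('power', 'q222'),
                                  ('count', 'q122'), ('count', 'q222')])

# swaplevel() reorders each tuple to (date, measure) before joining
flat = cols.swaplevel().map('_'.join)
# flat: ['q122_power', 'q222_power', 'q122_count', 'q222_count']
```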
