Pivot and create running total from dataset using Python

Time:10-08

I have a dataset, df, where I would like to pivot its rows to column titles and create a running total:

Data

cons_power  cons_count  id  date    r_power r_count d_power d_count
500         20          aa  q122    50      2       20      1
0           0           aa  q222    20      1       0       0
1000        80          bb  q122    0       0       50      3
0           0           bb  q222    100     5       0       0

Desired

id  q122_power  q122_count  q222_power  q222_count
aa  530         21          550         22
bb  950         77          1050        82

Logic

We start with the cons_power value of 500 for id 'aa' at q122, add the r_power value of 50, and subtract the d_power value of 20, which gives 530.

We then carry 530 forward as the baseline for q222 and repeat the process of adding r_power and subtracting d_power: 530 + 20 - 0 = 550.

The same concept applies to the counts: start with a count of 20 for 'aa' at q122, add 2, and subtract 1, giving 21.

We then carry 21 forward as the baseline for q222, adding r_count and subtracting d_count: 21 + 1 - 0 = 22.
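The running-total logic above can be checked with a short sketch in plain Python; the values are taken from the question's example rows for id 'aa':

```python
# Per-quarter rows for id 'aa', as given in the question's data
rows = [
    {'cons_power': 500, 'r_power': 50, 'd_power': 20,
     'cons_count': 20, 'r_count': 2, 'd_count': 1},   # q122
    {'cons_power': 0, 'r_power': 20, 'd_power': 0,
     'cons_count': 0, 'r_count': 1, 'd_count': 0},    # q222
]

power = count = 0
totals = []
for row in rows:
    # each quarter adds cons + r and subtracts d, carried from the prior quarter
    power += row['cons_power'] + row['r_power'] - row['d_power']
    count += row['cons_count'] + row['r_count'] - row['d_count']
    totals.append((power, count))

# totals == [(530, 21), (550, 22)]
```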

Doing

df['q122_power'] = df['r_power'].add(df.groupby(['id', 'date'])['r_power'].cumsum()).sub(df.groupby(['id', 'date'])['d_power'].cumsum())

df['q222_power'] = df['r_power'].add(df.groupby(['id', 'date'])['r_power'].cumsum()).sub(df.groupby(['id', 'date'])['d_power'].cumsum())

df['q122_count'] = df['r_count'].add(df.groupby(['id', 'date'])['r_count'].cumsum()).sub(df.groupby(['id', 'date'])['d_count'].cumsum())

df['q222_count'] = df['r_count'].add(df.groupby(['id', 'date'])['r_count'].cumsum()).sub(df.groupby(['id', 'date'])['d_count'].cumsum())

I'm trying to find a more elegant way to produce the desired output. I have been able to solve the grouping and calculations but am unsure how to incorporate the pivot step into the script. Any suggestions are appreciated.

CodePudding user response:

Compute power and count before the groupby, then pivot the dataframe. Finally, flatten the column MultiIndex and reset the row index.

out = df[['id', 'date']].join(
        df.assign(power=df['cons_power'] + df['r_power'] - df['d_power'],
                  count=df['cons_count'] + df['r_count'] - df['d_count'])
          .groupby('id')[['power', 'count']].cumsum()
    ).pivot(index='id', columns='date', values=['power', 'count'])

out.columns = out.columns.to_flat_index().map('_'.join)

Output

>>> out.reset_index()
   id  power_q122  power_q222  count_q122  count_q222
0  aa         530         550          21          22
1  bb         950        1050          77          82
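The flattening step can be seen in isolation: `to_flat_index()` turns the pivoted MultiIndex into tuples, and `map('_'.join)` glues each tuple into a single name. A minimal sketch, assuming the power/count level comes first as in the pivot output above:

```python
import pandas as pd

# Hypothetical column index shaped like the pivot output above
cols = pd.MultiIndex.from_product([['power', 'count'], ['q122', 'q222']])

# to_flat_index() yields tuples such as ('power', 'q122');
# '_'.join turns each tuple into a single flat name
flat = cols.to_flat_index().map('_'.join)
# flat: ['power_q122', 'power_q222', 'count_q122', 'count_q222']
```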

CodePudding user response:

We can also create a column MultiIndex so that the computation can be done using the top-level index groups cons, r, and d:

# Columns Not To Split into MultiIndex
df = df.set_index(['id', 'date'])
# Split into MultiIndex
df.columns = df.columns.str.split('_', expand=True)
df = (
    # Do Computations on the 2D top-level indexes
    (df['cons'] + df['r'] - df['d'])
        .groupby(level='id').cumsum()  # Rolling Total within IDs
        .unstack(level='date')  # Convert date index level into Column Level
)
# Collapse MultiIndex
df.columns = df.columns.map('_'.join)
# Restore ID column
df = df.reset_index()

df:

   id  power_q122  power_q222  count_q122  count_q222
0  aa         530         550          21          22
1  bb         950        1050          77          82

How it works:

set_index then create MultiIndex with str.split:

# Columns Not To Split into MultiIndex
df = df.set_index(['id', 'date'])
# Split into MultiIndex
df.columns = df.columns.str.split('_', expand=True)


         cons           r           d      
        power count power count power count
id date                                    
aa q122   500    20    50     2    20     1
   q222     0     0    20     1     0     0
bb q122  1000    80     0     0    50     3
   q222     0     0   100     5     0     0

We can now access the top level indexes and do the computation (index alignment will ensure that the computations occur within the lower level indexes power and count):

df['cons'] + df['r'] - df['d']

         power  count
id date              
aa q122    530     21
   q222     20      1
bb q122    950     77
   q222    100      5

Then take the groupby cumsum to get a rolling total within each id:

(df['cons'] + df['r'] - df['d']).groupby(level='id').cumsum()

         power  count
id date              
aa q122    530     21
   q222    550     22
bb q122    950     77
   q222   1050     82

unstack the Date column to convert the index level to a column level:

(df['cons'] + df['r'] - df['d']).groupby(level='id').cumsum().unstack(level='date')

     power       count     
date  q122  q222  q122 q222
id                         
aa     530   550    21   22
bb     950  1050    77   82

Now the rest is just cleanup. Collapse the MultiIndex with Index.map:

df.columns = df.columns.map('_'.join)

    power_q122  power_q222  count_q122  count_q222
id                                                
aa         530         550          21          22
bb         950        1050          77          82

And reset_index to restore a default range index:

df = df.reset_index()

   id  power_q122  power_q222  count_q122  count_q222
0  aa         530         550          21          22
1  bb         950        1050          77          82

Setup and imports:

import pandas as pd

df = pd.DataFrame({
    'cons_power': [500, 0, 1000, 0], 'cons_count': [20, 0, 80, 0],
    'id': ['aa', 'aa', 'bb', 'bb'], 'date': ['q122', 'q222', 'q122', 'q222'],
    'r_power': [50, 20, 0, 100], 'r_count': [2, 1, 0, 5],
    'd_power': [20, 0, 50, 0], 'd_count': [1, 0, 3, 0]
})

CodePudding user response:

Modify df so that any column starting with d_ has a negative sign:

df_c = pd.DataFrame({key: value * -1 if key.startswith('d_') else value
                     for key, value in df.items()})


df_c = df_c.set_index(['id', 'date'])

Filter for power and count and aggregate:

power = df_c.filter(like='power').sum(axis=1).rename('power')

count = df_c.filter(like='count').sum(axis=1).rename('count')

df_c = pd.concat([power, count], axis=1)

Compute the cumulative sum within each id and reshape:

df_c = (df_c.groupby('id')
            .cumsum()
            .unstack('date')
            .swaplevel(axis=1)
            .rename_axis(columns=[None, None])
         )

df_c.columns = df_c.columns.map('_'.join)

df_c.reset_index()
 
   id  q122_power  q222_power  q122_count  q222_count
0  aa         530         550          21          22
1  bb         950        1050          77          82
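Note that `swaplevel(axis=1)` is what puts the date first in the flattened names (q122_power rather than power_q122), matching the desired output. A minimal sketch of that reordering on a standalone MultiIndex shaped like the `unstack('date')` result:

```python
import pandas as pd

# Hypothetical (measure, date) column index as produced by unstack('date')
cols = pd.MultiIndex.from_tuples([('power', 'q122'), ('power', 'q222'),
                                  ('count', 'q122'), ('count', 'q222')])

# swaplevel() reorders each tuple to (date, measure) before joining
flat = cols.swaplevel().map('_'.join)
# flat: ['q122_power', 'q222_power', 'q122_count', 'q222_count']
```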
