what is the best way to create running total columns in pandas


What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?

input:

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['test'] = [np.nan,np.nan,'X','X','X','X',np.nan,'X','X','X','X','X','X',np.nan,np.nan,'X','X']
df['desired_output_level_1'] = [np.nan,np.nan,'1','1','1','1',np.nan,'2','2','2','2','2','2',np.nan,np.nan,'3','3']
df['desired_output_level_2'] = [np.nan,np.nan,'1','2','3','4',np.nan,'1','2','3','4','5','6',np.nan,np.nan,'1','2']

output:

   test desired_output_level_1 desired_output_level_2
0   NaN                    NaN                    NaN
1   NaN                    NaN                    NaN
2     X                      1                      1
3     X                      1                      2
4     X                      1                      3
5     X                      1                      4
6   NaN                    NaN                    NaN
7     X                      2                      1
8     X                      2                      2
9     X                      2                      3
10    X                      2                      4
11    X                      2                      5
12    X                      2                      6
13  NaN                    NaN                    NaN
14  NaN                    NaN                    NaN
15    X                      3                      1
16    X                      3                      2

The test column can only contain X's or NaNs. The number of consecutive X's is random.

In the 'desired_output_level_1' column, I'm trying to number each series of X's.

In the 'desired_output_level_2' column, I'm trying to count the running position within each series (its duration so far).

Can anyone help? Thanks in advance.

CodePudding user response:

Perhaps not the most pandastic way, but seems to yield what you are after.

Three key points:

  1. We operate only on rows that are not NaN, so first create a mask:
mask = df['test'].notna()
  2. For the level-1 computation, it is easy to detect a change from NaN to non-NaN by shifting the rows by one:
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()
  3. The level-2 computation is a bit trickier. One way is to run the computation for each level_1 group and use .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
    df.loc[mask, ["level_1"]]
    .assign(level_2=1)
    .groupby("level_1")["level_2"]
    .transform("cumsum")
)

The last step (if needed) is to convert the columns to strings. Note that astype('Int64').astype('str') renders missing values as the string '<NA>' rather than leaving them as NaN:

df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
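Putting the three steps together into one self-contained script (checked against the desired output on the question's sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['test'] = [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan,
              'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']

# Step 1: mask the rows that belong to a series of X's.
mask = df['test'].notna()

# Step 2: a NaN row immediately followed by a non-NaN row marks the start
# of a new series; cumsum() turns those boundaries into running series numbers.
df.loc[mask, 'level_1'] = (df['test'].isna() & df['test'].shift(-1).notna()).cumsum()

# Step 3: a running count within each level_1 group.
df.loc[mask, 'level_2'] = (
    df.loc[mask, ['level_1']]
    .assign(level_2=1)
    .groupby('level_1')['level_2']
    .transform('cumsum')
)
```

One caveat: the shift(-1) trick relies on a NaN row preceding every series, so if the frame ever starts with an 'X', the first series would be numbered 0. Marking series starts instead, e.g. (mask & df['test'].shift().isna()).cumsum(), handles that case as well.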