What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan, 'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']
df['desired_output_level_1'] = [np.nan, np.nan, '1', '1', '1', '1', np.nan, '2', '2', '2', '2', '2', '2', np.nan, np.nan, '3', '3']
df['desired_output_level_2'] = [np.nan, np.nan, '1', '2', '3', '4', np.nan, '1', '2', '3', '4', '5', '6', np.nan, np.nan, '1', '2']
output:
test desired_output_level_1 desired_output_level_2
0 NaN NaN NaN
1 NaN NaN NaN
2 X 1 1
3 X 1 2
4 X 1 3
5 X 1 4
6 NaN NaN NaN
7 X 2 1
8 X 2 2
9 X 2 3
10 X 2 4
11 X 2 5
12 X 2 6
13 NaN NaN NaN
14 NaN NaN NaN
15 X 3 1
16 X 3 2
The test column can only contain X's or NaNs, and the number of consecutive X's is random.
In the 'desired_output_level_1' column, I'm trying to count the series of X's (1 for the first run, 2 for the second, and so on).
In the 'desired_output_level_2' column, I'm trying to number each row within its series, i.e. the running duration of the run.
Can anyone help? Thanks in advance.
CodePudding user response:
Perhaps not the most pandastic way, but it seems to yield what you are after.
Three key points:
- We operate only on rows that are not NaN, so first create a mask:
mask = df['test'].notna()
- For the level 1 computation, detect where each run of X's starts by comparing every row with the previous one (shifting by one row): an X whose previous row is NaN begins a new run. This also handles the edge case where the very first row is an X. The cumulative sum of those start flags labels the runs 1, 2, 3, ...:
df.loc[mask, "level_1"] = (mask & df["test"].shift().isna()).cumsum()
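To see why the shift/cumsum trick works, here is a tiny illustrative example (not from the original post): the boolean flags mark exactly where a run of X's begins, and their cumulative sum labels the runs.

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 'X', 'X', np.nan, np.nan, 'X'])
# True exactly where a run of X's begins (an X preceded by NaN or nothing)
starts = s.notna() & s.shift().isna()
print(starts.cumsum())  # 0, 1, 1, 1, 1, 2 -> run labels
```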
- For the level 2 computation, it's a bit trickier. One way to do it is to run the computation for each level_1 group and use .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
df.loc[mask, ["level_1"]]
.assign(level_2=1)
.groupby("level_1")["level_2"]
.transform("cumsum")
)
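An alternative (not in the original answer) that avoids the assign/transform dance is groupby().cumcount(), which numbers the rows within each group starting at 0; a minimal self-contained sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'test': [np.nan, 'X', 'X', np.nan, 'X']})
mask = df['test'].notna()
# Label each run of X's: a run starts on an X whose previous row is NaN
df.loc[mask, 'level_1'] = (mask & df['test'].shift().isna()).cumsum()
# cumcount() numbers rows within each level_1 group from 0, so add 1
df.loc[mask, 'level_2'] = df[mask].groupby('level_1').cumcount() + 1
```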
The last step (if needed) is to convert the columns to strings:
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
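For reference, the pieces above combined into one runnable script, using a previous-row comparison to find where each run starts (the final print shows the frame with both new columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['test'] = [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan,
              'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']

mask = df['test'].notna()

# level_1: a new run begins at each X preceded by NaN (or at row 0)
df.loc[mask, 'level_1'] = (mask & df['test'].shift().isna()).cumsum()

# level_2: running count within each level_1 group
df.loc[mask, 'level_2'] = (
    df.loc[mask, ['level_1']]
    .assign(level_2=1)
    .groupby('level_1')['level_2']
    .transform('cumsum')
)

# optional: convert to strings via the nullable Int64 dtype
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
print(df)
```

Note that after astype('Int64').astype('str') the NaN rows read '<NA>' rather than NaN; skip the conversion if you need the NaNs preserved.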