What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan, 'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']
df['desired_output_level_1'] = [np.nan, np.nan, '1', '1', '1', '1', np.nan, '2', '2', '2', '2', '2', '2', np.nan, np.nan, '3', '3']
df['desired_output_level_2'] = [np.nan, np.nan, '1', '2', '3', '4', np.nan, '1', '2', '3', '4', '5', '6', np.nan, np.nan, '1', '2']
output:
test desired_output_level_1 desired_output_level_2
0 NaN NaN NaN
1 NaN NaN NaN
2 X 1 1
3 X 1 2
4 X 1 3
5 X 1 4
6 NaN NaN NaN
7 X 2 1
8 X 2 2
9 X 2 3
10 X 2 4
11 X 2 5
12 X 2 6
13 NaN NaN NaN
14 NaN NaN NaN
15 X 3 1
16 X 3 2
The test column can only contain X's or NaNs, and the number of consecutive X's is random.
In the 'desired_output_level_1' column, I'm trying to count the series of X's (1 for the first run, 2 for the second, and so on).
In the 'desired_output_level_2' column, I'm trying to number each row within its series, i.e. the running duration of the run.
Can anyone help? Thanks in advance.
CodePudding user response:
Perhaps not the most pandastic way, but it seems to yield what you are after.
Three key points:
- We operate only on rows that are not NaN, so first create a mask:
mask = df['test'].notna()
- For the level 1 computation, detect where each run of X's starts by comparing every row with the previous one (shifting by one row): an X whose previous row is NaN begins a new run. This also handles the edge case where the very first row is an X. The cumulative sum of those start flags labels the runs 1, 2, 3, ...:
df.loc[mask, "level_1"] = (mask & df["test"].shift().isna()).cumsum()
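To see why the shift/cumsum trick works, here is a tiny illustrative example (not from the original post): the boolean flags mark exactly where a run of X's begins, and their cumulative sum labels the runs.

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 'X', 'X', np.nan, np.nan, 'X'])
# True exactly where a run of X's begins (an X preceded by NaN or nothing)
starts = s.notna() & s.shift().isna()
print(starts.cumsum())  # 0, 1, 1, 1, 1, 2 -> run labels
```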
- For the level 2 computation, it's a bit trickier. One way to do it is to run the computation for each level_1 group and use .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
df.loc[mask, ["level_1"]]
.assign(level_2=1)
.groupby("level_1")["level_2"]
.transform("cumsum")
)
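An alternative (not in the original answer) that avoids the assign/transform dance is groupby().cumcount(), which numbers the rows within each group starting at 0; a minimal self-contained sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'test': [np.nan, 'X', 'X', np.nan, 'X']})
mask = df['test'].notna()
# Label each run of X's: a run starts on an X whose previous row is NaN
df.loc[mask, 'level_1'] = (mask & df['test'].shift().isna()).cumsum()
# cumcount() numbers rows within each level_1 group from 0, so add 1
df.loc[mask, 'level_2'] = df[mask].groupby('level_1').cumcount() + 1
```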
The last step (if needed) is to convert the columns to strings:
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
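For reference, the pieces above combined into one runnable script, using a previous-row comparison to find where each run starts (the final print shows the frame with both new columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['test'] = [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan,
              'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']

mask = df['test'].notna()

# level_1: a new run begins at each X preceded by NaN (or at row 0)
df.loc[mask, 'level_1'] = (mask & df['test'].shift().isna()).cumsum()

# level_2: running count within each level_1 group
df.loc[mask, 'level_2'] = (
    df.loc[mask, ['level_1']]
    .assign(level_2=1)
    .groupby('level_1')['level_2']
    .transform('cumsum')
)

# optional: convert to strings via the nullable Int64 dtype
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
print(df)
```

Note that after astype('Int64').astype('str') the NaN rows read '<NA>' rather than NaN; skip the conversion if you need the NaNs preserved.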