How to reset the incrementing values when assigning values to groups in a pandas dataframe?-CodePudding

I have a pandas dataframe which looks like this after the following code:

df['row_l0'] = df.groupby('Category',sort=False).ngroup() 1
df['row_l1'] = df.groupby(['Category','Process'],sort=False).ngroup() 1
df['row_l2'] = df.groupby(['Process','Parent'],sort=False).ngroup() 1

    Category    Process Parent      row_l0  row_l1  row_l2
127         A       a.5     a.5.4       1       5       24
128         A       a.6     a.6.1       1       6       25
129         A       a.6     a.6.2       1       6       26
130         A       a.6     a.6.3       1       6       26
131         A       a.6     a.6.4       1       6       27
132         A       a.6     a.6.5       1       6       27
133         B       b.1     b.1.1       2       7       28
134         B       b.2     b.2.1       2       8       29
135         B       b.2     b.2.2       2       8       29

For clarity, row_l0 relates to Category, row_l1 relates to Process and row_l2 to Parent.

The row_l0 is correct, but I can't seem to be able to reset the count/grouping for the subsequent groups (row_l1 and row_l2) when I get to category B (and beyond).

E.g. at index 133, row_l0 is correctly assigned 2, but for row_l1 instead of continuing from 6 to 7, it should reset due to being a new Category and assigned 1

Ideally, I'd end up with something like:

    Category    Process Parent      row_l0  row_l1  row_l2
127         A       a.5     a.5.4       1       5       24
128         A       a.6     a.6.1       1       6       25
129         A       a.6     a.6.2       1       6       26
130         A       a.6     a.6.3       1       6       26
131         A       a.6     a.6.4       1       6       27
132         A       a.6     a.6.5       1       6       27
133         B       b.1     b.1.1       2       1       1
134         B       b.2     b.2.1       2       2       2
135         B       b.2     b.2.2       2       2       3

I've been trying a mixture of combinations using groupby, rank and ngroup, and adapted various other questions (or this one) but can't seem to get it working.

CodePudding user response：

With the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "Category": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
        "Process": ["a.5", "a.6", "a.6", "a.6", "a.6", "a.6", "b.1", "b.2", "b.2"],
        "Parent": [
            "a.5.4",
            "a.6.1",
            "a.6.2",
            "a.6.3",
            "a.6.4",
            "a.6.5",
            "b.1.1",
            "b.2.1",
            "b.2.2",
        ],
    },
)

Here is one way to do it:

df["row_l0"] = df["Category"].apply(
    lambda x: {col: i   1 for i, col in enumerate(df["Category"].unique())}[x]
)

df["row_l1"] = df["Process"].apply(lambda x: x[-1])

df["row_l2"] = [
    j   1
    for count in df["Parent"].str[0].value_counts().to_dict().values()
    for j in range(count)
]

print(df)
# Output
  Category Process Parent  row_l0 row_l1  row_l2
0        A     a.5  a.5.4       1      5       1
1        A     a.6  a.6.1       1      6       2
2        A     a.6  a.6.2       1      6       3
3        A     a.6  a.6.3       1      6       4
4        A     a.6  a.6.4       1      6       5
5        A     a.6  a.6.5       1      6       6
6        B     b.1  b.1.1       2      1       1
7        B     b.2  b.2.1       2      2       2
8        B     b.2  b.2.2       2      2       3

CodePudding user response：

To achieve the counting restarts you need to nest the groupbys.

One way to do that would be:

df["row_l0"] = df.groupby("Category").ngroup()   1
for i, col in enumerate(("Process", "Parent"), start=1):
    df[f"row_l{i}"] = (
        df.groupby("Category")[col]
        .transform(lambda ser: ser.groupby(ser).ngroup()   1)
    )

Or if you only want to group by Category once:

df["row_l0"] = df["row_l1"] = df["row_l2"] = 0
for i, (_, sdf) in enumerate(df.groupby("Category"), start=1):
    idx = sdf.index
    df.loc[idx, "row_l0"] = i
    df.loc[idx, "row_l1"] = sdf.groupby("Process").ngroup()   1
    df.loc[idx, "row_l2"] = sdf.groupby("Parent").ngroup()   1