I have a pandas dataframe which looks like this after the following code:
df['row_l0'] = df.groupby('Category',sort=False).ngroup() 1
df['row_l1'] = df.groupby(['Category','Process'],sort=False).ngroup() 1
df['row_l2'] = df.groupby(['Process','Parent'],sort=False).ngroup() 1
Category Process Parent row_l0 row_l1 row_l2
127 A a.5 a.5.4 1 5 24
128 A a.6 a.6.1 1 6 25
129 A a.6 a.6.2 1 6 26
130 A a.6 a.6.3 1 6 26
131 A a.6 a.6.4 1 6 27
132 A a.6 a.6.5 1 6 27
133 B b.1 b.1.1 2 7 28
134 B b.2 b.2.1 2 8 29
135 B b.2 b.2.2 2 8 29
For clarity, row_l0
relates to Category
, row_l1
relates to Process
and row_l2
to Parent
.
The row_l0
is correct, but I can't seem to be able to reset the count/grouping for the subsequent groups (row_l1
and row_l2
) when I get to category B (and beyond).
E.g. at index 133
, row_l0
is correctly assigned 2
, but for row_l1
instead of continuing from 6
to 7
, it should reset due to being a new Category
and assigned 1
Ideally, I'd end up with something like:
Category Process Parent row_l0 row_l1 row_l2
127 A a.5 a.5.4 1 5 24
128 A a.6 a.6.1 1 6 25
129 A a.6 a.6.2 1 6 26
130 A a.6 a.6.3 1 6 26
131 A a.6 a.6.4 1 6 27
132 A a.6 a.6.5 1 6 27
133 B b.1 b.1.1 2 1 1
134 B b.2 b.2.1 2 2 2
135 B b.2 b.2.2 2 2 3
I've been trying a mixture of combinations using groupby, rank and ngroup, and adapted various other questions (or this one) but can't seem to get it working.
CodePudding user response:
With the following dataframe:
import pandas as pd
df = pd.DataFrame(
{
"Category": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
"Process": ["a.5", "a.6", "a.6", "a.6", "a.6", "a.6", "b.1", "b.2", "b.2"],
"Parent": [
"a.5.4",
"a.6.1",
"a.6.2",
"a.6.3",
"a.6.4",
"a.6.5",
"b.1.1",
"b.2.1",
"b.2.2",
],
},
)
Here is one way to do it:
df["row_l0"] = df["Category"].apply(
lambda x: {col: i 1 for i, col in enumerate(df["Category"].unique())}[x]
)
df["row_l1"] = df["Process"].apply(lambda x: x[-1])
df["row_l2"] = [
j 1
for count in df["Parent"].str[0].value_counts().to_dict().values()
for j in range(count)
]
print(df)
# Output
Category Process Parent row_l0 row_l1 row_l2
0 A a.5 a.5.4 1 5 1
1 A a.6 a.6.1 1 6 2
2 A a.6 a.6.2 1 6 3
3 A a.6 a.6.3 1 6 4
4 A a.6 a.6.4 1 6 5
5 A a.6 a.6.5 1 6 6
6 B b.1 b.1.1 2 1 1
7 B b.2 b.2.1 2 2 2
8 B b.2 b.2.2 2 2 3
CodePudding user response:
To achieve the counting restarts you need to nest the groupby
s.
One way to do that would be:
df["row_l0"] = df.groupby("Category").ngroup() 1
for i, col in enumerate(("Process", "Parent"), start=1):
df[f"row_l{i}"] = (
df.groupby("Category")[col]
.transform(lambda ser: ser.groupby(ser).ngroup() 1)
)
Or if you only want to group by Category
once:
df["row_l0"] = df["row_l1"] = df["row_l2"] = 0
for i, (_, sdf) in enumerate(df.groupby("Category"), start=1):
idx = sdf.index
df.loc[idx, "row_l0"] = i
df.loc[idx, "row_l1"] = sdf.groupby("Process").ngroup() 1
df.loc[idx, "row_l2"] = sdf.groupby("Parent").ngroup() 1