Create groupings based off of multiple categorical variables in Python Pandas?

Let's say I have a dataframe of people with the columns age ("old" or "young"), gender ("male" or "female"), and education ("no high school", "high school", or "college"). I want to assign each person to one of 12 groups based on these columns. The following code gets me what I want, but is there a more idiomatic way to do this in Pandas?

i = 1
df['group'] = 0
for a in ['old', 'young']:
    for g in ['male', 'female']:
        for e in ["no high school", "high school", "college"]:
            df.loc[((df.age == a) &
                    (df.gender == g) &
                    (df.education == e)), 'group'] = i
            i = i + 1

CodePudding user response：

Yes, if I understand correctly:

df['group'] = df.groupby(['age', 'gender', 'education']).ngroup()
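
To make the numbering concrete, here is a minimal sketch on a small sample frame (the rows are made up for illustration). ngroup() numbers the groups from 0 in sorted key order and only covers combinations that actually occur in the data, so the labels will not necessarily match the 1 to 12 numbering of the loop in the question. Adding 1 gives labels that start at 1.

```python
import pandas as pd

# Small illustrative frame; the rows are sample data, not the asker's real data.
df = pd.DataFrame({
    "age": ["old", "old", "old", "old", "old",
            "young", "old", "old", "old", "old"],
    "gender": ["female", "male", "male", "male", "male",
               "male", "male", "female", "female", "female"],
    "education": ["college", "high school", "no high school", "college",
                  "no high school", "high school", "college",
                  "no high school", "no high school", "no high school"],
})

# ngroup() labels each row with its group's position in sorted key order,
# starting at 0; adding 1 makes the labels start at 1 instead.
df["group"] = df.groupby(["age", "gender", "education"]).ngroup() + 1
print(df)
```

Only six distinct combinations occur in this sample, so the labels run from 1 to 6 rather than 1 to 12.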

CodePudding user response：

You can use pd.factorize, which numbers each distinct combination from 0 in order of first appearance:

cats = ['age', 'gender', 'education']
df['group'] = pd.factorize(df[cats].apply(frozenset, axis=1))[0]
print(df)

# Output:
     age  gender       education  group
0    old  female         college      0
1    old    male     high school      1
2    old    male  no high school      2
3    old    male         college      3
4    old    male  no high school      2
5  young    male     high school      4
6    old    male         college      3
7    old  female  no high school      5
8    old  female  no high school      5
9    old  female  no high school      5
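
One caveat with the frozenset approach: a frozenset does not remember which column a value came from. That is harmless for the columns in the question, since age, gender, and education share no labels, but it can merge distinct rows when columns do share labels. The sketch below uses a hypothetical two-column frame (the names first and second are made up for illustration) to show the difference, with tuple as an order-preserving alternative.

```python
import pandas as pd

# Hypothetical frame in which both columns use the same labels.
df = pd.DataFrame({
    "first": ["yes", "no", "yes"],
    "second": ["no", "yes", "yes"],
})
cols = ["first", "second"]

# frozenset ignores column position, so ("yes", "no") and ("no", "yes")
# end up in the same group.
as_sets = pd.factorize(df[cols].apply(frozenset, axis=1))[0]

# tuple keeps column position, so each distinct row gets its own code.
as_tuples = pd.factorize(df[cols].apply(tuple, axis=1))[0]

print(as_sets.tolist())    # rows 0 and 1 share a code
print(as_tuples.tolist())  # every row has a different code
```

If the columns might ever share labels, the tuple variant is the safer choice.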