Let's say I have a dataframe of people with columns age ("old" or "young"), gender ("male" or "female"), and education ("no high school", "high school", or "college"). I want to assign each person to one of 12 groups based on the combination of these columns. The following code gets me what I want, but is there a more idiomatic way to do this in pandas?
i = 1
df['group'] = 0
for a in ['old', 'young']:
    for g in ['male', 'female']:
        for e in ["no high school", "high school", "college"]:
            df.loc[((df.age == a) &
                    (df.gender == g) &
                    (df.education == e)), 'group'] = i
            i = i + 1
CodePudding user response:
Yes, if I understand correctly:
df['group'] = df.groupby(['age', 'gender', 'education']).ngroup()
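
A minimal sketch of how this behaves, using made-up sample rows (the data below is an assumption, not from the question). Note that ngroup() numbers groups from 0, in sorted key order, and only for combinations that actually occur in the data, so the numbers will not necessarily match the loop in the question. Adding 1 at least makes the numbering start at 1:

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({
    "age": ["old", "young", "old", "young"],
    "gender": ["male", "female", "male", "male"],
    "education": ["college", "high school", "college", "no high school"],
})

# ngroup() labels each distinct (age, gender, education) combination,
# starting at 0 in sorted key order; + 1 shifts the labels to start at 1.
df["group"] = df.groupby(["age", "gender", "education"]).ngroup() + 1
print(df["group"].tolist())  # [1, 2, 1, 3]
```

Rows 0 and 2 share the same combination, so they get the same group number.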
CodePudding user response:
You can use pd.factorize:
cats = ['age', 'gender', 'education']
df['group'] = pd.factorize(df[cats].apply(frozenset, axis=1))[0]
print(df)
# Output:
     age  gender       education  group
0    old  female         college      0
1    old    male     high school      1
2    old    male  no high school      2
3    old    male         college      3
4    old    male  no high school      2
5  young    male     high school      4
6    old    male         college      3
7    old  female  no high school      5
8    old  female  no high school      5
9    old  female  no high school      5
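
One caveat worth sketching: a frozenset key ignores which column a value came from. That is fine for the columns in the question, because their values never overlap, but it can merge distinct combinations when two columns share values. The example below uses hypothetical columns smoker and drinker (not from the question) to show the difference, with tuple as a possible alternative key:

```python
import pandas as pd

# Hypothetical columns whose values overlap ("yes"/"no" in both).
df = pd.DataFrame({
    "smoker":  ["yes", "no", "yes", "no"],
    "drinker": ["no", "yes", "no", "no"],
})
cats = ["smoker", "drinker"]

# frozenset drops column order: ("yes", "no") and ("no", "yes") share one key.
by_set = pd.factorize(df[cats].apply(frozenset, axis=1))[0]

# tuple keeps each value tied to its column position.
by_tuple = pd.factorize(df[cats].apply(tuple, axis=1))[0]

print(by_set.tolist())    # [0, 0, 0, 1]
print(by_tuple.tolist())  # [0, 1, 0, 2]
```

As in the output above, factorize numbers the groups in order of first appearance rather than in sorted order.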