I have the following dataframe:
ID Name Study City Grade Group_ID
1 Vir BE Ahm 1 1
2 Vish ME Rjk 2 2
3 Keval B.sc Beng 4 3
4 Vir M.sc Ahm 1 4
5 Priya B.com Una 2 5
6 Vish ME Rjk 2 2
7 Keval B.sc Pune 4 3
8 Vish ME Ahm 2 NaN
9 Keval B.sc Pune 4 NaN
10 Yesha M.sc Beng 2 NaN
11 Vasu M.Tech Hyd 0 6
12 Kria B.Tech Mub 6 7
I want to increase the number of group_ID columns by name, city and grade column. If the value of the name, city and grade column is present in the dataframe, then the value of Group_ID is to be taken. if not present then increase the Group_ID value by a specific number or max(Group_ID 1).
What is the best way to get the result of the dataframe below?
ID Name Study City Grade Group_ID
1 Vir BE Ahm 1 1
2 Vish ME Rjk 2 2
3 Keval B.sc Beng 4 3
4 Vir M.sc Ahm 1 4
5 Priya B.com Una 2 5
6 Vish ME Rjk 2 2
7 Keval B.sc Pune 4 3
8 Vish ME Ahm 2 2
9 Keval B.sc Pune 4 3
10 Yesha M.sc Beng 2 8
11 Vasu M.Tech Hyd 0 6
12 Kria B.Tech Mub 6 7
I have tried by df['Group_ID'] = df.groupby(['Name', 'City', 'Grade'], sort=False).ngroup() 8 but in won't worked for present values of column. I believe there will be an easier way to do this.
CodePudding user response:
You can use a loop:
max_group_id = df["Group_ID"].max()
groups = df.set_index(["Name", "City", "Grade"])["Group_ID"].dropna().to_dict()
group_ids = []
for _, (name, city, grade, id_) in df[["Name", "City", "Grade", "Group_ID"]].iterrows():
# If a row already has a Group ID, use that
# If not, find the group with the same (name, city, grade)
# If there is no such group, create a new group:
id_ = id_ if pd.notna(id_) else groups.get((name, city, grade), np.nan)
if np.isnan(id_):
id_ = max_group_id 1
max_group_id = 1
groups[(name, city, grade)] = id_
group_ids.append(id_)
df["Group_ID_New"] = np.array(group_ids, dtype="int")
CodePudding user response:
Use:
#replace missing values by existing groups
s = df.groupby(["Name", "City", "Grade"])["Group_ID"].transform('first')
df['Group_ID'] = df['Group_ID'].fillna(s)
#mask for missing values of not existing groups
mask = s.isna()
#only for these groups use ngroup with add maximal Group_ID
df.loc[mask, 'Group_ID'] = (df[mask].groupby(['Name', 'City', 'Grade'], sort=False).ngroup()
df["Group_ID"].max() 1)
#convert to integers
df['Group_ID'] = df['Group_ID'].astype(int)
print (df)
ID Name Study City Grade Group_ID
0 1 Vir BE Ahm 1 1
1 2 Vish ME Rjk 2 2
2 3 Keval B.sc Beng 4 3
3 4 Vir M.sc Ahm 1 4
4 5 Priya B.com Una 2 5
5 6 Vish ME Rjk 2 2
6 7 Keval B.sc Pune 4 3
7 8 Vish ME Ahm 2 8
8 9 Keval B.sc Pune 4 3
9 10 Yesha M.sc Beng 2 9
10 11 Vasu M.Tech Hyd 0 6
11 12 Kria B.Tech Mub 6 7
Output is different, because Vish,Rjk,2
not match Vish,Ahm,2
, so new Group_ID
was created.