Home > database >  Pandas - Groupby by three columns with cumsum or cumcount or nggroup
Pandas - Groupby by three columns with cumsum or cumcount or nggroup

Time:04-14

I have the following dataframe:

ID    Name    Study   City   Grade   Group_ID
1     Vir     BE      Ahm     1        1
2     Vish    ME      Rjk     2        2
3     Keval   B.sc    Beng    4        3
4     Vir     M.sc    Ahm     1        4
5     Priya   B.com   Una     2        5
6     Vish    ME      Rjk     2        2
7     Keval   B.sc    Pune    4        3
8     Vish    ME      Ahm     2        NaN
9     Keval   B.sc    Pune    4        NaN
10    Yesha   M.sc    Beng    2        NaN  
11    Vasu    M.Tech  Hyd     0        6
12    Kria    B.Tech  Mub     6        7

I want to increase the number of group_ID columns by name, city and grade column. If the value of the name, city and grade column is present in the dataframe, then the value of Group_ID is to be taken. if not present then increase the Group_ID value by a specific number or max(Group_ID 1).

What is the best way to get the result of the dataframe below?

ID    Name    Study   City   Grade   Group_ID
1     Vir     BE      Ahm     1        1
2     Vish    ME      Rjk     2        2
3     Keval   B.sc    Beng    4        3
4     Vir     M.sc    Ahm     1        4
5     Priya   B.com   Una     2        5
6     Vish    ME      Rjk     2        2
7     Keval   B.sc    Pune    4        3
8     Vish    ME      Ahm     2        2
9     Keval   B.sc    Pune    4        3
10    Yesha   M.sc    Beng    2        8  
11    Vasu    M.Tech  Hyd     0        6
12    Kria    B.Tech  Mub     6        7

I have tried by df['Group_ID'] = df.groupby(['Name', 'City', 'Grade'], sort=False).ngroup() 8 but in won't worked for present values of column. I believe there will be an easier way to do this.

CodePudding user response:

You can use a loop:

max_group_id = df["Group_ID"].max()
groups = df.set_index(["Name", "City", "Grade"])["Group_ID"].dropna().to_dict()
group_ids = []

for _, (name, city, grade, id_) in df[["Name", "City", "Grade", "Group_ID"]].iterrows():
    # If a row already has a Group ID, use that
    # If not, find the group with the same (name, city, grade)
    # If there is no such group, create a new group:
    id_ = id_ if pd.notna(id_) else groups.get((name, city, grade), np.nan)
    if np.isnan(id_):
        id_ = max_group_id   1
        max_group_id  = 1
        groups[(name, city, grade)] = id_
    group_ids.append(id_)

df["Group_ID_New"] = np.array(group_ids, dtype="int")

CodePudding user response:

Use:

#replace missing values by existing groups
s = df.groupby(["Name", "City", "Grade"])["Group_ID"].transform('first')
df['Group_ID'] = df['Group_ID'].fillna(s)
#mask for missing values of not existing groups
mask = s.isna()
#only for these groups use ngroup with add maximal Group_ID
df.loc[mask, 'Group_ID'] = (df[mask].groupby(['Name', 'City', 'Grade'], sort=False).ngroup()
                               df["Group_ID"].max()   1)
#convert to integers
df['Group_ID'] = df['Group_ID'].astype(int)
print (df)
    ID   Name   Study  City  Grade  Group_ID
0    1    Vir      BE   Ahm      1         1
1    2   Vish      ME   Rjk      2         2
2    3  Keval    B.sc  Beng      4         3
3    4    Vir    M.sc   Ahm      1         4
4    5  Priya   B.com   Una      2         5
5    6   Vish      ME   Rjk      2         2
6    7  Keval    B.sc  Pune      4         3
7    8   Vish      ME   Ahm      2         8
8    9  Keval    B.sc  Pune      4         3
9   10  Yesha    M.sc  Beng      2         9
10  11   Vasu  M.Tech   Hyd      0         6
11  12   Kria  B.Tech   Mub      6         7

Output is different, because Vish,Rjk,2 not match Vish,Ahm,2, so new Group_ID was created.

  • Related