Pandas - dense rank but keep current group numbers

I'm working with a pandas DataFrame that looks like this:

import pandas as pd

data = {
  "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
  "id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)

----------------------

      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   0
4     Mary   0
5   Andrew   0
6  Michael   2

I'm trying to group the rows by the "name" column while keeping the group numbers that already exist. An id of 0 means that no number has been assigned yet. In the example above, every occurrence of Andrew should get 3 and every occurrence of James should get 1. Mary has no assignment at all, so she should get the next unused number.

The expected output:

      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2

I've spent time already trying to figure this out. I managed to get to something like this:

df.loc[df["id"].eq(0), "id"] = ( df['name'].rank(method='dense').astype(int))

The issue with the above is that it ignores records equal to 0, so the numbers are incorrect. I removed that part (values equal to 0), but then the numbering is not preserved.
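
To make the problem concrete, here is a small sketch of what that attempt produces on the sample frame. The variable names `ranks` and `attempt` are introduced here only for illustration. The dense rank is computed over all names alphabetically, so the numbers written into the 0 rows are unrelated to the ids already present:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2],
})

# Dense rank over the whole "name" column: the alphabetical position of each name.
ranks = df["name"].rank(method="dense").astype(int)

# Only rows with id == 0 are overwritten, and they receive the alphabetical rank.
attempt = df.copy()
attempt.loc[attempt["id"].eq(0), "id"] = ranks

print(attempt["id"].tolist())  # [3, 3, 1, 2, 3, 1, 2]
```

James ends up with both 1 and 2, Andrew with both 3 and 1, and Mary collides with Andrew's 3.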

Could you please help me?

CodePudding user response：

Replace the 0 values with missing values, then use GroupBy.transform with 'first' so that every row of a name gets that name's first existing id. For the names that still have no id, compute Series.rank on those names, add the maximal id, and convert the result to integers:

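import numpy as np

# treat 0 as "no assignment"
df = df.replace({'id': {0: np.nan}})
# propagate the first existing id within each name
df['id'] = df.groupby('name')['id'].transform('first')

# rank the names that still have no id, offset by the largest existing id
s = df.loc[df["id"].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print(df)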
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
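
The same steps can be wrapped in a function, which makes it easier to check the behaviour when several names have no id. This is only a sketch: the helper name `fill_group_ids` and its parameters `key` and `id_col` are hypothetical, not part of pandas, and starting from 0 when no id exists at all is an assumption.

```python
import numpy as np
import pandas as pd


def fill_group_ids(df, key="name", id_col="id"):
    """Hypothetical helper: keep existing ids per group, number the rest."""
    out = df.copy()
    # Treat 0 as "no assignment".
    ids = out[id_col].replace(0, np.nan)
    # Broadcast the first non-missing id of each group to all of its rows.
    ids = ids.groupby(out[key]).transform("first")
    # Dense-rank the names that still have no id, offset by the largest existing id
    # (assumption: start from 0 when no id exists at all).
    missing = ids.isna()
    start = ids.max() if ids.notna().any() else 0
    new_ids = out.loc[missing, key].rank(method="dense") + start
    out[id_col] = ids.fillna(new_ids).astype(int)
    return out


sample = pd.DataFrame({
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2],
})
print(fill_group_ids(sample)["id"].tolist())  # [3, 3, 1, 1, 4, 3, 2]
```

Because rank is used, names without an id are numbered in alphabetical order, not in order of appearance.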

CodePudding user response：

IIUC, you can first fill in the non-zero IDs with groupby.transform('max'), which gives each name its largest existing ID. Then, on the masked rows, give the names that have no ID the next available IDs (you can use factorize or rank, as you prefer):

# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)

# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0] + df['id'].max() + 1
# or rank, although factorize is more appropriate for non-numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense') + df['id'].max()

# optional, if you want integers
df['id'] = df['id'].convert_dtypes()

output:

      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
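
The factorize variant adds 1 and the rank variant does not, because the two number their labels differently. A short sketch on a made-up Series (the names below are illustrative only) shows the difference:

```python
import pandas as pd

names = pd.Series(["Zoe", "Bob", "Zoe", "Mary"])

# factorize numbers labels in order of first appearance, starting at 0.
codes, uniques = pd.factorize(names)
print(codes.tolist())  # [0, 1, 0, 2]

# rank(method="dense") numbers labels in sorted order, starting at 1.
ranks = names.rank(method="dense").astype(int)
print(ranks.tolist())  # [3, 1, 3, 2]
```

So with factorize the new IDs follow the order in which the names appear in the frame, while with rank they follow alphabetical order. Passing sort=True to pd.factorize gives sorted codes as well, still starting at 0.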