Pandas: Detect change in a group and change all other entries-CodePudding

I have a couple of datasets with names and ids. One person can be present in a dataset more then one time then this person has the same id in the dataset. The name of person can differ within the dataset and over all datasets. I need to assign UID to each person across all datasets.

This is a example dataset, where two datasets were merged into one. Ids starting with letter 'a' are from first dataset and ids starting with 'b' from the second one.

df = pd.DataFrame({'name': ['Bob', 'Brb R.','Alice', 'John', 'Jn D.','Suzy', 'bob', 'greg', 'john', 'suzy', 'suz Q.', 'sz Qual'],
                    'id':  ['a1',   'a1',   'a2',     'a3',   'a3',    'a4',  'b1',  'b2',   'b3',   'b4',    'b4',     'b4']})

    name    id
0   Bob     a1
1   Brb R.  a1
2   Alice   a2
3   John    a3
4   Jn D.   a3
5   Suzy    a4
6   bob     b1
7   greg    b2
8   john    b3
9   suzy    b4
10  suz Q.  b4
11  sz Qual     b4

My idea was to set UID to groups with same id which woud result in something like this:

    name     id    uid
0   Bob      a1     u1
1   Brb R.   a1     u1
2   Alice    a2     u2
3   John     a3     u3
4   Jn D.    a3     u3
5   Suzy     a4     u4
6   bob      b1     u5
7   greg     b2     u6
8   john     b3     u7
9   suzy     b4     u8
10  suz Q.   b4     u8
11  sz Qual  b4     u8

Then I would find a match over lowercased names and assign the same uid over datasets. I would like to know how can I detect the change of UID in one group and change all others in the group accordingly. Let's say I'm gonna find a match on the name 'suzy' and I will assign uid 'u4' to 'suzy' with id 'b4', how could I assign uid 'u4' to all others in the group with id 'b4' (suz Q., sz Qual).

The result should look like this:

    name     id     uid
0   Bob      a1     u1
1   Brb R.   a1     u1
2   Alice    a2     u2
3   John     a3     u3
4   Jn D.    a3     u3
5   Suzy     a4     u4
6   bob      b1     u1
7   greg     b2     u6
8   john     b3     u7
9   suzy     b4     u4
10  suz Q.   b4     u4
11  sz Qual  b4     u4

uniq = df['id'].unique()
df['uid'] = np.nan
for id in uniq:
    uid = uuid.uuid4().hex
    df.loc[df['id'] == id, "uid"] = uid

df['names lower'] = df['name'].str.lower()

Thank you for any advice.

CodePudding user response：

You can use:

df['uid'] = 'u'   df['id'].ne(df['id'].shift()).cumsum().astype(str)
print(df)

# Output
       name  id uid
0       Bob  a1  u1
1    Brb R.  a1  u1
2     Alice  a2  u2
3      John  a3  u3
4     Jn D.  a3  u3
5      Suzy  a4  u4
6       bob  b1  u5
7      greg  b2  u6
8      john  b3  u7
9      suzy  b4  u8
10   suz Q.  b4  u8
11  sz Qual  b4  u8

CodePudding user response：

Try with groupby and ngroup:

df["uid"] = "u" df.groupby("id").ngroup().add(1).astype(str)

>>> df
       name  id uid
0       Bob  a1  u1
1    Brb R.  a1  u1
2     Alice  a2  u2
3      John  a3  u3
4     Jn D.  a3  u3
5      Suzy  a4  u4
6       bob  b1  u5
7      greg  b2  u6
8      john  b3  u7
9      suzy  b4  u8
10   suz Q.  b4  u8
11  sz Qual  b4  u8