I have a couple of datasets with names and ids. One person can be present in a dataset more then one time then this person has the same id in the dataset. The name of person can differ within the dataset and over all datasets. I need to assign UID to each person across all datasets.
This is a example dataset, where two datasets were merged into one. Ids starting with letter 'a' are from first dataset and ids starting with 'b' from the second one.
df = pd.DataFrame({'name': ['Bob', 'Brb R.','Alice', 'John', 'Jn D.','Suzy', 'bob', 'greg', 'john', 'suzy', 'suz Q.', 'sz Qual'],
'id': ['a1', 'a1', 'a2', 'a3', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'b4', 'b4']})
name id
0 Bob a1
1 Brb R. a1
2 Alice a2
3 John a3
4 Jn D. a3
5 Suzy a4
6 bob b1
7 greg b2
8 john b3
9 suzy b4
10 suz Q. b4
11 sz Qual b4
My idea was to set UID to groups with same id which woud result in something like this:
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u5
7 greg b2 u6
8 john b3 u7
9 suzy b4 u8
10 suz Q. b4 u8
11 sz Qual b4 u8
Then I would find a match over lowercased names and assign the same uid over datasets. I would like to know how can I detect the change of UID in one group and change all others in the group accordingly. Let's say I'm gonna find a match on the name 'suzy' and I will assign uid 'u4' to 'suzy' with id 'b4', how could I assign uid 'u4' to all others in the group with id 'b4' (suz Q., sz Qual).
The result should look like this:
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u1
7 greg b2 u6
8 john b3 u7
9 suzy b4 u4
10 suz Q. b4 u4
11 sz Qual b4 u4
uniq = df['id'].unique()
df['uid'] = np.nan
for id in uniq:
uid = uuid.uuid4().hex
df.loc[df['id'] == id, "uid"] = uid
df['names lower'] = df['name'].str.lower()
Thank you for any advice.
CodePudding user response:
You can use:
df['uid'] = 'u' df['id'].ne(df['id'].shift()).cumsum().astype(str)
print(df)
# Output
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u5
7 greg b2 u6
8 john b3 u7
9 suzy b4 u8
10 suz Q. b4 u8
11 sz Qual b4 u8
CodePudding user response:
Try with groupby
and ngroup
:
df["uid"] = "u" df.groupby("id").ngroup().add(1).astype(str)
>>> df
name id uid
0 Bob a1 u1
1 Brb R. a1 u1
2 Alice a2 u2
3 John a3 u3
4 Jn D. a3 u3
5 Suzy a4 u4
6 bob b1 u5
7 greg b2 u6
8 john b3 u7
9 suzy b4 u8
10 suz Q. b4 u8
11 sz Qual b4 u8