I have a data frame like this:
import pandas as pd

d = {'id1': ['3def', 'xyx1', 'roro', 'p344o', 'fjfje'],
     'id2': ['hhyy', 'drdr', 'xyx1', 'xyx1', 'xyx1'],
     'score': [100, 90, 100, 90, 70]}
df1 = pd.DataFrame(d)
df1
     id1   id2  score
0   3def  hhyy    100
1   xyx1  drdr     90
2   roro  xyx1    100
3  p344o  xyx1     90
4  fjfje  xyx1     70
My end goal is to give a unique key to each row. The catch is that rows whose id2 equals another row's id1 and whose score is above 80 should share a common key with that row (in my case, indexes 1, 2 and 3). So my desired result looks like this:
     id1   id2  score  uid
0   3def  hhyy    100    1
1   xyx1  drdr     90    2
2   roro  xyx1    100    2
3  p344o  xyx1     90    2
4  fjfje  xyx1     70    3
I tried the following, but it doesn't do the trick:
msk = (((df1['id1'].isin(df1['id2']).shift(-1)) | (df1['id2'].isin(df1['id2']).shift(-1))) &
       (df1['score'] > 80).shift(-1))
df1['uid'] = msk.cumsum()
print(df1)
     id1   id2  score  uid
0   3def  hhyy    100    1
1   xyx1  drdr     90    2
2   roro  xyx1    100    3
3  p344o  xyx1     90    3
4  fjfje  xyx1     70    3
Any help on how to solve this would be much appreciated.
CodePudding user response:
You can use the following algorithm:
- create a uid per group of id1
- map those uids onto the values of id2, and replace a row's uid with the mapped one if its score matches the criterion
- finally, fix the non-sequential uids by subtracting the extra difference
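# step 1: one uid per distinct id1, numbered in order of appearance
df1['uid'] = df1.groupby('id1', sort=False).ngroup().add(1)
# step 2: for each row, look up the uid of the row whose id1 equals this row's id2 (<NA> if none)
alt_uid = df1['id2'].map(df1.set_index('id1')['uid']).convert_dtypes()
# step 3: adopt the looked-up uid when the score is above 80 and a match exists
df1['uid'] = df1['uid'].mask(df1['score'].gt(80) & alt_uid.notna(), alt_uid)
# final uid: pull back a uid that jumps by more than 1 from the previous row
df1['uid'] -= df1['uid'].diff().sub(1).fillna(0).clip(lower=0)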
output:
     id1   id2  score  uid
0   3def  hhyy    100    1
1   xyx1  drdr     90    2
2   roro  xyx1    100    2
3  p344o  xyx1     90    2
4  fjfje  xyx1     70    3
output with intermediary steps:
     id1   id2  score  uid  uid_step1  uid_step2  uid_step3
0   3def  hhyy    100    1          1       <NA>          1
1   xyx1  drdr     90    2          2       <NA>          2
2   roro  xyx1    100    2          3          2          2
3  p344o  xyx1     90    2          4          2          2
4  fjfje  xyx1     70    3          5          2          5
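One thing to keep in mind: the diff-based correction only adjusts the row where the jump happens, so if more rows follow the gap, later uids may stay non-sequential. If that matters, here is a minimal sketch of the same three steps with the renumbering done by `pd.factorize` instead. The helper name `assign_uid`, its `threshold` parameter and the extra sixth row are assumptions made for illustration; they are not part of the question.

```python
import pandas as pd


def assign_uid(df, threshold=80):
    """Hypothetical helper: same idea as the steps above, renumbered with factorize."""
    # step 1: one uid per distinct id1, in order of appearance
    uid = df.groupby('id1', sort=False).ngroup().add(1)
    # step 2: uid of the row whose id1 equals this row's id2 (NaN if there is none)
    alt_uid = df['id2'].map(dict(zip(df['id1'], uid)))
    # step 3: adopt the looked-up uid when the score is above the threshold
    uid = uid.mask(df['score'].gt(threshold) & alt_uid.notna(), alt_uid)
    # renumber 1, 2, 3, ... in order of first appearance, whatever the gaps were
    codes, _ = pd.factorize(uid)
    return pd.Series(codes + 1, index=df.index, name='uid')


demo = pd.DataFrame({
    'id1': ['3def', 'xyx1', 'roro', 'p344o', 'fjfje'],
    'id2': ['hhyy', 'drdr', 'xyx1', 'xyx1', 'xyx1'],
    'score': [100, 90, 100, 90, 70],
})
print(assign_uid(demo).tolist())   # [1, 2, 2, 2, 3]

# made-up extra row after the gap, to show the renumbering stays sequential
extra = pd.concat(
    [demo, pd.DataFrame({'id1': ['zzzz'], 'id2': ['qqqq'], 'score': [50]})],
    ignore_index=True,
)
print(assign_uid(extra).tolist())  # [1, 2, 2, 2, 3, 4]
```

Note that `factorize` numbers values by first appearance, so rows that end up sharing a uid get the same number even when they are not adjacent.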
CodePudding user response:
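I think it's easier to create a custom function:
def func(data):
    # idx is the running uid; cur_id is the id1 of the row that started the current group
    idx, cur_id, result = 0, None, []
    for x, y, z in data[['id1', 'id2', 'score']].to_numpy():
        # start a new group unless this row's id2 points at the current group's id1
        # and its score is above 80
        if not (cur_id == y and z > 80):
            cur_id = x
            idx += 1
        result.append(idx)
    return result

df1["new"] = func(df1)
print(df1)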
     id1   id2  score  new
0   3def  hhyy    100    1
1   xyx1  drdr     90    2
2   roro  xyx1    100    2
3  p344o  xyx1     90    2
4  fjfje  xyx1     70    3