Give unique keys in a pandas df with rows that satisfy condition having common key

Time:11-26

I have a data frame like this:

import pandas as pd

d = {'id1': ['3def', 'xyx1', 'roro', 'p344o', 'fjfje'],
     'id2': ['hhyy', 'drdr', 'xyx1', 'xyx1', 'xyx1'],
     'score': [100, 90, 100, 90, 70]}
df1 = pd.DataFrame(d)
df1
    id1     id2     score
0   3def    hhyy    100
1   xyx1    drdr    90
2   roro    xyx1    100
3   p344o   xyx1    90
4   fjfje   xyx1    70

My end goal is to give a unique key to each row. The problem is that I would like rows that are linked by a shared ID (the id1 of one row equals the id2 of the others) and that have a score of more than 80 to share a common key (in my case, the rows at index 1, 2 and 3). My desired result looks like this:

    id1     id2     score    uid
0   3def    hhyy    100      1
1   xyx1    drdr    90       2
2   roro    xyx1    100      2
3   p344o   xyx1    90       2
4   fjfje   xyx1    70       3

I have tried the following, but it doesn't do the trick:

msk = (((df1['id1'].isin(df1['id2']).shift(-1)) | (df1['id2'].isin(df1['id2']).shift(-1))) &\
           (df1['score'] > 80).shift(-1))

df1['uid'] = msk.cumsum()

print(df1)
     id1      id2   score   uid
0   3def    hhyy    100     1
1   xyx1    drdr    90      2
2   roro    xyx1    100     3
3   p344o   xyx1    90      3
4   fjfje   xyx1    70      3

Any help on how to solve this would be much appreciated.

CodePudding user response:

You can use the following algorithm:

  • create a uid per group of id1
  • map those uids onto the values of id2, and replace the uid when the score meets the criterion
  • finally, fix the non-sequential uids by subtracting the extra difference

# step 1
df1['uid'] = df1.groupby('id1', sort=False).ngroup().add(1)

# step 2
alt_uid = df1['id2'].map(df1.set_index('id1')['uid']).convert_dtypes()

# step 3
df1['uid'] = df1['uid'].mask(df1['score'].gt(80) & alt_uid.notna(), alt_uid)

# final uid
df1['uid'] -= df1['uid'].diff().sub(1).fillna(0).clip(lower=0)

output:

     id1   id2  score  uid
0   3def  hhyy    100    1
1   xyx1  drdr     90    2
2   roro  xyx1    100    2
3  p344o  xyx1     90    2
4  fjfje  xyx1     70    3
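
To inspect what each step does, the intermediate values can be kept in separate columns. The following is only a sketch of the steps above, assuming the sample frame from the question and unique id1 values; the names `step1`, `step2`, `step3`, `lookup` and `use_alt` are introduced here for illustration. Unlike the answer, `uid_step2` is left as floats, so missing matches print as NaN rather than <NA>.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id1": ["3def", "xyx1", "roro", "p344o", "fjfje"],
    "id2": ["hhyy", "drdr", "xyx1", "xyx1", "xyx1"],
    "score": [100, 90, 100, 90, 70],
})

# step 1: one provisional uid per id1, numbered in order of appearance
step1 = df1.groupby("id1", sort=False).ngroup().add(1)

# step 2: uid of the row whose id1 equals this row's id2 (NaN if there is none);
# assumes id1 values are unique, otherwise the lookup is ambiguous
lookup = pd.Series(step1.to_numpy(), index=df1["id1"])
step2 = df1["id2"].map(lookup)

# step 3: adopt the looked-up uid only when score > 80 and a match exists
use_alt = df1["score"].gt(80) & step2.notna()
step3 = step1.mask(use_alt, step2).astype(int)

# final: subtract the extra difference on the row where the uid jumps by more than 1
final = step3 - step3.diff().sub(1).fillna(0).clip(lower=0).astype(int)

steps = df1.assign(uid=final, uid_step1=step1, uid_step2=step2, uid_step3=step3)
print(steps)
```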

output with intermediary steps:

     id1   id2  score  uid  uid_step1  uid_step2  uid_step3
0   3def  hhyy    100    1          1       <NA>          1
1   xyx1  drdr     90    2          2       <NA>          2
2   roro  xyx1    100    2          3          2          2
3  p344o  xyx1     90    2          4          2          2
4  fjfje  xyx1     70    3          5          2          5
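
One caveat on the final step: the diff-based subtraction lowers only the row on which the jump occurs, so if more rows follow a gap, later uids can remain out of sequence. A possible alternative, which is not part of the answer above, is to renumber with `pd.factorize`, which assigns codes in order of first appearance. The sample values below are made up for illustration.

```python
import pandas as pd

# hypothetical uids after step 3: a gap (2 -> 5) followed by another row
uid_step3 = pd.Series([1, 2, 2, 2, 5, 6])

# diff-based correction as in the answer: only the row at the jump is lowered
diff_fixed = uid_step3 - uid_step3.diff().sub(1).fillna(0).clip(lower=0).astype(int)

# factorize-based renumbering: codes follow the order of first appearance
codes, _ = pd.factorize(uid_step3)
renumbered = pd.Series(codes + 1, index=uid_step3.index)

print(diff_fixed.tolist())   # [1, 2, 2, 2, 3, 6]
print(renumbered.tolist())   # [1, 2, 2, 2, 3, 4]
```

Note that `pd.factorize` gives equal values the same code even when they are not adjacent, which matches the idea of a shared key.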

CodePudding user response:

I think it's easier to create a custom function:

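def func(data):
    # idx is the running key; cur_id is the id1 of the row that opened the current group
    idx, cur_id, result = 0, None, []
    for x, y, z in data[["id1", "id2", "score"]].values:
        # start a new group unless this row's id2 matches cur_id and its score is above 80
        if not (cur_id == y and z > 80):
            cur_id = x
            idx += 1
        result.append(idx)
    return result

df1["new"] = func(df1)

print(df1)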

     id1   id2  score  new
0   3def  hhyy    100    1
1   xyx1  drdr     90    2
2   roro  xyx1    100    2
3  p344o  xyx1     90    2
4  fjfje  xyx1     70    3
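
To check the loop's logic without pandas, the same idea can be run on plain tuples. This is a minimal sketch; `assign_keys` and its `threshold` parameter are names introduced here, not part of the answer.

```python
def assign_keys(rows, threshold=80):
    """Give consecutive rows a shared key while their id2 matches the group's anchor id1."""
    key, anchor_id, keys = 0, None, []
    for id1, id2, score in rows:
        # a row joins the current group only if its id2 equals the anchor's id1
        # and its score is above the threshold; otherwise it opens a new group
        if not (id2 == anchor_id and score > threshold):
            anchor_id = id1
            key += 1
        keys.append(key)
    return keys


rows = [
    ("3def", "hhyy", 100),
    ("xyx1", "drdr", 90),
    ("roro", "xyx1", 100),
    ("p344o", "xyx1", 90),
    ("fjfje", "xyx1", 70),
]
print(assign_keys(rows))  # [1, 2, 2, 2, 3]
```

Because the loop only compares each row with the most recent anchor, it relies on linked rows being adjacent, as they are in the sample data.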