serial | name | match |
---|---|---|
1 | John | 5,6,8 |
2 | Steve | 1,7 |
3 | Kevin | 4 |
4 | Kevin | 3 |
5 | John | 1,6,8 |
6 | Johnn | 1,5,8 |
7 | Steves | 2 |
8 | John | 1,5,6 |
Need to check and match the name of each row with the name of the row serial number mentioned in it's match column. Keep the serials matching else remove it. If nothing is matching then put null.
serial | name | match | updated_match |
---|---|---|---|
1 | John | 5,6,8 | 5,8 |
2 | Steve | 1,7 | |
3 | Kevin | 4 | 4 |
4 | Kevin | 3 | 3 |
5 | John | 1,6,8 | 1,8 |
6 | Johnn | 1,5,8 | |
7 | Steves | 2 | |
8 | John | 1,5,6 | 1,5 |
CodePudding user response:
Convert values of serial
column to strings,then mapping aggregate set
s to Series wth same size like original, split column match
and get difference with intersection of sets, last sorting and join back to strings:
s = df['serial'].astype(str)
sets = df['name'].map(s.groupby(df['name']).agg(set))
match = df['match'].str.split(',')
df['updated_match'] = [','.join(sorted(b.difference([c]).intersection(a)))
for a, b, c in zip(match, sets, s)]
print (df)
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
CodePudding user response:
You can use mappings to determine which match have the same names:
# ensure we have strings in serial
df = df.astype({'serial': 'str'})
# split and explode he individual serials
s = df['match'].str.split(',').explode()
# make a mapper: serial -> name
mapper = df.set_index('serial')['name']
# get exploded names
s2 = df.loc[s.index, 'serial'].map(mapper)
# keep only match with matching names
# aggregate back as string
df['updated_match'] = (s[s.map(mapper).eq(s2)]
.groupby(level=0).agg(','.join)
.reindex(df.index, fill_value='')
)
output:
serial name match updated_match
0 1 John 5,6,8 5,8
1 2 Steve 1,7
2 3 Kevin 4 4
3 4 Kevin 3 3
4 5 John 1,6,8 1,8
5 6 Johnn 1,5,8
6 7 Steves 2
7 8 John 1,5,6 1,5
Alternative with a groupby.apply
:
df['updated_match'] = (df.groupby('name', group_keys=False)
.apply(lambda g: g['match'].str.split(',').explode()
.loc[lambda x: x.isin(g['serial'])]
.groupby(level=0).agg(','.join))
.reindex(df.index, fill_value='')
)