Comparing contents of 2 columns in 1 data frame and outputting row

I have the following dataframe in pandas and want to write a statement that compares person-name with new-person and prints identifier, person-name, new-identifier, and new-person:

identifier	person-name	type	name	new-identifier	new-person	new-type	new-name
(hockey, player)	sidney crosby	athlete	sidney	(pittsburg, player)	jane sidney	player	SC
(hockey, player)	sidney crosby	athlete	sidney	(pittsburg, player)	crosby sidney	player	MS
(hockey, player)	wayne gretzky	athlete	wayne	(oilers, player)	gretzky-wayne	player	WG
(hockey, player)	wayne gretzky	athlete	wayne	(oilers, player)	gretzky-wayne	player	TP

Basically, I need to find sidney crosby and crosby sidney in the same data frame. I guess the logic would be: if person-name is sidney crosby and new-person is crosby sidney, the output would be:

identifier	person-name	new-identifier	new-person
(hockey, player)	sidney crosby	(pittsburg, player)	crosby sidney

 df['person-name'].equals(df['new-person'])

wouldn't work, since it compares the two columns as a whole rather than their contents row by row. How can I compare the contents of those 2 columns and print the 4 columns?

CodePudding user response：

Here is one way to do it:

Split each name into words and sort them, for both 'person-name' and 'new-person' (replacing hyphens in 'new-person' with spaces first), and then compare the sorted names:

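df[
    # sort the words of each 'person-name' so that word order no longer matters
    df['person-name']
    .apply(lambda x: ' '.join(sorted(x.split(' '))))
    .eq(
        # treat hyphens in 'new-person' as spaces, then sort its words the same way
        df['new-person']
        .replace(r'-', ' ', regex=True)
        .apply(lambda x: ' '.join(sorted(x.split(' '))))
    )
][['identifier', 'person-name', 'new-identifier', 'new-person']]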


    identifier       person-name    new-identifier       new-person
1   (hockey, player) sidney crosby  (pittsburg, player)  crosby sidney
2   (hockey, player) wayne gretzky  (oilers, player)     gretzky-wayne
3   (hockey, player) wayne gretzky  (oilers, player)     gretzky-wayne
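
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds only the four relevant columns from the question's sample data, and it assumes the identifiers are stored as plain strings. The helper name `name_key` is made up for illustration and is not part of pandas.

```python
import pandas as pd

# Sample data copied from the question (only the four columns of interest).
# Assumption: the identifier columns hold plain strings.
df = pd.DataFrame({
    'identifier': ['(hockey, player)'] * 4,
    'person-name': ['sidney crosby', 'sidney crosby',
                    'wayne gretzky', 'wayne gretzky'],
    'new-identifier': ['(pittsburg, player)'] * 2 + ['(oilers, player)'] * 2,
    'new-person': ['jane sidney', 'crosby sidney',
                   'gretzky-wayne', 'gretzky-wayne'],
})


def name_key(name):
    """Hypothetical helper: build an order-insensitive key for a name."""
    # Treat hyphens as spaces, split on whitespace, and sort the words.
    return ' '.join(sorted(name.replace('-', ' ').split()))


# Boolean mask: True where both columns hold the same words in any order.
mask = df['person-name'].map(name_key) == df['new-person'].map(name_key)

# Keep the matching rows and only the four requested columns.
matches = df.loc[mask, ['identifier', 'person-name',
                        'new-identifier', 'new-person']]
print(matches)
```

Unlike the snippet above, this sketch also replaces hyphens in 'person-name', which makes no difference for the sample data but keeps both columns normalized the same way.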

To keep only the unique rows, assign the result to another DataFrame and then keep only the non-duplicated rows:

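# same filter as before, stored in df2 so that duplicates can be removed afterwards
df2 = df[
    df['person-name']
    .apply(lambda x: ' '.join(sorted(x.split(' '))))
    .eq(
        df['new-person']
        .replace(r'-', ' ', regex=True)
        .apply(lambda x: ' '.join(sorted(x.split(' '))))
    )
][['identifier', 'person-name', 'new-identifier', 'new-person']]

# keep the first occurrence of each row and drop the later repeats
df2[~df2.duplicated(keep='first')]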

    identifier       person-name    new-identifier      new-person
1   (hockey, player) sidney crosby  (pittsburg, player) crosby sidney
2   (hockey, player) wayne gretzky  (oilers, player)    gretzky-wayne
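
As a side note, the same deduplication can probably be written more briefly with `DataFrame.drop_duplicates`, which by default keeps the first occurrence of each fully identical row. The sketch below uses a hand-built stand-in for `df2` (again assuming string identifiers) so that it runs on its own.

```python
import pandas as pd

# Hand-built stand-in for the filtered result df2 from the answer.
# Assumption: the identifier columns hold plain strings.
df2 = pd.DataFrame(
    {
        'identifier': ['(hockey, player)'] * 3,
        'person-name': ['sidney crosby', 'wayne gretzky', 'wayne gretzky'],
        'new-identifier': ['(pittsburg, player)',
                           '(oilers, player)', '(oilers, player)'],
        'new-person': ['crosby sidney', 'gretzky-wayne', 'gretzky-wayne'],
    },
    index=[1, 2, 3],
)

# drop_duplicates keeps the first occurrence of each identical row by default.
unique_rows = df2.drop_duplicates()

# Sanity check: this selects the same rows as the boolean-mask form.
same_as_mask = unique_rows.equals(df2[~df2.duplicated(keep='first')])
print(unique_rows)
```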