comparing contents of 2 columns in 1 data frame and outputting row

Time:06-14

I have the following DataFrame in pandas and want to write a statement that compares person-name with new-person and, for the rows that match, prints identifier, person-name, new-identifier, and new-person.

identifier        person-name    type     name    new-identifier       new-person     new-type  new-name
(hockey, player)  sidney crosby  athlete  sidney  (pittsburg, player)  jane sidney    player    SC
(hockey, player)  sidney crosby  athlete  sidney  (pittsburg, player)  crosby sidney  player    MS
(hockey, player)  wayne gretzky  athlete  wayne   (oilers, player)     gretzky-wayne  player    WG
(hockey, player)  wayne gretzky  athlete  wayne   (oilers, player)     gretzky-wayne  player    TP

Basically, I need to find rows where the same name appears in both columns regardless of word order, such as sidney crosby and crosby sidney. I guess the logic would be: if person-name is sidney crosby and new-person is crosby sidney, the output would be:

identifier        person-name    new-identifier       new-person
(hockey, player)  sidney crosby  (pittsburg, player)  crosby sidney
df['person-name'].equals(df['new-person'])

wouldn't work, since that compares the two columns as a whole rather than their contents row by row. How can I compare the contents of those 2 columns and print the 4 columns?

CodePudding user response:

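Here is one way to do it.

Split each name in 'person-name' and 'new-person' into words, sort the words, and then compare the sorted names. Hyphens in 'new-person' are replaced with spaces first, so that gretzky-wayne can match wayne gretzky.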

def sort_words(name):
    # sort the words so that word order no longer matters
    return ' '.join(sorted(name.split(' ')))

df[
    df['person-name'].apply(sort_words)
    .eq(df['new-person'].replace(r'-', ' ', regex=True).apply(sort_words))
][['identifier', 'person-name', 'new-identifier', 'new-person']]

    identifier       person-name    new-identifier       new-person
1   (hockey, player) sidney crosby  (pittsburg, player)  crosby sidney
2   (hockey, player) wayne gretzky  (oilers, player)     gretzky-wayne
3   (hockey, player) wayne gretzky  (oilers, player)     gretzky-wayne
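
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds the relevant columns by hand from the sample rows in the question; storing the identifier values as plain strings is an assumption, and the helper name `name_key` is made up for illustration. Unlike the snippet above, it replaces hyphens in both columns, which makes no difference for this data.

```python
import pandas as pd

# Sample rows copied from the question (identifiers kept as plain strings: an assumption)
df = pd.DataFrame({
    'identifier': ['(hockey, player)'] * 4,
    'person-name': ['sidney crosby', 'sidney crosby', 'wayne gretzky', 'wayne gretzky'],
    'new-identifier': ['(pittsburg, player)'] * 2 + ['(oilers, player)'] * 2,
    'new-person': ['jane sidney', 'crosby sidney', 'gretzky-wayne', 'gretzky-wayne'],
})

def name_key(name):
    # Hypothetical helper: turn hyphens into spaces, then sort the words,
    # so 'gretzky-wayne' and 'wayne gretzky' produce the same key
    return ' '.join(sorted(name.replace('-', ' ').split()))

# Row-wise comparison of the normalized names
mask = df['person-name'].map(name_key) == df['new-person'].map(name_key)
matches = df.loc[mask, ['identifier', 'person-name', 'new-identifier', 'new-person']]
print(matches)
```

Building the boolean mask separately from the column selection makes it easier to inspect which rows matched before printing them.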

To keep only the unique rows, assign the result to another DataFrame and then keep only the non-duplicated rows:

def sort_words(name):
    # sort the words so that word order no longer matters
    return ' '.join(sorted(name.split(' ')))

df2 = df[
    df['person-name'].apply(sort_words)
    .eq(df['new-person'].replace(r'-', ' ', regex=True).apply(sort_words))
][['identifier', 'person-name', 'new-identifier', 'new-person']]

# keep the first occurrence of each duplicated row
df2[~df2.duplicated(keep='first')]

    identifier       person-name    new-identifier      new-person
1   (hockey, player) sidney crosby  (pittsburg, player) crosby sidney
2   (hockey, player) wayne gretzky  (oilers, player)    gretzky-wayne
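
As a side note, `DataFrame.drop_duplicates()` should give the same result as the boolean-mask form. The sketch below uses a small stand-in `df2` with made-up contents that mirror two columns of the output above, just to show that the two spellings agree.

```python
import pandas as pd

# Stand-in for df2 (illustrative data mirroring the matched rows above)
df2 = pd.DataFrame(
    {
        'person-name': ['sidney crosby', 'wayne gretzky', 'wayne gretzky'],
        'new-person': ['crosby sidney', 'gretzky-wayne', 'gretzky-wayne'],
    },
    index=[1, 2, 3],
)

# Mask form: drop rows flagged as repeats of an earlier row
via_mask = df2[~df2.duplicated(keep='first')]

# Method form: same idea in a single call
via_method = df2.drop_duplicates(keep='first')
print(via_method)
```

Both forms keep the original index labels, so the surviving rows are still labelled 1 and 2.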