I have the following dataframe in pandas and want to write a statement that compares person-name with new-person and prints identifier, person-name, new-identifier, and new-person.
identifier | person-name | type | name | new-identifier | new-person | new-type | new-name |
---|---|---|---|---|---|---|---|
(hockey, player) | sidney crosby | athlete | sidney | (pittsburg, player) | jane sidney | player | SC |
(hockey, player) | sidney crosby | athlete | sidney | (pittsburg, player) | crosby sidney | player | MS |
(hockey, player) | wayne gretzky | athlete | wayne | (oilers, player) | gretzky-wayne | player | WG |
(hockey, player) | wayne gretzky | athlete | wayne | (oilers, player) | gretzky-wayne | player | TP |
Basically, I need to find rows where person-name and new-person hold the same name in a different word order, such as sidney crosby and crosby sidney. I guess the logic would be: if person-name = sidney crosby and new-person = crosby sidney, the output would be:
identifier | person-name | new-identifier | new-person |
---|---|---|---|
(hockey, player) | sidney crosby | (pittsburg, player) | crosby sidney |
df['person-name'].equals(df['new-person'])
wouldn't work, since it tests whether the two columns are identical as a whole rather than comparing their contents row by row. How can I compare the contents of those two columns and print the four columns?
CodePudding user response:
Here is one way to do it.
Replace the hyphens in 'new-person' with spaces, split each name in 'person-name' and 'new-person' into words, sort the words, and then compare the sorted names:
df[
    df['person-name']
    .apply(lambda x: ' '.join(sorted(x.split(' '))))
    .eq(
        df['new-person']
        .replace(r'-', ' ', regex=True)
        .apply(lambda x: ' '.join(sorted(x.split(' '))))
    )
][['identifier', 'person-name', 'new-identifier', 'new-person']]
         identifier    person-name       new-identifier     new-person
1  (hockey, player)  sidney crosby  (pittsburg, player)  crosby sidney
2  (hockey, player)  wayne gretzky     (oilers, player)  gretzky-wayne
3  (hockey, player)  wayne gretzky     (oilers, player)  gretzky-wayne
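The same comparison can also be written as a self-contained sketch, which may be easier to test and reuse. The helper name `name_key` is made up for this example, and the sample frame assumes the identifier columns hold plain strings (the question does not say whether they are strings or tuples). Unlike the expression above, this sketch replaces hyphens in both columns and splits on any run of whitespace.

```python
import pandas as pd

# Sample data copied from the question's table (only the four columns used here).
# Assumption: the identifier columns are plain strings.
df = pd.DataFrame({
    'identifier': ['(hockey, player)'] * 4,
    'person-name': ['sidney crosby', 'sidney crosby',
                    'wayne gretzky', 'wayne gretzky'],
    'new-identifier': ['(pittsburg, player)'] * 2 + ['(oilers, player)'] * 2,
    'new-person': ['jane sidney', 'crosby sidney',
                   'gretzky-wayne', 'gretzky-wayne'],
})


def name_key(name):
    """Hypothetical helper: build an order-insensitive key for a name.

    Hyphens are treated as spaces, then the words are sorted and re-joined,
    so 'gretzky-wayne' and 'wayne gretzky' produce the same key.
    """
    return ' '.join(sorted(name.replace('-', ' ').split()))


# Element-wise comparison of the normalized names gives a boolean mask.
mask = df['person-name'].map(name_key) == df['new-person'].map(name_key)

cols = ['identifier', 'person-name', 'new-identifier', 'new-person']
matches = df.loc[mask, cols]
print(matches)
```

Row 0 ('jane sidney') is dropped because its sorted words differ from those of 'sidney crosby'; rows 1 to 3 are kept.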
To keep only the unique rows, assign the result to another DataFrame and then drop the duplicated rows:
df2 = df[
    df['person-name']
    .apply(lambda x: ' '.join(sorted(x.split(' '))))
    .eq(
        df['new-person']
        .replace(r'-', ' ', regex=True)
        .apply(lambda x: ' '.join(sorted(x.split(' '))))
    )
][['identifier', 'person-name', 'new-identifier', 'new-person']]
df2[~df2.duplicated(keep='first')]
         identifier    person-name       new-identifier     new-person
1  (hockey, player)  sidney crosby  (pittsburg, player)  crosby sidney
2  (hockey, player)  wayne gretzky     (oilers, player)  gretzky-wayne
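As a side note, the `~df2.duplicated(keep='first')` mask should give the same rows as `DataFrame.drop_duplicates()`, whose default is also `keep='first'`. The sketch below checks this on a small stand-in for `df2`; the frame is a toy with only two of the four columns, built just for this illustration.

```python
import pandas as pd

# Toy stand-in for df2: the rows labelled 2 and 3 are identical.
df2 = pd.DataFrame(
    {
        'person-name': ['sidney crosby', 'wayne gretzky', 'wayne gretzky'],
        'new-person': ['crosby sidney', 'gretzky-wayne', 'gretzky-wayne'],
    },
    index=[1, 2, 3],
)

# Boolean-mask form used in the answer.
via_mask = df2[~df2.duplicated(keep='first')]

# Method form; keep='first' is the default.
via_method = df2.drop_duplicates()

print(via_method)
```

Both forms keep the first occurrence of each duplicated row and preserve the original index labels.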