I have a frame that looks like this:
x y score
Cat Dog 0.1
Dog Monkey 0.5
Fish Cat 0.2
Cat Fish 0.2
Monkey Dog 0.4
Dog Cat 0.7
Basically I want a new boolean column that looks at all pairs and their reverse, for example (Cat, Dog) and (Dog,Cat) and assign a True for when the score of a pair is greater than the score of the reversed pair, otherwise False. If there is equality, both pairs are assigned True. That is:
x y score bool
Cat Dog 0.1 False # False because Dog,Cat > Cat,Dog
Dog Monkey 0.5 True # True because Dog,Monkey > Monkey, Dog
Fish Cat 0.2 True # True because both have the same score
Cat Fish 0.2 True # ...
Monkey Dog 0.4 False
Dog Cat 0.7 True
I am certain I can get a function done by applying on the rows a filter for the reverse and then return a checks on the score; however, the list can be quite long (n>100k) and time is a factor. It turns out this is trickier than I thought. I was wondering if this can be done in a more pythonic way with a magic Pandas function that haven't encoutered yet or a rolling window.
Note:
- Every pair (x,y) is unique.
- For every pair (x,y) there is exactly one reverse (y,x).
- There can be multiple occurences of the same category within the same column, e.g. Cat appears twice in this frame for x.
- Scores range from 0-1.
Frame:
df = pd.DataFrame.from_records(zip(["Cat","Dog","Fish","Cat","Monkey","Dog"], ["Dog","Monkey","Cat","Fish","Dog","Cat"], [0.1,0.5,0.2,0.2,0.4,0.7]),columns=["x","y","score"])
CodePudding user response:
You can swap columns x
& y
, merge on x/y pair, and compare score vs reversed score:
(df.merge(
df.assign(x = df.y, y = df.x)
.rename(columns={'score': 'rev_score'})
).assign(bool = lambda x: x.score >= x.rev_score)
.drop('rev_score', axis=1))
x y score bool
0 Cat Dog 0.1 False
1 Dog Monkey 0.5 True
2 Fish Cat 0.2 True
3 Cat Fish 0.2 True
4 Monkey Dog 0.4 False
5 Dog Cat 0.7 True