This a puzzling one. I am refactoring some code, so key-objective is/was to keep things simple.
The piece of code is supposed to locate existing reciprocal pairs in the date, so I can process them (in another function). It has to return the location of reciprocals, the lines of the df where they occur. Reciprocal meaning (a,b) == (b,a).
Explaining my scenario: I did my best to recreate the situation in plain data below. The original data set is indeed a list of all possible permutations of "n" elements in pairs. At that point in the code (a,b) could be a different thing vs. (b,a) hence I need to keep them there and work with all of them.
I used "random" to simulate the fact that at the point of the code where I am, some of the reciprocals have been already eliminated thru other filters/processing.
Then the function I am working in now, "reciprocals_locator", should point me to the rows of the the dataframe where reciprocals still occur and need to be further processed.
Original solution, which I am trying to refactor, is a complicated loop that manipulated the dataframe way within the loop. All of which I did not like to have in my final code. I was able to refactor that loop to create the list I need and then refactor that into a list comprehension, which is better, but nonetheless still quite unreadable.
So here goes question (1): Is there a better way/algorithm or even external/imported function to do the trick in a leaner way? Now, the above solution is somewhat incomplete in the sense it returns a list which also contains reciprocal-duplicates of the locations!
As shown below, I end up with knowing that there are duplicates in rows (1,5) but it also shows the reciprocal solution (5,1)! I initially thought this was a simple case of "pick the half" of the resulting list, but that won't do the trick probably because of the way itertools.product
works, which frankly I haven't deeply studied.
Same happens on my original, looping, code implementation. Maybe there's a simpler, itertools-based
solution to this I am now aware of.
**Hope all code below is simple enough to follow
### Problem scenario recreation
import itertools as it
import pandas as pd
import random as rd
## Original dataset
sourcelst = ['a','b','c','d','e']
pairs = [list(perm) for perm in it.permutations(sourcelst,2)]
df = pd.DataFrame(pairs, columns=['y','x'])
## Dataset after some sort of processing, some reciprocals are eliminated,
## but some stay and we nee to process them separately
drop_rows = [rd.randint(0, len(pairs)-1) for _ in range(2)]
df.drop(index=drop_rows, inplace=True)
df.reset_index(inplace=True, drop=True)
## Finding reciprocals
### Original LOOPS implementation, this one shows the actual pairs
### for ease of analysis
def reciprocal_pairs_orig(df):
reciprocals = []
row_y = 0
while row_y < len(df.index):
for row_x in range(len(df.index)):
if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y]):
reciprocals.append([[df['y'].iloc[row_x], df['x'].iloc[row_x]],[df['y'].iloc[row_y], df['x'].iloc[row_y]]])
row_y = 1
return reciprocals
### List comprehension refactor, showing the pairs
def reciprocal_pairs_refactor(df):
return [[[df['y'].iloc[row_x], df['x'].iloc[row_x]], [df['y'].iloc[row_y], df['x'].iloc[row_y]]]
for row_y, row_x in it.product(range(len(df.index)), range(len(df.index)))
if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y])]
### This is the actual function that I want to use
def reciprocal_locator_orig(df):
reciprocals = []
row_y = 0
while row_y < len(df.index):
for row_x in range(len(df.index)):
if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y]):
reciprocals.append([row_y, row_x])
row_y = 1
return reciprocals
### List comprehension refactor
def reciprocal_locator_refactor(df):
return [[row_y, row_x]
for row_y, row_x in it.product(range(len(df.index)), range(len(df.index)))
if (df['x'].iloc[row_x] == df['y'].iloc[row_y]) and (df['y'].iloc[row_x] == df['x'].iloc[row_y])]
CodePudding user response:
product(range(n),range(n))
will give you n**2
pairs - for instance, as you noted, both (1,5)
and (5,1)
, so it's not a good solution. Your original code, with the nested for
loops, does the same.
A first improvement would be
for x in range(n-1):
for y in range(x,n):
...
This would still be O(n2), but at least it would check each possible pairing only once.
But a different approach may be more efficient: loop over the rows only once, while maintaining a dict
of frozenset
s (normal set
s are not hashable) containing the pairs you already met as keys, and the row numbers as values . For each row you simply check if the pair is already in the dict
- then it's a "reciprocal" between the current line and the line whose number is stored in the dict
- else you add the new pair, and so on. Soething like:
seen = {}
for row in range(len(df.index)):
fs = frozenset(df['x'].iloc[row], df['y'].iloc[row])
if fs in seen:
reciprocals.append((seen[fs],row))
else:
seen[fs] = row