I have a file say file1.rule
, which has even number of rows, the last column of that file represent fitness
and the second last column represent the class
. I want to pair the rows class wise(first picks the row with the highest fitness then a random one from the remaining), with just one condition that no two identical rows can form a pair. In my file, no two exactly identical row for a class can occur more than n/2
times where n
is the number of rows for that particular class.
Below is my file:
*,*,*,1,0,1.0
*,*,1,*,0,0.22
*,*,2,2,1,0.71
*,*,2,2,1,0.71
*,2,2,*,1,0.64
*,2,2,*,1,0.64
1,*,*,3,2,0.95
*,*,3,2,2,0.66
*,*,3,4,2,0.67
3,*,*,*,2,0.33
3,*,*,*,2,0.33
3,*,*,*,2,0.33
And the code for this :
rule_file_name = "file1.rule"
from collections import defaultdict
list1 = []
with open(rule_file_name) as rule_fp:
for line in rule_fp.readlines():
list1.append(line.replace("\n","").split(","))
assert len(list1) & 1 == 0
classes = defaultdict(list)
for _list in list1:
classes[_list[4]].append(_list)
from random import sample, seed
seed(1)
for key, _list in classes.items():
assert len(_list) & 1 == 0
_list.sort(key=lambda x: x[5])
pairs = []
#while(len(_list)>2):
while _list:
#print(len(_list))
first = _list[-1]
candidate = sample(_list, 1)[0]
if first != candidate:
#print(f'first{first}, candidate{candidate}')
print(f'{first},{candidate}')
pairs.append((first, candidate))
_list.remove(first)
_list.remove(candidate)
classes[key] = pairs
The above code is working fine for class 0 and 1 and pairing is done but for class 2, the first 2 randomly chosen pairs are :
['1', '*', '*', '3', '2', '0.95'],['*', '*', '3', '2', '2', '0.66']
['*', '*', '3', '4', '2', '0.67'],['3', '*', '*', '*', '2', '0.33']
Now after these the remaining 2 rows of class 2 are: 3,*,*,*,2,0.33 and 3,*,*,*,2,0.33
which are identical so they can't form a pair and hence the while loop is running for infinite times.
According to my observation, this condition will only arrive when there are only last 2 rows left for any class, in this case, I simply want to discard those 2 rows. So I tried to replace the while condition writing: while(len(_list)>2):
, but in this case the last 2 will always be ignored even if they are completely different from each other. What to do?
Can I use any timer inside the while loop like below?
if some_condition or time.time() > timeout:
break
I tried to modify the code like this also:
while _list:
first = _list[-1]
_list.remove(first)
candidate = sample(_list, 1)[0]
if (len(_list)<=2) and first == candidate:
break
elif first != candidate:
#print(f'first{first}, candidate{candidate}')
print(f'{first},{candidate}')
pairs.append((first, candidate))
#_list.remove(first)
_list.remove(candidate)
classes[key] = pairs
But in this,when my file looks like below:
*,2,2,*,1,0.64
*,*,2,2,1,0.71
*,*,2,2,1,0.71
*,2,2,*,1,0.64
It is selecting ['*', '*', '2', '2', '1', '0.71'],['*', '2', '2', '*', '1', '0.64']
then I am getting error in candidate = sample(_list, 1)[0]
this line saying: ValueError: Sample larger than population or is negative
. Please help me out.
CodePudding user response:
This might be one direction to go in:
# ...
# construct the defaultdict classes as before.
from collections import Counter
from random import seed, choice
def attempt_partition_into_pairs(rows):
rows.sort(key=lambda x: x[5])
# convert to tuples so that the rows are hashable.
rows = [tuple(row) for row in rows]
# convert to a Counter so that we organize duplicates
counter = Counter(rows)
pairs = []
# ensure there are always at least two distinct rows
while len(counter) >= 2:
# choose and remove first
first = rows.pop()
counter[first] -= 1
if counter[first] == 0:
del counter[first]
# choose candidate
while True:
candidate = choice(rows)
if candidate != first:
break
counter[candidate] -= 1
if counter[candidate] == 0:
del counter[candidate]
rows.remove(candidate)
# output
assert candidate != first
print(f'{first},{candidate}')
pairs.append((first, candidate))
return pairs
seed(1)
for key, _list in classes.items():
assert len(_list) & 1 == 0
classes[key] = attempt_partition_into_pairs(_list)
This prints:
('*', '*', '*', '1', '0', '1.0'),('*', '*', '1', '*', '0', '0.22')
('*', '*', '2', '2', '1', '0.71'),('*', '2', '2', '*', '1', '0.64')
('*', '*', '2', '2', '1', '0.71'),('*', '2', '2', '*', '1', '0.64')
('1', '*', '*', '3', '2', '0.95'),('3', '*', '*', '*', '2', '0.33')
('*', '*', '3', '4', '2', '0.67'),('3', '*', '*', '*', '2', '0.33')
('*', '*', '3', '2', '2', '0.66'),('3', '*', '*', '*', '2', '0.33')
I've taken the liberty to factor out that function. Using a collections.Counter
helps keeps track of when there are duplicates. Also you can you random.choice()
rather than random.sample(..., 1)
.
This technically retains some O(n^2)
behavior at the rows.remove(candidate)
call, so it may be a bit slow if your input file is huge, but certainly no slower than your original attempt. You can probably fix that with some more cleverness, if need be.