Pairing rows of a file almost randomly-CodePudding

I have a file say file1.rule, which has even number of rows, the last column of that file represent fitness and the second last column represent the class. I want to pair the rows class wise(first picks the row with the highest fitness then a random one from the remaining), with just one condition that no two identical rows can form a pair. In my file, no two exactly identical row for a class can occur more than n/2 times where n is the number of rows for that particular class.

Below is my file:

*,*,*,1,0,1.0
*,*,1,*,0,0.22
*,*,2,2,1,0.71
*,*,2,2,1,0.71
*,2,2,*,1,0.64
*,2,2,*,1,0.64
1,*,*,3,2,0.95
*,*,3,2,2,0.66
*,*,3,4,2,0.67
3,*,*,*,2,0.33
3,*,*,*,2,0.33
3,*,*,*,2,0.33

And the code for this :

rule_file_name = "file1.rule"
from collections import defaultdict
list1 = []

with open(rule_file_name) as rule_fp:
    for line in rule_fp.readlines():
        list1.append(line.replace("\n","").split(","))

assert len(list1) & 1 == 0
classes = defaultdict(list)
for _list in list1:
    classes[_list[4]].append(_list)

    
from random import sample, seed
seed(1)
for key, _list in classes.items():
    assert len(_list) & 1 == 0
    _list.sort(key=lambda x: x[5])
    pairs = []
    #while(len(_list)>2):
    while _list:
        #print(len(_list))
        first = _list[-1]
        candidate = sample(_list, 1)[0]
        if first != candidate:
            #print(f'first{first}, candidate{candidate}')
            print(f'{first},{candidate}')
            pairs.append((first, candidate))
            _list.remove(first)
            _list.remove(candidate)
    classes[key] = pairs

The above code is working fine for class 0 and 1 and pairing is done but for class 2, the first 2 randomly chosen pairs are :

['1', '*', '*', '3', '2', '0.95'],['*', '*', '3', '2', '2', '0.66']
['*', '*', '3', '4', '2', '0.67'],['3', '*', '*', '*', '2', '0.33']

Now after these the remaining 2 rows of class 2 are: 3,*,*,*,2,0.33 and 3,*,*,*,2,0.33 which are identical so they can't form a pair and hence the while loop is running for infinite times.

According to my observation, this condition will only arrive when there are only last 2 rows left for any class, in this case, I simply want to discard those 2 rows. So I tried to replace the while condition writing: while(len(_list)>2): , but in this case the last 2 will always be ignored even if they are completely different from each other. What to do?

Can I use any timer inside the while loop like below?

if some_condition or time.time() > timeout:
        break

I tried to modify the code like this also:

while _list:
    first = _list[-1]
    _list.remove(first)
    candidate = sample(_list, 1)[0]
    if (len(_list)<=2) and first == candidate:
        break
    elif first != candidate:
        #print(f'first{first}, candidate{candidate}')
        print(f'{first},{candidate}')
        pairs.append((first, candidate))
        #_list.remove(first)
        _list.remove(candidate)
classes[key] = pairs

But in this,when my file looks like below:

*,2,2,*,1,0.64
*,*,2,2,1,0.71
*,*,2,2,1,0.71
*,2,2,*,1,0.64

It is selecting ['*', '*', '2', '2', '1', '0.71'],['*', '2', '2', '*', '1', '0.64'] then I am getting error in candidate = sample(_list, 1)[0] this line saying: ValueError: Sample larger than population or is negative. Please help me out.

CodePudding user response：

This might be one direction to go in:

# ...
# construct the defaultdict classes as before.

from collections import Counter
from random import seed, choice


def attempt_partition_into_pairs(rows):
    rows.sort(key=lambda x: x[5])
    # convert to tuples so that the rows are hashable.
    rows = [tuple(row) for row in rows]
    # convert to a Counter so that we organize duplicates
    counter = Counter(rows)

    pairs = []

    # ensure there are always at least two distinct rows
    while len(counter) >= 2:
        # choose and remove first
        first = rows.pop()
        counter[first] -= 1
        if counter[first] == 0:
            del counter[first]

        # choose candidate
        while True:
            candidate = choice(rows)
            if candidate != first:
                break
        counter[candidate] -= 1
        if counter[candidate] == 0:
            del counter[candidate]
        rows.remove(candidate)

        # output
        assert candidate != first
        print(f'{first},{candidate}')
        pairs.append((first, candidate))

    return pairs


seed(1)
for key, _list in classes.items():
    assert len(_list) & 1 == 0
    classes[key] = attempt_partition_into_pairs(_list)

This prints:

('*', '*', '*', '1', '0', '1.0'),('*', '*', '1', '*', '0', '0.22')
('*', '*', '2', '2', '1', '0.71'),('*', '2', '2', '*', '1', '0.64')
('*', '*', '2', '2', '1', '0.71'),('*', '2', '2', '*', '1', '0.64')
('1', '*', '*', '3', '2', '0.95'),('3', '*', '*', '*', '2', '0.33')
('*', '*', '3', '4', '2', '0.67'),('3', '*', '*', '*', '2', '0.33')
('*', '*', '3', '2', '2', '0.66'),('3', '*', '*', '*', '2', '0.33')

I've taken the liberty to factor out that function. Using a collections.Counter helps keeps track of when there are duplicates. Also you can you random.choice() rather than random.sample(..., 1).

This technically retains some O(n^2) behavior at the rows.remove(candidate) call, so it may be a bit slow if your input file is huge, but certainly no slower than your original attempt. You can probably fix that with some more cleverness, if need be.