Pairing elements of a list of lists and storing them in tuple form - CodePudding

I have a file, say file1.txt, with multiple rows and columns. I want to read it and store it as a list of lists, then pair up the rows under the rule that two identical rows cannot be in the same pair. The second-to-last column represents the class. Below is my file:

27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92

Here all 6 rows are class 1. I am using the logic below to do the pairing.

from operator import itemgetter

rule_file_name = 'file1.txt'
with open(rule_file_name) as rule_fp:
    # one list of column strings per line of the file
    list1 = [line.rstrip("\n").split(",") for line in rule_fp]

list1=sorted(list1,key=itemgetter(-1),reverse=True)

length = len(list1)
middle_index = length // 2
first_half = list1[:middle_index]
second_half = list1[middle_index:]
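# pair the i-th row of each half, then drop pairs of identical rows
# (removing from a list while iterating over it can skip elements)
result = [(a, b) for a, b in zip(first_half, second_half) if a != b]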

print(result)
print("-------------------")

This works fine when there is only one class. But if my file has multiple classes, I want rows to be paired only within the same class. For example, if my file looks like the one below (say file2):

27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97

Then I want to make 3 pairs from class 1, 2 from class 2 and 1 from class 3. I am using this logic to build a dictionary whose keys are the classes.

d = {}
sorted_grouped = []
for row in list1:
    # Create an empty list for this class the first time it is seen
    if row[-2] not in d:
        d[row[-2]] = []
    # Add the whole row to the list for its class
    d[row[-2]].append(row)
#print(d.items())

for k,v in d.items():
    sorted_grouped.append(v)
#print(sorted_grouped)

gp_vals = {}
for i in sorted_grouped:
    gp_vals[i[0][-2]] = i
print(gp_vals)

How can I do this? Please help!

My desired output for file2 is:

[([43,44,45,46,1,0.92], [39,40,41,42,1,0.82]), ([43,44,45,46,1,0.92], [27,28,29,30,1,0.67]), ([31,32,33,34,1,0.84], [35,36,37,38,1,0.45])] [([55,56,57,58,2,0.77], [59,60,61,62,2,0.39]), ([63,64,65,66,2,0.41], [51,52,53,54,2,0.28])] [([90,91,92,93,3,0.97], [75,76,77,78,3,0.51])]

Edit1:

All files will have an even number of rows, and every class will have an even number of rows as well.
For a particular class (say class 2) with n rows, there can be at most n/2 identical rows for that class in the dataset.
My primary intention was to get random pairing while making sure no self-pairing is allowed. For that I thought of taking the row with the highest fitness value (the last column) within a class, picking any other row from that class at random, and making a pair as long as the two rows are not exactly the same. The same thing is repeated for every class separately.

CodePudding user response:

First read in the data from the file. I'd use assert here to communicate your assumptions to people who read the code (including future you) and to confirm that each assumption actually holds for the file. If it does not, an AssertionError is raised.

rule_file_name = 'file2.txt'
list1 = []
with open(rule_file_name) as rule_fp:
    for line in rule_fp.readlines():
        list1.append(line.replace("\n","").split(","))

assert len(list1) & 1 == 0 # confirm length is even
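One thing to watch in the reading step: an empty line in the file becomes a `['']` row, which would throw off the even-length check. A minimal sketch of an alternative, assuming the standard `csv` module is acceptable here; the function name `read_rows` is hypothetical and not part of the code above:

```python
import csv

def read_rows(path):
    """Read a comma-separated file into a list of lists of strings.

    Hypothetical helper: csv.reader yields an empty list for a blank
    line, so `if row` drops those lines.
    """
    with open(path, newline="") as fp:
        return [row for row in csv.reader(fp) if row]
```

With this, `list1 = read_rows(rule_file_name)` could stand in for the loop, and the assert still checks the row count.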

Then use a defaultdict to store the lists for each class.

from collections import defaultdict

classes = defaultdict(list)
for _list in list1:
    classes[_list[4]].append(_list)

Then use sample to draw pairs and confirm they aren't the same. Here I'm including a seed to make the results reproducible but you can take that out for randomness.

from random import sample, seed

seed(1) # remove this line when you want actual randomness
for key, _list in classes.items():
    assert len(_list) & 1 == 0 # each class must also be even, else the data has an error
    _list.sort(key=lambda x: float(x[5])) # ascending fitness, compared as numbers
    pairs = []
    while _list:
        first = _list[-1] # row with the highest fitness
        candidate = sample(_list, 1)[0]
        if first != candidate: # otherwise draw again
            print(f'first {first}, candidate {candidate}')
            pairs.append((first, candidate))
            _list.remove(first)
            _list.remove(candidate)
    classes[key] = pairs

Note that the way the sampling is done (as described in the edit) implicitly assumes that the duplicates are among the rows with the highest fitness values. If this is not true, the rows left over at the end may all be identical, and the while loop would then never terminate.
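One way to avoid that infinite loop, as a minimal sketch: draw the partner only from rows that differ from the highest-fitness row, and fail loudly if none is left. The function name `pair_rows` and its `rng` parameter are hypothetical, and the fitness is assumed to be the last column of each row.

```python
import random

def pair_rows(rows, rng=random):
    """Pair the rows of one class so that no pair holds two identical rows.

    Hypothetical helper: assumes each row is a list whose last item
    is the fitness value (as a string or a number).
    """
    remaining = sorted(rows, key=lambda r: float(r[-1]))
    pairs = []
    while remaining:
        first = remaining.pop()  # row with the highest fitness
        # positions of the rows that are not identical to `first`
        choices = [i for i, r in enumerate(remaining) if r != first]
        if not choices:
            raise ValueError(f"no non-identical partner left for {first}")
        partner = remaining.pop(rng.choice(choices))
        pairs.append((first, partner))
    return pairs
```

Every pass through the loop either removes two rows or raises, so it always terminates. It does not guarantee that every random draw leads to a complete pairing; it only turns the endless loop into an error that the caller can catch and retry.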

If you want to print them then iterate over the dictionary again:

for key, pairs in classes.items():
    print(key, pairs)

which for me gives:

1 [(['43', '44', '45', '46', '1', '0.92'], ['27', '28', '29', '30', '1', '0.67']), (['43', '44', '45', '46', '1', '0.92'], ['31', '32', '33', '34', '1', '0.84']), (['39', '40', '41', '42', '1', '0.82'], ['35', '36', '37', '38', '1', '0.45'])]
2 [(['55', '56', '57', '58', '2', '0.77'], ['51', '52', '53', '54', '2', '0.28']), (['63', '64', '65', '66', '2', '0.41'], ['59', '60', '61', '62', '2', '0.39'])]
3 [(['90', '91', '92', '93', '3', '0.97'], ['75', '76', '77', '78', '3', '0.51'])]
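The rows above are lists of strings because they come straight from `split(",")`, while the desired output in the question shows numbers. A small sketch of converting one row, assuming every column except the last is an integer and the last one is a float; the helper name `to_numbers` is hypothetical:

```python
def to_numbers(row):
    """Convert ['27', '28', ..., '1', '0.67'] to [27, 28, ..., 1, 0.67]."""
    return [int(v) for v in row[:-1]] + [float(row[-1])]

row = ['43', '44', '45', '46', '1', '0.92']
print(to_numbers(row))  # [43, 44, 45, 46, 1, 0.92]
```

If this is applied while reading the file, the class keys become integers (1 rather than '1') and the fitness column is compared as a number rather than as text.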

These are the values I used for file2.txt. The leading numbers are row numbers and are not part of the actual file.

 1 27,28,29,30,1,0.67
 2 31,32,33,34,1,0.84
 3 35,36,37,38,1,0.45
 4 39,40,41,42,1,0.82
 5 43,44,45,46,1,0.92
 6 43,44,45,46,1,0.92
 7 51,52,53,54,2,0.28
 8 55,56,57,58,2,0.77
 9 59,60,61,62,2,0.39
10 63,64,65,66,2,0.41
11 75,76,77,78,3,0.51
12 90,91,92,93,3,0.97