I have a file, say file1.txt, which has multiple rows and columns. I want to read it and store it as a list of lists. Then I want to pair the rows, with the rule that no two identical rows can be in the same pair. The second-to-last column represents the class. Below is my file:
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
Here all 6 rows are class 1. I am using the logic below to do the pairing.
from operator import itemgetter

rule_file_name = 'file1.txt'
list1 = []
with open(rule_file_name) as rule_fp:
    for line in rule_fp:
        list1.append(line.rstrip("\n").split(","))
# sort rows by the last column (fitness), highest first
list1 = sorted(list1, key=itemgetter(-1), reverse=True)
length = len(list1)
middle_index = length // 2
first_half = list1[:middle_index]
second_half = list1[middle_index:]
# pair the two halves, dropping any pair made of two identical rows
result = [(a, b) for a, b in zip(first_half, second_half) if a != b]
print(result)
print("-------------------")
It works fine when I have only one class. But if my file has multiple classes, I want the pairing to be done within the same class only. For example, if my file looks like the one below (say file2):
27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
Then I want to make 3 pairs from class 1, 2 from class 2 and 1 from class 3. I am using this logic to build a dictionary whose keys are the classes.
d = {}
sorted_grouped = []
for row in list1:
    # Create an empty list for this class if it is not in the dict yet
    if row[-2] not in d:
        d[row[-2]] = []
    # Add the whole row to the list for its class
    d[row[-2]].append(row)
#print(d.items())
for k, v in d.items():
    sorted_grouped.append(v)
#print(sorted_grouped)
gp_vals = {}
for i in sorted_grouped:
    gp_vals[i[0][-2]] = i
print(gp_vals)
Now how can I do the pairing per class? Please help!
My desired output for file2 is:
[([43,44,45,46,1,0.92], [39,40,41,42,1,0.82]), ([43,44,45,46,1,0.92], [27,28,29,30,1,0.67]), ([31,32,33,34,1,0.84], [35,36,37,38,1,0.45])] [([55,56,57,58,2,0.77], [59,60,61,62,2,0.39]), ([63,64,65,66,2,0.41], [51,52,53,54,2,0.28])] [([90,91,92,93,3,0.97], [75,76,77,78,3,0.51])]
Edit1:
All the files will have an even number of rows, and every class will have an even number of rows as well.
For a particular class (say class 2), if there are n rows, then there can be at most n/2 identical rows for that class in the dataset. My primary intention was to get random pairing while making sure no self-pairing is allowed. For that, I thought of taking the row with the highest fitness value (the last column) inside a class, picking any other row from that class at random, and making a pair, just making sure the two rows are not exactly the same. The same thing is repeated for every class separately.
CodePudding user response:
First read in the data from the file. I'd use assert here to communicate your assumptions to people who read the code (including future you) and to confirm that the assumptions actually hold for the file. If they don't, it will raise an AssertionError.
rule_file_name = 'file2.txt'
list1 = []
with open(rule_file_name) as rule_fp:
    for line in rule_fp.readlines():
        list1.append(line.replace("\n","").split(","))
assert len(list1) & 1 == 0 # confirm length is even
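The same idea extends to the other assumptions from the question's edit. As a rough sketch (the helper name check_class_assumptions is made up for illustration, and it assumes the class sits at column index 4 as in the sample file), each class could be checked for an even row count and for no single row making up more than half of the class:

```python
from collections import Counter, defaultdict


def check_class_assumptions(rows, class_index=4):
    """Hypothetical helper: assert the per-class assumptions from the edit."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_index]].append(row)
    for label, members in by_class.items():
        # every class must have an even number of rows
        assert len(members) % 2 == 0, f"class {label} has an odd number of rows"
        # no single row may account for more than half of its class
        counts = Counter(tuple(row) for row in members)
        most_common = counts.most_common(1)[0][1]
        assert most_common * 2 <= len(members), f"class {label} has too many identical rows"


# a few rows from file2.txt, already split into lists of strings
sample_rows = [
    ['43', '44', '45', '46', '1', '0.92'],
    ['43', '44', '45', '46', '1', '0.92'],
    ['27', '28', '29', '30', '1', '0.67'],
    ['31', '32', '33', '34', '1', '0.84'],
]
check_class_assumptions(sample_rows)  # passes silently when the assumptions hold
```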
Then use a defaultdict to store the lists for each class.
from collections import defaultdict

classes = defaultdict(list)
for _list in list1:
    classes[_list[4]].append(_list)
Then use sample to draw pairs and confirm they aren't the same. Here I'm including a seed to make the results reproducible but you can take that out for randomness.
from random import sample, seed

seed(1) # remove this line when you want actual randomness
for key, _list in classes.items():
    assert len(_list) & 1 == 0 # each class must also be even, else there is an error in the data
    _list.sort(key=lambda x: x[5])
    pairs = []
    while _list:
        first = _list[-1]
        candidate = sample(_list, 1)[0]
        if first != candidate:
            print(f'first {first}, candidate{candidate}')
            pairs.append((first, candidate))
            _list.remove(first)
            _list.remove(candidate)
    classes[key] = pairs
Note that the sampling approach stated in the edit implicitly assumes that the duplicates are the rows with the highest fitness values. If this is not true, the code could go into an infinite loop.
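One way to avoid the infinite loop is to sample only from rows that differ from the first row and to fail loudly when none are left. The sketch below is a variant, not the code above: pair_rows is a made-up name, and it still takes the highest-fitness row first as described in the edit, so it can still raise on data where the duplicates are not at the top.

```python
from random import Random


def pair_rows(rows, rng=None):
    """Hypothetical variant: pair the rows of one class, never a row with an identical row."""
    rng = rng or Random()
    # sort by fitness (last column) so the best row is at the end
    remaining = sorted(rows, key=lambda row: float(row[-1]))
    pairs = []
    while remaining:
        first = remaining.pop()  # row with the highest fitness
        # only rows that differ from `first` are valid partners
        candidates = [i for i, row in enumerate(remaining) if row != first]
        if not candidates:
            raise ValueError(f"no valid partner left for {first}")
        partner = remaining.pop(rng.choice(candidates))
        pairs.append((first, partner))
    return pairs


class_1 = [
    ['27', '28', '29', '30', '1', '0.67'],
    ['31', '32', '33', '34', '1', '0.84'],
    ['35', '36', '37', '38', '1', '0.45'],
    ['39', '40', '41', '42', '1', '0.82'],
    ['43', '44', '45', '46', '1', '0.92'],
    ['43', '44', '45', '46', '1', '0.92'],
]
pairs = pair_rows(class_1, Random(1))
print(pairs)
```

Passing a seeded Random keeps a run reproducible; leaving rng out gives a different pairing each time.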
If you want to print them then iterate over the dictionary again:
for key, pairs in classes.items():
    print(key, pairs)
which for me gives:
1 [(['43', '44', '45', '46', '1', '0.92'], ['27', '28', '29', '30', '1', '0.67']), (['43', '44', '45', '46', '1', '0.92'], ['31', '32', '33', '34', '1', '0.84']), (['39', '40', '41', '42', '1', '0.82'], ['35', '36', '37', '38', '1', '0.45'])]
2 [(['55', '56', '57', '58', '2', '0.77'], ['51', '52', '53', '54', '2', '0.28']), (['63', '64', '65', '66', '2', '0.41'], ['59', '60', '61', '62', '2', '0.39'])]
3 [(['90', '91', '92', '93', '3', '0.97'], ['75', '76', '77', '78', '3', '0.51'])]
This uses the following values for file2.txt (the leading numbers are row numbers and are not part of the actual file):
1 27,28,29,30,1,0.67
2 31,32,33,34,1,0.84
3 35,36,37,38,1,0.45
4 39,40,41,42,1,0.82
5 43,44,45,46,1,0.92
6 43,44,45,46,1,0.92
7 51,52,53,54,2,0.28
8 55,56,57,58,2,0.77
9 59,60,61,62,2,0.39
10 63,64,65,66,2,0.41
11 75,76,77,78,3,0.51
12 90,91,92,93,3,0.97
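The pairs printed above contain strings because the rows come straight from split(","), whereas the desired output in the question shows numbers. If that matters, the rows could be converted while reading. This is only a sketch; parse_row is a made-up helper, and it assumes the last column is the only float:

```python
def parse_row(line):
    """Hypothetical helper: turn '27,28,29,30,1,0.67' into ints plus a float fitness."""
    *fields, fitness = line.strip().split(",")
    return [int(field) for field in fields] + [float(fitness)]


row = parse_row("27,28,29,30,1,0.67\n")
print(row)  # [27, 28, 29, 30, 1, 0.67]
```

With rows parsed this way, the class key becomes the int 1 rather than the string '1', and sorting on the last column compares floats instead of strings.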