I have two large files with 16000 entries that I want to iterate through, comparing four variables from each and performing some calculations whenever there is a match. The files represent the same set of models but contain somewhat different output, so every model in file 1 has a match in file 2.
File 1 and file 2 are tables in which each column has a header. For instance, file 1 is
# | a1 | b1 | c1 | d1 | age |
---|---|---|---|---|---|
1 | 5 | 33 | 22.1 | 1e20 | 10 |
2 | 2 | 56 | 85.6 | 2e30 | 1 |
... | ... | ... | ... | ... | ... |
And file 2 is
# | a2 | b2 | c2 | d2 | length |
---|---|---|---|---|---|
1 | 9 | 98 | 34.8 | 3e15 | 40 |
2 | 12 | 22 | 10.2 | 5e10 | 20 |
... | ... | ... | ... | ... | ... |
Essentially, a1, b1, c1, d1 and a2, b2, c2, d2 represent the same values/models but in a different order. I want to match them and create a new table that will look like this:
# | a | b | c | d | length | age |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
Intuitively, I'd create two for loops of this type:
for i in range(len(file1)):
    for j in range(len(file2)):
        if a1[i] == a2[j] and b1[i] == b2[j] and c1[i] == c2[j] and d1[i] == d2[j]:
            pass  # some calculations on age and length
I wonder if there is a more robust way that would avoid having a nested for loop.
UPD: I forgot to mention that I need to match the a, b, c, d terms because they describe the model parameters.
CodePudding user response:
You can use itertools.product to merge those nested loops into one loop. It will still be a loop over every pair of rows, but a bit nicer.
import itertools

for i, j in itertools.product(range(len(file1)), range(len(file2))):
    if a1[i] == a2[j] and b1[i] == b2[j] and c1[i] == c2[j] and d1[i] == d2[j]:
        pass  # some calculations stored and returned
CodePudding user response:
If your goal is to get rid of the nested loop, but you are OK with still iterating over each file once, would this work for your case?
s1 = set()
for i in range(len(file1)):
    # Use tuples as set members: lists are unhashable and would raise TypeError
    s1.add((a1[i], b1[i], c1[i], d1[i]))
for j in range(len(file2)):
    if (a2[j], b2[j], c2[j], d2[j]) in s1:
        pass  # perform something
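If you also need age and length side by side for each match, a dict keyed on the (a, b, c, d) tuple may fit better than a set, since it lets you look up the matching row's age directly. Below is a minimal sketch of that idea. The function name `join_models`, the row layout (plain tuples per table row), and the sample rows are all assumptions made for illustration, not something taken from your files. It also assumes the four parameters compare exactly equal (as in your `==` checks) and that each (a, b, c, d) combination appears only once in file 1.

```python
def join_models(rows1, rows2):
    """Match rows on (a, b, c, d) and combine their length and age.

    rows1: iterable of (a, b, c, d, age) tuples     (assumed layout of file 1)
    rows2: iterable of (a, b, c, d, length) tuples  (assumed layout of file 2)
    """
    # Index file 1 once: (a, b, c, d) -> age. Tuples are hashable, lists are not.
    age_by_key = {(a, b, c, d): age for a, b, c, d, age in rows1}

    joined = []
    for a, b, c, d, length in rows2:
        key = (a, b, c, d)
        if key in age_by_key:  # average O(1) lookup instead of an inner loop
            joined.append((a, b, c, d, length, age_by_key[key]))
    return joined


# Made-up sample rows: the same two models, listed in a different order.
rows1 = [(5, 33, 22.1, 1e20, 10), (2, 56, 85.6, 2e30, 1)]
rows2 = [(2, 56, 85.6, 2e30, 20), (5, 33, 22.1, 1e20, 40)]
result = join_models(rows1, rows2)
```

This makes one pass over each file, so for 16000 rows it does roughly 32000 steps rather than 16000 × 16000 comparisons. If c or d come out of different computations and may differ in the last digits, exact matching will miss them, and you would need to round the values before building the key.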