How do I compare every line of the 1st text file to every line of the 2nd text file in Python?

I have two text files, which I'll call f1 and f2, with 100k lines of names each. I want to compare the first line of f1 with every line of f2, then the second line of f1 with every line of f2, and so on. I tried a nested for loop, as in the code below, but it doesn't work.

I can't figure out what I'm doing wrong. Can someone please tell me?

Thanks in advance.

old.txt

sourcreameggnest
saturnnixgreentea
saxophonedesertham
footballplumvirgo
soybeansthesting
cauliflowertornado
sourcreameggnest
saturnnixgreentea

new.txt

goldfishpebbleduck
saxophonedesertham
footballplumvirgo
abloomtheavengers
venisonflowersea
goodfellaswalker
saturnnixgreentea

Code:

with open('old.txt', 'r') as f1, open('new.txt', 'r') as f2:

    for line1 in f1:
        print('Line 1:- ' + line1, end='')

        for line2 in f2:
            print('Line 2:- ' + line2, end='')

            if line1.strip() == line2:
                print("Inside comparison" + line1, end='')

Output:

Line 1:- goldfishpebbleduck
Line 2:- sourcreameggnest
Line 2:- saturnnixgreentea
Line 2:- saxophonedesertham
Line 2:- footballplumvirgo
Line 2:- soybeansthesting
Line 2:- cauliflowertornado
Line 2:- sourcreameggnest
Line 2:- saturnnixgreentea
Line 1:- saxophonedesertham
Line 1:- footballplumvirgo
Line 1:- abloomtheavengers
Line 1:- venisonflowersea
Line 1:- goodfellaswalker
Line 1:- saturnnixgreentea

CodePudding user response：

Considering the number of lines in the files, I would avoid the nested-loop (O(n^2)) approach entirely and load the lines of the second text file into a dictionary (if you need extra information about each line, for example because lines can be repeated), or into a set otherwise.

Then I would loop over the lines in the first file, check whether each one is in the dictionary, and act accordingly. This uses extra space linear in the number of lines in the second file, but it reduces the time complexity to O(n), since dictionary lookups take constant time on average.
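A minimal sketch of that idea, assuming both sides should be stripped before comparing. The function names common_lines and matches_with_positions are made up for illustration, and the sample lists are the first few lines of old.txt and new.txt from the question:

```python
from collections import defaultdict


def common_lines(lines_a, lines_b):
    """Return the stripped lines of lines_a that also occur in lines_b."""
    # Build the lookup once; membership tests on a set are O(1) on average.
    lookup = {line.strip() for line in lines_b}
    return [line.strip() for line in lines_a if line.strip() in lookup]


def matches_with_positions(lines_a, lines_b):
    """Map each matching 1-based line number in lines_a to the list of
    1-based line numbers in lines_b that hold the same text."""
    positions = defaultdict(list)
    for number, line in enumerate(lines_b, start=1):
        positions[line.strip()].append(number)
    stripped_a = (line.strip() for line in lines_a)
    return {
        number: positions[text]
        for number, text in enumerate(stripped_a, start=1)
        if text in positions
    }


# Sample data taken from the question (trailing newlines as read from a file).
old_lines = ["sourcreameggnest\n", "saturnnixgreentea\n", "saxophonedesertham\n"]
new_lines = ["goldfishpebbleduck\n", "saxophonedesertham\n", "saturnnixgreentea\n"]

print(common_lines(old_lines, new_lines))
print(matches_with_positions(old_lines, new_lines))
```

Both functions accept any iterable of lines, so open file objects can be passed directly, for example common_lines(f1, f2) inside the with block. Each file is then read only once.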

As for why your current solution fails: as pointed out by @Thierry Lathuille, the iterator over the second file is exhausted after the first pass of the outer loop, so the inner loop does nothing on the remaining iterations. One mitigation is to read the lines of each file into a list that you can loop over repeatedly (lines1 = f1.readlines(); lines2 = f2.readlines()). Also, your use of strip is not correct if you intend to skip whitespace-only lines: they will still be compared, as empty strings, with the added downside that stripping one line and not the other can create unwanted differences.

In any case, for such large numbers (100k lines in each file means up to 10^10 comparisons), an approach of quadratic time complexity is not feasible.
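The exhaustion effect can be reproduced without any files on disk. The sketch below uses io.StringIO as a stand-in for an open text file (an assumption made for illustration; a real file object behaves the same way when iterated) and also shows the mismatch caused by stripping only one side:

```python
import io

# io.StringIO stands in for an open text file here.
f2 = io.StringIO("saxophonedesertham\nfootballplumvirgo\n")

first_pass = [line for line in f2]   # reads every line
second_pass = [line for line in f2]  # already at end of file: nothing left

f2.seek(0)                           # rewind to the start
third_pass = [line for line in f2]   # reads every line again

print(len(first_pass), len(second_pass), len(third_pass))

# Stripping only one side makes equal names compare as different.
line1 = "footballplumvirgo\n"
line2 = "footballplumvirgo\n"
print(line1.strip() == line2)          # False: line2 still ends with '\n'
print(line1.strip() == line2.strip())  # True
```

Rewinding with seek(0) inside the outer loop would make the nested loop work, but it re-reads the second file once per line of the first, so it stays quadratic.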

CodePudding user response：

After the first pass of the outer loop you have already read f2 to its end, so the inner loop has nothing left to iterate over. (Btw, I didn't know you could just loop over an opened file.) Just store the lines first. Also, I don't see why you would strip the '\n' from only one of the lines.

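with open('old.txt', 'r') as f1, open('new.txt', 'r') as f2:
    # Read both files into lists so the inner loop can run more than once.
    lines1 = f1.readlines()
    lines2 = f2.readlines()
    for line1 in lines1:
        print('Line 1:- ' + line1, end='')

        for line2 in lines2:
            print('Line 2:- ' + line2, end='')

            # Both lines keep their trailing '\n', so compare them as-is.
            if line1 == line2:
                print("Inside comparison" + line1, end='')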