Finding duplicates in two separate txt files line by line and printing only the duplicates

TL;DR: Open two txt files, use one to search the other, and then print any duplicates.

Hi everyone, this is my first time posting here and I'm very new to coding and Python. I've been searching for an answer but can't find anything that uses .txt files the way I'm trying to. I want to search test2 for a single string or a group of strings taken from the file test. I'm using txt files because they contain thousands of different strings, so manually entering each value into a Python list isn't feasible.

from itertools import chain

f1 = open(r"test.txt", "r")
f2 = open(r"test2.txt", "r")
file1 = f1.read().splitlines()
file2 = f2.read().splitlines()
x = [file1]
y = [file2]
z = list(chain([x,y]))
z.sort()
d = (x for x in z if z.count (x) > 1)
print (d)
f1.close()
f2.close()

The result I get is this:

<generator object <genexpr> at 0x7f10cc992420>

I expected to get a printout of any duplicates found in the combined list I created with list(chain()). Any help or suggestions would be greatly appreciated!

CodePudding user response：

Expanding on my comment. It seems like you are just willy-nilly tossing square brackets around things hoping things will work, but in every instance you are using square brackets, you shouldn't be.

.splitlines() already returns a list. You don't have to take that return and put it inside of another list.

chain() takes each iterable as its own argument, so sticking your two lists inside of yet another list and passing that as a single argument isn't going to do what you want.

These mistakes are all pretty easy to catch with some basic debugging. For instance, if you had tossed in a print(x) after setting that variable, you would have found it prints [['stuff', 'from', 'file', '1']]. Same with y: [['stuff', 'from', 'file', '2']]. You have a list inside of another list.

You could also do this for the argument you pass into chain(). print([x,y]) would show [[['stuff','from','file','1']],[['stuff','from','file','2']]] list-ception.
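To make the difference concrete, here is a minimal sketch of how chain() treats one nested argument versus two separate ones. The short lists are made-up stand-ins for the lines of test.txt and test2.txt, not the real file contents.

```python
from itertools import chain

# Made-up stand-ins for the lines read from test.txt and test2.txt
file1 = ["apple", "match"]
file2 = ["match", "pear"]

# One argument (a list holding two lists): chain yields the two inner lists unchanged
nested = list(chain([file1, file2]))

# Two arguments: chain yields every string from file1, then every string from file2
flat = list(chain(file1, file2))

print(nested)  # [['apple', 'match'], ['match', 'pear']]
print(flat)    # ['apple', 'match', 'match', 'pear']
```

If the lists are already collected inside another list, chain.from_iterable([file1, file2]) flattens one level and gives the same result as chain(file1, file2).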

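Lastly, the one spot where you probably do want square brackets is the expression you assign to d. With parentheses it is a generator expression, which is why print shows a generator object. Switch to square brackets and it becomes a list comprehension.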

Instead:

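from itertools import chain

# Read each file into a list of lines; "with" closes the files automatically
with open(r"test.txt", "r") as f1, open(r"test2.txt", "r") as f2:
    file1 = f1.read().splitlines()
    file2 = f2.read().splitlines()

# Pass the two lists to chain() as separate arguments
z = list(chain(file1, file2))
z.sort()
# Square brackets build a list instead of a generator
d = [x for x in z if z.count(x) > 1]
print(d)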

This will spit out ['match', 'match'] (assuming the one word that matches in both files is the word 'match'). The word is printed twice because it appears twice in the combined list.
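If each duplicate should be reported only once, one option (a swapped-in technique, not what the code above does) is to count the lines with collections.Counter. The helper name duplicated_lines and the sample list are hypothetical. With thousands of lines this also avoids calling z.count() once per element.

```python
from collections import Counter

def duplicated_lines(lines):
    """Return each line that occurs more than once, in first-seen order (hypothetical helper)."""
    counts = Counter(lines)  # maps each line to how many times it appears
    return [line for line, n in counts.items() if n > 1]

# Made-up stand-in for the combined lines of both files
combined = ["apple", "match", "match", "pear"]
print(duplicated_lines(combined))  # ['match']
```

Like the z.count() version, this also reports a line that is repeated within a single file, not only lines shared between the two files.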

CodePudding user response：

To remove duplicates you can use a set:

f1 = open(r"test.txt", "r")
f2 = open(r"test2.txt", "r")
file1 = f1.read().splitlines()
file2 = f2.read().splitlines()
d = list(set(file1 + file2))  # Combine and remove duplicates
print(d)
f1.close()
f2.close()
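The set above merges the two files and drops repeats. If the goal is instead the lines that occur in both files, as in the question, a set intersection is one possible approach. This is only a sketch: the helper names read_lines and common_lines are hypothetical, and the sample lists stand in for real file contents.

```python
def read_lines(path):
    """Read a text file and return its lines without trailing newlines (hypothetical helper)."""
    with open(path, "r") as f:
        return f.read().splitlines()

def common_lines(lines_a, lines_b):
    """Return the sorted lines that occur in both inputs (hypothetical helper)."""
    return sorted(set(lines_a) & set(lines_b))

# Made-up stand-ins for the two files
print(common_lines(["apple", "match"], ["match", "pear"]))  # ['match']
```

With the real files this would be called as common_lines(read_lines("test.txt"), read_lines("test2.txt")). Sets ignore how often a line repeats inside one file, so each shared line is reported once.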

NB: (x for x in z if z.count(x) > 1) creates a generator. You are probably looking for a list comprehension, which looks like this:

[x for x in z if z.count(x) > 1]
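A small sketch of the difference, using a made-up list in place of the combined file lines: the generator expression is lazy and only produces values when something iterates over it, while the list comprehension builds the whole list immediately.

```python
# Made-up stand-in for the sorted, combined lines of both files
z = ["apple", "match", "match", "pear"]

gen = (x for x in z if z.count(x) > 1)      # generator expression: nothing computed yet
as_list = [x for x in z if z.count(x) > 1]  # list comprehension: built right away

print(gen)        # shows a generator object, not the values
print(list(gen))  # ['match', 'match']
print(as_list)    # ['match', 'match']
```

A generator can only be consumed once, so calling list(gen) a second time would give an empty list.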