Removing duplicates from a text file using Python

I have a text file; let's say it contains these 10 lines:

Bye
Hi
2
3
4
5
Hi
Bye
7
Hi

Every time "Hi" or "Bye" appears again, I want that line removed, keeping only the first occurrence of each. My current code is below (filename does point to an actual file; I just left the assignment out of this snippet):

text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            # del only unbinds the local name; the file itself is unchanged
            del line
text_file.close()

It does detect the duplicates, but it takes a very long time given how many lines the file has, and I'm not sure how to delete the duplicates and save the result.

CodePudding user response:

You could use dict.fromkeys to remove duplicates and preserve order efficiently:

with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)

Idea from Raymond Hettinger
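
One caveat worth checking, offered as a minimal sketch rather than as part of the answer above: readlines() keeps each line's trailing newline, so if the last line of the file has no newline, a final "Hi" will not compare equal to an earlier "Hi\n" and will survive as a duplicate. The helper below (dedupe_keep_order is a hypothetical name, not something from the answer) strips the line terminator before deduplicating and adds it back afterwards:

```python
def dedupe_keep_order(lines):
    """Return lines with later duplicates dropped; first occurrences keep their order."""
    # Strip only the line terminator so "Hi" and "Hi\n" count as the same line.
    stripped = (line.rstrip("\n") for line in lines)
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7).
    return [line + "\n" for line in dict.fromkeys(stripped)]


# The sample from the question; note the last "Hi" has no trailing newline.
sample = ["Bye\n", "Hi\n", "2\n", "3\n", "4\n", "5\n", "Hi\n", "Bye\n", "7\n", "Hi"]
print(dedupe_keep_order(sample))
# ['Bye\n', 'Hi\n', '2\n', '3\n', '4\n', '5\n', '7\n']
```

The returned list can be passed straight to f.writelines() in place of the dict.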

CodePudding user response:

Using a set & some basic filtering logic:

with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, write the lines to result
            deduped.append(line)
        seen.add(line)

# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])
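
If the file is too large to hold comfortably in a list, or if only certain lines such as "Hi" and "Bye" should be deduplicated (as the question describes), something like the following may work. It is a sketch under assumptions: dedupe_file and its only parameter are hypothetical names, and it streams the result to a temporary file in the same directory before swapping it in with os.replace, so the original is not truncated until the new contents are fully written.

```python
import os
import tempfile


def dedupe_file(path, only=None):
    """Rewrite `path` in place, dropping repeated lines.

    If `only` is a set of strings, just those lines are deduplicated and
    every other line is kept as-is, even when it repeats.
    """
    seen = set()
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write to a temp file next to the original so os.replace stays on one filesystem.
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=dir_name, delete=False
    ) as dst:
        for line in src:
            key = line.rstrip("\n")
            if only is not None and key not in only:
                dst.write(key + "\n")  # not a target line: always keep
                continue
            if key not in seen:  # first occurrence: keep and remember it
                seen.add(key)
                dst.write(key + "\n")
    # Swap the deduplicated file in for the original.
    os.replace(dst.name, path)
```

Calling dedupe_file(filename, only={"Hi", "Bye"}) would remove only the repeated "Hi" and "Bye" lines, while dedupe_file(filename) removes every repeated line. The set of seen lines is still held in memory, but set lookups are constant time on average, so the pass over the file stays linear.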