Removal of Duplicate data from Text file and to store in Database in Python


I know that there are several answers regarding removing duplicate data from a text file, but my scenario is quite different from those questions. In my case I have two text files, 'file1.txt' and 'file2.txt', with the following data:

file1.txt:

admin:2222
admin:meunsm
admin:12345
stack:0000
csanders:1111

The line format in file1.txt is:

username:password

file2.txt:

192.168.0.114:1137   >   192.168.0.193:21 csanders:echo

The line format in file2.txt is:

source ip: source port > destination ip: destination port username:password
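For reference, a line in this format can be pulled apart with plain string splitting; the last whitespace-separated field is always the `username:password` segment (a minimal sketch, using the sample line above):

```python
line = "192.168.0.114:1137   >   192.168.0.193:21 csanders:echo"

# Whitespace split yields:
# ['192.168.0.114:1137', '>', '192.168.0.193:21', 'csanders:echo']
parts = line.split()
user_pass = parts[-1]                      # the trailing username:password field
username, password = user_pass.split(':', 1)
print(username)  # csanders
```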

Now, the situation is that I am comparing the data of these two text files in Python. If a line of file1.txt does not exist in file2.txt, the result must be stored in newfile.txt, and the output in newfile.txt should contain only the username and nothing else.

Here is the code that currently writes this output to newfile.txt:

testing.py:

with open('file1.txt') as file1:
    with open('file2.txt') as file2:
        newfile = open('newfile.txt', 'w')
        f1_lines = file1.readlines()
        f2_lines = file2.readlines()
        different_lines = []

        for line1 in f1_lines:
            init = False
            for line2 in f2_lines:
                if line1 in line2:
                    init = True
            if not init:
                different_lines.append(line1)
            var = ""
            # Note: this loop runs once per line of file1, so usernames
            # collected on earlier passes are written again each time --
            # this is what produces the duplicates in newfile.txt
            for line in different_lines:
                new_var = f"{line.strip().split(':')[0]}\n"
                var = new_var
                newfile.write(new_var)
            print(var)
        newfile.close()

Now, I want the data in newfile.txt never to be repeated, in any situation. The output should look like this:

newfile.txt:

admin
stack

CodePudding user response:

Type converting a list into a set will remove any duplicate data (as commented by @mkrieger1), since duplicates are not allowed in a set. Doing this before your comparisons also helps performance, because there are fewer items to compare. So something like this might help you:

with open('file1.txt', 'r') as file1:
    f1_lines = file1.readlines()
with open('file2.txt', 'r') as file2:
    f2_lines = file2.readlines()

for index, line in enumerate(f1_lines):
    f1_lines[index] = line.strip().split(':')[0].lower()  # Just want the usernames
f1_lines = set(f1_lines)  # Remove duplicate usernames (if any) from file1 user list by type converting to a set

for index, line in enumerate(f2_lines):
    f2_lines[index] = line.strip().split()[-1]  # Get the username:password segment
    f2_lines[index] = f2_lines[index].split(':')[0].lower()  # Just want the username from that segment
f2_lines = set(f2_lines)  # Remove duplicate usernames (if any) from file2 user list by type converting to a set

output_file = []
for f1_line in f1_lines:
    for f2_line in f2_lines:
        if f1_line == f2_line:
            break
    else:
        # f2_lines iterable exhausted without a break, meaning no match
        # was found, so the username goes to output_file
        output_file.append(f"{f1_line}\n")
output_file = sorted(output_file)

with open('newfile.txt', 'w') as f:
    f.writelines(output_file)
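The explicit nested loops with `for`/`else` can also be collapsed into a single set difference, which performs the same membership test in one step (a sketch, using the usernames from the sample files above):

```python
f1_users = {"admin", "stack", "csanders"}  # usernames parsed from file1.txt
f2_users = {"csanders"}                    # usernames parsed from file2.txt

# Usernames present in file1 but absent from file2, in sorted order
missing = sorted(f1_users - f2_users)

with open('newfile.txt', 'w') as f:
    f.writelines(f"{name}\n" for name in missing)
```

Set difference is O(len(f1_users)) on average, versus the O(n*m) pairwise comparison of the nested loops.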