Delete duplicate reversed lines from a text file in Python

Time:10-19

I have this file (with thousands of lines). Each line contains two numbers separated by whitespace:

3466    937
3466    5233
3466    8579
3466    10310
3466    15931
3466    17038
3466    18720
3466    19607
10310   1854
10310   3466
10310   4583
10310   5233
10310   9572
10310   10841
10310   13056
10310   14982
10310   16310

I need to delete, in Python, lines that repeat another line in reverse order. For example, 10310 3466 and 3466 10310 should appear only once (as either 10310 3466 or 3466 10310). Any ideas? Thank you.

CodePudding user response:

One approach is to use frozenset to generate keys that are order-insensitive:

# change data.csv to the name of your file
with open("data.csv") as infile:
    uniques = set(frozenset(line.strip().split()) for line in infile)
    for value in uniques:
        print(*value)

Output (for the input given; sets are unordered, so the line order may differ between runs)

10310 3466
5233 10310
10310 4583
19607 3466
1854 10310
3466 8579
10310 9572
10310 13056
10310 14982
5233 3466
17038 3466
15931 3466
10310 10841
937 3466
18720 3466
16310 10310

Alternatively, use sorted to convert both orderings of a pair to the same key:

# change data.csv to the name of your file
with open("data.csv") as infile:
    uniques = set(" ".join(sorted(line.strip().split())) for line in infile)
    for value in uniques:
        print(value)
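
A minimal sketch of why the sorted key works. The helper name pair_key is hypothetical and only used for illustration. Note that the values are sorted as strings, so the order is lexicographic rather than numeric; that is fine here because the key only has to be the same for both orderings.

```python
def pair_key(line):
    # Split on any whitespace, sort the two tokens as strings,
    # and join them back into one canonical key.
    return " ".join(sorted(line.split()))


print(pair_key("10310   3466"))   # 10310 3466
print(pair_key("3466    10310"))  # 10310 3466
print(pair_key("10310   3466") == pair_key("3466    10310"))  # True
```

Unlike the frozenset version, this key keeps both tokens when a line repeats the same number twice (for example "5 5" stays "5 5").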

To better understand the approach using frozenset, see the code below:

frozenset((1, 2)) == frozenset((2, 1))
Out[2]: True

As can be seen, two frozensets are equal regardless of the order of the tuples used as input. This also holds for regular sets, but frozensets are additionally hashable. From the documentation:

The frozenset type is immutable and hashable — its contents cannot be altered after it is created; it can therefore be used as a dictionary key or as an element of another set.
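
The difference can be checked directly. The sketch below uses two values from the question as strings; the variable names are chosen only for illustration.

```python
pair = frozenset(("3466", "10310"))
reversed_pair = frozenset(("10310", "3466"))

# Both orderings hash to the same element, so the outer set holds one entry.
seen = {pair, reversed_pair}
print(len(seen))  # 1

# A regular set is mutable and unhashable, so it cannot be a set element.
try:
    {set(("3466", "10310"))}
except TypeError as exc:
    print("regular set rejected:", exc)
```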

Note

To write the de-duplicated lines to a new file do:

# change data.csv to the name of your file
with open("data.csv") as infile:
    uniques = set(frozenset(line.strip().split()) for line in infile)

    # change output.csv to the name of your output file
    with open("output.csv", mode="w") as outfile:
        for value in uniques:
            outfile.write(f'{" ".join(value)}\n')
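
Because the snippets above collect everything into a set, the output lines come out in arbitrary order. If the original order matters, one possible variation (a sketch, not part of the original answer) keeps the first occurrence of each pair and skips later reversals. The function name dedupe_pairs and the sample list are assumptions for illustration.

```python
def dedupe_pairs(lines):
    """Return the first occurrence of each unordered pair, in input order."""
    seen = set()
    kept = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        key = frozenset(parts)  # order-insensitive key
        if key not in seen:
            seen.add(key)
            kept.append(" ".join(parts))
    return kept


sample = ["3466    937", "3466    10310", "10310   1854", "10310   3466"]
print(dedupe_pairs(sample))
# ['3466 937', '3466 10310', '10310 1854']
```

To use it on a file, pass the open file object (for example the infile from the snippet above) instead of the sample list.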

CodePudding user response:

It seems the order of the numbers is not important, so you could do it like this:

filename = 'data.txt'

pairs = []

with open(filename) as file:
    for line in file:
        nums = line.split()
        if len(nums) < 2:
            continue  # skip blank or malformed lines
        a, b = int(nums[0]), int(nums[1])
        # put the smaller number first so both orderings give the same string
        low, high = min(a, b), max(a, b)
        pairs.append(str(low) + ' ' + str(high))

unique_pairs = set(pairs)
with open("output.txt", mode="w") as outfile:
    for pair in unique_pairs:
        outfile.write(pair + '\n')