How to find common values in X files that do not occur in Y files

Time:01-13

I have 90 .txt files, each containing a single column of words. I want to find the words that occur in files 1-30 but not in files 31-90.

The files are named 1.txt, 2.txt, and so on.

Is there a quick way to do this with awk, python or bash?

CodePudding user response:

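A one-liner using bash and the shell utilities sort and comm. It prints the words that appear in at least one of files 1-30 and in none of files 31-90. The -u flag removes duplicates first; without it, comm pairs repeated lines one-to-one, so a word repeated across files 1-30 could still be printed even though it also occurs in files 31-90:

comm -23 <(sort -u {1..30}.txt) <(sort -u {31..90}.txt)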

CodePudding user response:

You can use Python's set operations for this task, as follows:

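def file_to_set(fname):
    # Read one word per line, stripping surrounding whitespace
    with open(fname, "r") as f:
        return set(line.strip() for line in f)

words = file_to_set("1.txt")
for i in range(2, 31):   # 2.txt .. 30.txt: keep only words present in every file
    words = words.intersection(file_to_set(str(i) + ".txt"))
for i in range(31, 91):  # 31.txt .. 90.txt: drop words present in any of these files
    words = words.difference(file_to_set(str(i) + ".txt"))
print(words)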

Explanation: file_to_set reads a file, strips leading and trailing whitespace from each line, and returns the lines as a set. words starts as the set built from 1.txt. For 2.txt through 30.txt (range includes its start and excludes its end, so range(2, 31) stops at 30), words is replaced by its intersection with the current file's set, which leaves only the words common to every file read so far. For 31.txt through 90.txt, every word found in the current file is removed from words. Finally, words is printed.
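If you need this for other file groupings, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the names read_words and common_but_excluded are made up for illustration, files are assumed to be UTF-8, and blank lines are skipped (the code above keeps them as empty strings). It also stops early once no candidate words are left.

```python
def read_words(path):
    """Return the set of non-empty, whitespace-stripped lines in a file."""
    # Assumption: files are UTF-8 text with one word per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def common_but_excluded(include_paths, exclude_paths):
    """Words present in every include file and in none of the exclude files."""
    include_paths = list(include_paths)
    if not include_paths:
        return set()

    # Start from the first file, then narrow down with each further file.
    words = read_words(include_paths[0])
    for path in include_paths[1:]:
        words &= read_words(path)   # set intersection, in place
        if not words:               # nothing in common; no need to read more
            return words

    # Remove anything that shows up in an excluded file.
    for path in exclude_paths:
        words -= read_words(path)   # set difference, in place
        if not words:
            break
    return words
```

For the question as asked, the call would be common_but_excluded([f"{i}.txt" for i in range(1, 31)], [f"{i}.txt" for i in range(31, 91)]). Note that this requires a word to be in all of files 1-30; if "in any of files 1-30" is what you want, replace the intersection step with a union (|=).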
