I have 90 txt files with one column only. I want to find words occurring in files 1-30 but not in files 31-90.
The files are named 1.txt, 2.txt, and so on.
Is there a quick way to do this with awk, python or bash?
CodePudding user response:
A one-liner using bash process substitution and the shell utilities sort and comm:
comm -23 <(sort -u {1..30}.txt) <(sort -u {31..90}.txt)
sort -u deduplicates each group, and comm -23 prints the lines unique to the first input (i.e. words in files 1-30 but not in files 31-90).
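A minimal sketch of the same pipeline on two small files per group (hypothetical contents, standing in for the 30/60 files):

```shell
#!/usr/bin/env bash
# Demo of the comm approach with sample data (hypothetical file contents).
tmp=$(mktemp -d)
cd "$tmp"

printf 'apple\nbanana\ncherry\n' > 1.txt
printf 'banana\ndate\n'          > 2.txt    # stand-in for group 1-30
printf 'cherry\nfig\n'           > 31.txt
printf 'date\n'                  > 32.txt   # stand-in for group 31-90

# Lines present in the first (sorted, deduplicated) group but not the second:
comm -23 <(sort -u 1.txt 2.txt) <(sort -u 31.txt 32.txt)
```

Here cherry and date also occur in the second group, so only apple and banana are printed. Note that process substitution (`<(...)`) is a bash feature, not plain POSIX sh.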
CodePudding user response:
You might harness python's set arithmetic for this task as follows:
def file_to_set(fname):
    with open(fname, "r") as f:
        return set(line.strip() for line in f)

words = file_to_set("1.txt")
for i in range(2, 31):
    words = words.intersection(file_to_set(str(i) + ".txt"))
for i in range(31, 91):
    words = words.difference(file_to_set(str(i) + ".txt"))
print(words)
Explanation: file_to_set reads a file, strips leading and trailing whitespace from each line, and turns the lines into a set. words starts as the set built from 1.txt; then for 2.txt through 30.txt (range is inclusive-exclusive, hence range(2, 31)) I intersect words with the current file's set, keeping only the words seen in every file so far; then for 31.txt through 90.txt I remove from words every element present in those files. Finally I print words.
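The same approach can be packaged as a reusable function (a sketch under the same interpretation, words common to all of the first group; words_only_in_first is a name introduced here, not from the original answer):

```python
from functools import reduce

def file_to_set(fname):
    """Return the set of whitespace-stripped lines in a file."""
    with open(fname) as f:
        return {line.strip() for line in f}

def words_only_in_first(first_files, rest_files):
    # Intersection of the first group, minus the union of the rest.
    common = reduce(set.intersection, (file_to_set(f) for f in first_files))
    seen_elsewhere = set().union(*(file_to_set(f) for f in rest_files))
    return common - seen_elsewhere

# For the question's layout (assuming the files sit in the current directory):
# result = words_only_in_first([f"{i}.txt" for i in range(1, 31)],
#                              [f"{i}.txt" for i in range(31, 91)])
```

This keeps the file ranges out of the logic, so switching to the "in any of files 1-30" reading (as the comm answer does) only means replacing the intersection with a union.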