How to compare filenames in two text files on Linux bash?

I have two lists, list1 and list2, with one filename per line. I want all filenames that appear only in list2 and not in list1, ignoring certain file extensions (but not all of them) when comparing. This should use Linux bash and only commands that need no extra installation. In the example lists, I know every extension I want to ignore. I made an attempt, but it does not work at all and I don't know how to fix it. Apologies for my inexperience.

I wish to ignore the following extensions: .x .xy .yx .y .jpg

list1.txt

text.x
example.xy
file.yx
data.y
edit
edit.jpg

list2.txt

text
rainbow.z
file
data.y
sunshine
edit.test.jpg
edit.random

result.txt

rainbow.z
sunshine
edit.test.jpg
edit.random

My try:

while read LINE
    do
    line2=$LINE
    sed -i 's/\.x$//g' $LINE $line2
    sed -i 's/\.xy$//g' $LINE $line2
    sed -i 's/\.yx$//g' $LINE $line2
    sed -i 's/\.y$//g' $LINE $line2 
    then sed -i -e '$line' result.txt;
    fi
done < list2.txt

Edit: I forgot two requirements. The filenames can contain a . themselves, and not every filename has an extension. I know the extensions that must be ignored. I amended the lists accordingly.

CodePudding user response：

An awk solution might be more efficient for this task:

awk '
              { f=$0; sub(/\.(xy?|yx?|jpg)$/,"",f) }
    NR==FNR   { a[f]; next }
    !(f in a)
' list1.txt list2.txt > result.txt
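
For readers who want to see what each rule does, here is the same idea as an annotated sketch. The function name only_in_second_awk and the exts variable are my own additions for illustration, not part of the answer above; the logic is meant to be the same.

```bash
# Hypothetical wrapper: print the lines of file 2 whose name, minus one
# ignored extension, does not occur in file 1.
only_in_second_awk() {
    awk -v exts='x|xy|yx|y|jpg' '
        BEGIN     { re = "\\.(" exts ")$" }        # builds \.(x|xy|yx|y|jpg)$
                  { key = $0; sub(re, "", key) }   # strip one ignored extension
        NR == FNR { seen[key]; next }              # first file: remember the key
        !(key in seen)                             # second file: print unseen originals
    ' "$1" "$2"
}
```

Called as only_in_second_awk list1.txt list2.txt > result.txt, this should print the original lines of list2.txt in their original order. One caveat of the NR==FNR idiom: if list1.txt is empty, the lines of list2.txt are treated as the first file and nothing is printed.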

CodePudding user response：

comm can do precisely this.

You can preprocess the input:

strip the suffixes
sort (comm expects sorted input)
remove duplicates

ss()( sed 's/\.\(x\|xy\|yx\|y\|jpg\)$//' "$@" | sort -u )

comm -13 <(ss list1.txt) <(ss list2.txt) >result.txt
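
Note that this pipeline prints the stripped, sorted names (edit.test rather than edit.test.jpg). If the original names are wanted, one possible variation is to carry each original line alongside its stripped key and drop the key at the end. This is only a sketch: the names strip_keys and only_in_second_join are made up for illustration, it swaps comm for join, and it assumes filenames contain no tab characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper: strip one ignored extension from every line of a file.
strip_keys() { sed -E 's/\.(x|xy|yx|y|jpg)$//' "$1"; }

# Hypothetical helper: print the original lines of file 2 whose stripped
# name is absent from file 1. Assumes filenames contain no tab characters.
only_in_second_join() {
    local tab=$'\t'
    # File 1: unique stripped keys. File 2: "key<TAB>original", sorted by key.
    LC_ALL=C join -t "$tab" -v 2 \
        <(strip_keys "$1" | LC_ALL=C sort -u) \
        <(paste <(strip_keys "$2") "$2" | LC_ALL=C sort -t "$tab" -k1,1) |
        cut -f2-    # drop the key, keep the original name
}
```

Usage would be only_in_second_join list1.txt list2.txt > result.txt. The output is ordered by stripped key, not by the order of list2.txt.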

Your code was:

while read LINE
    do
    line2=$LINE
    sed -i 's/\.x$//g' $LINE $line2
    sed -i 's/\.xy$//g' $LINE $line2
    sed -i 's/\.yx$//g' $LINE $line2
    sed -i 's/\.y$//g' $LINE $line2 
    then sed -i -e '$line' result.txt;
    fi
done < list2.txt

Some issues that immediately jump out:

syntax error - then/fi but no matching if
you never access list1
you don't quote variables when you use them, so whitespace and special characters will cause problems
while read ... sed ... sed ... sed ... is inefficient - multiple invocations of sed instead of just one, and a loop that sed would perform implicitly
sed expects file arguments not strings
sed -i will try to overwrite input file arguments
you use result.txt as both input and output to sed but never assign any contents to it
you try to use data ($line) as sed commands, instead of applying sed commands to that data
because you used single quotes, the shell never expands $line (and no variable named line is ever set anyway); sed reads '$line' as the l command applied to the last line of input ($) followed by stray characters, and rejects it
g option to s/// does nothing when search is anchored
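
If you do want to keep a while read loop, the points above could be addressed roughly as follows. This is a sketch, not the only way to do it: it assumes bash 4 or later (for associative arrays), and the names set_key and only_in_second_loop are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set the variable "key" to NAME minus one trailing
# ignored extension (or to NAME unchanged if none matches).
set_key() {
    local ext
    key=$1
    for ext in .x .xy .yx .y .jpg; do
        if [[ $1 == *"$ext" ]]; then
            key=${1%"$ext"}
            return
        fi
    done
}

# Hypothetical helper: print lines of file 2 whose key is not in file 1.
only_in_second_loop() {
    local -A seen=()
    local line key
    # Pass 1: remember every stripped name from the first list.
    while IFS= read -r line || [[ -n $line ]]; do
        set_key "$line"
        seen[_$key]=1        # "_" prefix avoids an empty array subscript
    done < "$1"
    # Pass 2: print the original line when its stripped name was not seen.
    while IFS= read -r line || [[ -n $line ]]; do
        set_key "$line"
        [[ -n ${seen[_$key]+set} ]] || printf '%s\n' "$line"
    done < "$2"
}
```

Usage would be only_in_second_loop list1.txt list2.txt > result.txt. No sed is involved at all, since the suffix is removed with parameter expansion. For large lists this is still likely to be slower than a single awk or sed invocation.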

CodePudding user response：

I'd use join:

$ join -t. -j1 -v2 -o 2.1,2.2 <(sort list1.txt) <(sort list2.txt) | sed 's/\.$//'
rainbow.z
sunshine

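(The bit of sed is needed to turn sunshine. into sunshine. Note that join -t. compares only the text before the first ., so every extension is ignored and names that themselves contain a ., such as edit.test.jpg, are not handled.)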