How do I compare lines in two files WITHOUT respect to their position in those files (set difference)?

diff and similar tools seem to compare files, not content that happens to be in the form of lines in files. That is, they consider the position of each line in the file as significant and part of the comparison.

What about when you just don't care about position? I simply want to compare two lists as a set operation, without any regard to position. Each line can be considered a list element. So I'm looking for the lines that are in file1 but not in file2, and the lines that are in file2 but not in file1.

I don't want to see positional information or do a pairwise comparison, just a result set for each operation. For example:

SET1: a b c d f g

SET2: a b c e g h

SET1 - SET2 = d f

SET2 - SET1 = e h

Can I do this easily in bash? Obviously it's fine to sort the lists first, but sorting is not intrinsically a prerequisite to working with sets.

CodePudding user response：

If your fields are all separated by spaces, you can use a four-step process:

split
sort
compare
aggregate

tr " " "\n" < file1 | sort > file1.tmp
tr " " "\n" < file2 | sort > file2.tmp
diff file1.tmp file2.tmp | tee /tmp/result

and now, aggregate the results:

echo "file1: $(diff file1.tmp file2.tmp | grep "^<" | tr -d \< | tr "\n" " " )"
echo "file2: $(diff file1.tmp file2.tmp | grep "^>" | tr -d \> | tr "\n" " " )"
rm /tmp/result file1.tmp file2.tmp

Finally, to match your field format exactly, append | tr -s " " to both echo commands.
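The sort and compare steps above can also be sketched with comm, which reads two sorted files and can suppress the columns you don't want, so there is no diff output to post-process. This is only a sketch: set_minus is a made-up helper name, and it assumes one element per line (run the tr split first if your elements are space-separated).

```shell
#!/bin/sh
# set_minus is a hypothetical helper name, not part of the answer above.
# Prints the lines found in file $1 but not in file $2, ignoring position.
set_minus() {
    _a=$(mktemp) && _b=$(mktemp) || return 1
    sort -u "$1" > "$_a"    # comm needs sorted input; -u drops duplicates
    sort -u "$2" > "$_b"
    comm -23 "$_a" "$_b"    # -2 -3: hide lines unique to $2 and lines in both
    rm -f "$_a" "$_b"
}

# Demo with the sets from the question, one element per line.
tmpdir=$(mktemp -d)
printf '%s\n' a b c d f g > "$tmpdir/set1"
printf '%s\n' a b c e g h > "$tmpdir/set2"

only1=$(set_minus "$tmpdir/set1" "$tmpdir/set2" | paste -s -d ' ' -)
only2=$(set_minus "$tmpdir/set2" "$tmpdir/set1" | paste -s -d ' ' -)
echo "SET1 - SET2 = $only1"    # prints: SET1 - SET2 = d f
echo "SET2 - SET1 = $only2"    # prints: SET2 - SET1 = e h
rm -rf "$tmpdir"
```

Because duplicates are dropped by sort -u, this treats the files as true sets; drop the -u if repeated lines should count.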

CodePudding user response：

Assuming you want to do full-line string comparisons and consider counts of lines rather than just appearances of lines as differences, this might do what you want (untested):

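awk '
    # First file: count occurrences of each line.
    NR==FNR {
        set1[$0]++
        next
    }
    # Second file: the line still has an unmatched occurrence in the first file.
    $0 in set1 {
        both[$0]++
        if ( --set1[$0] == 0 ) {
            delete set1[$0]
        }
        next
    }
    # Second file: no unmatched occurrence is left in the first file.
    {
        set2[$0]++
    }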
    END {
        for ( str in both ) {
            printf "Both: %s (%d)\n", str, both[str]
        }
        for ( str in set1 ) {
            printf "Set1: %s (%d)\n", str, set1[str]
        }
        for ( str in set2 ) {
            printf "Set2: %s (%d)\n", str, set2[str]
        }
    }
' file1 file2
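
If you only care about membership and not counts, a shorter variant of the same idea might look like the sketch below. It needs no sorting, which matches the point in the question that sorting isn't a prerequisite for set operations. The helper name set_minus_unsorted is made up for illustration, and the sketch assumes the path passed as the second argument contains no backslashes.

```shell
#!/bin/sh
# set_minus_unsorted is a hypothetical helper name, not from the answer above.
# Prints each distinct line of file $1 that does not occur in file $2,
# in first-seen order. Assumes the path in $2 contains no backslashes,
# because awk -v interprets escape sequences in the assigned value.
set_minus_unsorted() {
    awk -v other="$2" '
        BEGIN {
            # Load every line of the other file as a key of "seen".
            while ( (getline line < other) > 0 ) {
                seen[line]
            }
            close(other)
        }
        # Print a line only if it is absent from "seen" and not yet printed.
        !($0 in seen) && !printed[$0]++
    ' "$1"
}

# Demo with the sets from the question, deliberately unsorted.
tmpdir=$(mktemp -d)
printf '%s\n' g a d c f b > "$tmpdir/set1"
printf '%s\n' h e a c g b > "$tmpdir/set2"

diff12=$(set_minus_unsorted "$tmpdir/set1" "$tmpdir/set2" | paste -s -d ' ' -)
diff21=$(set_minus_unsorted "$tmpdir/set2" "$tmpdir/set1" | paste -s -d ' ' -)
echo "SET1 - SET2 = $diff12"    # prints: SET1 - SET2 = d f
echo "SET2 - SET1 = $diff21"    # prints: SET2 - SET1 = h e
rm -rf "$tmpdir"
```

Reading the second file in BEGIN rather than with the NR==FNR idiom keeps the result correct even when that file is empty.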