Compare two files using first column, print diff while appending column to output-CodePudding

I have two tab-delimited files that I want to compare.

I want to look for values of col1 in file1 that are missing in file2 (prepend with "left") and values in col1 in file2 that are missing in file1 (prepend with "joined"). For these lines, I want to print col1 and col8. My diff command fails for lines with an equal col1 and different values for col8.

file1:

Char1   55  Necromancer Knight  A   11/21/21    Zone    Char1(Main)     off off 0       Char1(Main) 
Char2   28  Druid   Squire  A   12/08/21    Zone    Char1       off off 0       Char1   
Char3   44  Enchanter   Recruit A   08/07/21    Zone    Char3(Main)     off off 0       Char3(Main) 
Char4   56  Enchanter   Knight  A   11/06/21    Zone    Char4(Main)     off off 0       Char4(Main) 
Char5   10  Magician    Recruit A   10/29/21    Zone    Char1       off off 0       Char1   
Char6   65  Druid   Champion    A   12/12/21    The Lair of the Splitpaw    Char6(Main) VT emp time     off off 0       Char6(Main) VT emp time

file2:

Char1   55  Necromancer Knight  A   11/21/21    Zone    Char1(Main)     off off 0       Char1(Main) 
Char2   28  Druid   Squire  A   12/08/21    Zone    Char1       off off 0       Char1   
Char3   44  Enchanter   Recruit A   08/07/21    Zone    Char3(Main)     off off 0       Char3(Main) 
Char4   56  Enchanter   Knight  A   11/06/21    Zone    Char4(Main)     off off 0       Char4(Main) 
Char5a  10  Magician    Recruit A   10/29/21    Zone    Char1       off off 0       Char1   
Char6   65  Druid   Champion    A   12/21/21    Zone    Char6(Main) Emp/VT/Time     off off 0       Char6(Main) Emp/VT/Time

diff command producing output:

diff --new-line-format="joined %L" --old-line-format="left %L" --unchanged-line-format="" <(cut -f1,8 "$file1" | sort) <(cut -f1,8 "$file2" | sort) | sort

Current output:

joined Char5a   Char1
joined Char6    Char6(Main) Emp/VT/Time
left Char5      Char1
left Char6      Char6(Main) VT emp time

Desired output:

joined Char5a   Char1
left Char5      Char1

Any help is greatly appreciated, thank you!

CodePudding user response：

Diff is for simple differences. A solution for your problem is easier with awk.
You do not need to sort the files, just read both files and count how often the keys in column 1 occur.
While counting the key values prepare the potential output. When NR==FNR you are in the first file and you want to use "left".
When both files are read, look for keys with only one occurance and print the prepared output for that line.

awk -F'\t' '
  {
    key[$1]  
    value[$1]=(NR==FNR ? "left " : "joined ") $1 "\t" $8
  }
  END { 
    for (i in key) { 
      if (key[i]==1) { 
        print value[i]
      }
    }
  }' file1 file2

CodePudding user response：

I am not really versed with awk.

But I think it would be someting like this:

awk -F' ' 'NR==FNR{x[$1]  ;next};x[$1] > 0' file1 file2

Checkout this post. There you'll find some more examples on txt file comparison

CodePudding user response：

Hope this python script gets what you are expecting :

#!/bin/bash

python -c '
import re, sys
def read_into_dict(regex, file):
    return { re.findall(regex, line)[0]: line for line in open(file) }
regex, file1, file2 = sys.argv[1:]
dict1 = read_into_dict(regex, file1)
dict2 = read_into_dict(regex, file2)
keys1 = set(dict1.keys())
keys2 = set(dict2.keys())
for key in (keys1 - keys2):
    print(f"left {dict1[key]}")
for key in (keys2 - keys1):
    print(f"joined {dict2[key]}")
' '^(\S )' <(cut -f1,8 "$file1" | sort) <(cut -f1,8 "$file2" | sort) | sort