Case and accent insensitive finding of similar lines in two files

(Updated input samples, code and error message)

I have two unquoted, single-column TSV files (exported from a database), each with a few thousand people's names, and I need to find the names that appear in both files. Both files are UTF-8 encoded, CRLF terminated, and start with the BOM 0xEF 0xBB 0xBF.

A simple join or comm command would have done the trick, but the names are not written the same way in both files:

# cat file1.tsv
A.  Einstein
Louis Pasteur  
Diego Armando Maradona
Isaac Newton
 Fräva Dona
D Rüge
Françoise Barré-Sinoussi

# cat file2.tsv
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava dona
Marie-Louise Von FRANZ
Dimitri Rüge

The expected matches in file2.tsv would be:

Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava dona
Dimitri Rüge

I've written this bash/sed/awk/grep script that dynamically generates a regex for matching the last names:

#!/bin/bash

# U+0300 = 0xCC80 = 52352
# U+033F = 0xCCBF = 52415
# U+0340 = 0xCD80 = 52608
# U+036E = 0xCDAE = 52654

_COMBINING_CHARS_=()

for i in {52352..52415} {52608..52654}
do
    hex=$(printf %X "$i")
    _COMBINING_CHARS_+=( "$(printf '\x'"${hex:0:2}"'\x'"${hex:2:2}")" )
done

_COMBINING_CHARS_ERE_=$(IFS='|'; printf %s "${_COMBINING_CHARS_[*]}")

# Function that removes the BOM, CRLF, and COMBINING characters:
sanitize() {
    LANG=C sed -E \
        -e $'1s/^\xEF\xBB\xBF//' \
        -e $'s/\r$//' \
        -e "s/$_COMBINING_CHARS_ERE_//g" \
    -- "$@"
}

# Function that generates a regex for the _lastname_:
toERE() {
    awk '
        {
            if ( $0 ~ /,/) {
                n = split($0, a, ",");
                $0 = a[n];
            } else {
                $0 = $NF
            }
            sub("^[[:space:]]+","");
            sub("[[:space:]]+$","");
            gsub("[[:space:]-]+"," ");
        }

        {
            ere = ""
            sep = "";
            for ( nf = 1; nf <= NF; nf++ ) {
                # put the separator between the words of the last name
                ere = ere sep
                n = split($nf, c, "");
                for ( i = 1; i <= n; i++ ) {
                    ere = ere "[[=" c[i] "=]]"
                }
                sep = "[[:space:]-]+"
            }
            print ere "[[:space:]]*$"
        }
    ' < <(sanitize "$@")
}

grep -E -f <(toERE "$1") <(sanitize "$2")
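
The byte values in the comments at the top of the script follow from the UTF-8 rule for two-byte sequences: the lead byte is 110xxxxx (the top five bits of the code point) and the continuation byte is 10xxxxxx (the low six bits). A minimal sketch of that arithmetic, where `utf8_2byte_hex` is a hypothetical helper name that is not part of the original script:

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): print the UTF-8 bytes,
# as four hex digits, of a code point in the two-byte range U+0080..U+07FF.
utf8_2byte_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 2047 ]; then
        echo "utf8_2byte_hex: code point outside the two-byte range" >&2
        return 1
    fi
    lead=$(( 0xC0 | (cp >> 6) ))     # 110xxxxx: top five bits
    cont=$(( 0x80 | (cp & 0x3F) ))   # 10xxxxxx: low six bits
    printf '%02X%02X\n' "$lead" "$cont"
}
```

Calling `utf8_2byte_hex 0x0300` prints CC80 and `utf8_2byte_hex 0x036E` prints CDAE, matching the comments. Note that the Unicode Combining Diacritical Marks block runs through U+036F (0xCDAF), one code point past the upper bound used in the script.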

While the code works in some of my test-cases, the result with the given input is:

./joinlastnames.sh file1.tsv file2.tsv
grep: illegal byte sequence

UTF-8 multibyte characters seem to be the problem but I can't think of a way to handle it with awk...
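
A possible explanation, assuming the awk in use works byte by byte (as awk does in the C locale): `split($nf, c, "")` cuts each multibyte character into its individual bytes, so the generated `[[=...=]]` classes contain lone bytes that are not valid UTF-8 on their own. The sketch below only demonstrates the splitting; `count_pieces` is a hypothetical helper name.

```shell
#!/bin/sh
# "Rüge" written with octal escapes so the example does not depend on the
# encoding of this file: R, 0xC3 0xBC (the two bytes of ü), g, e.
name=$(printf 'R\303\274ge')

# Hypothetical helper: count the pieces that split(s, a, "") yields when awk
# runs in the C locale, where every byte is treated as one character.
count_pieces() {
    printf '%s\n' "$1" | LC_ALL=C awk '{ print split($0, piece, "") }'
}

count_pieces "$name"   # four characters, but five pieces
```

The name has four characters, yet the split yields five pieces because ü is stored as two bytes.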

AL: Should I drop bash/awk and switch to another scripting language? The target OS is macOS, so a few choices are readily available, but I would like to keep the script self-contained so that the users (who don't know anything about the shell or programming) don't have to install anything but the script.

CodePudding user response：

How about agrep? From man agrep: "agrep - search a file for a string or regular expression, with approximate matching capabilities". It's not perfect, as we will see:

$ while IFS= read -r line
do 
    echo -n "$line: "
    agrep -B -y  "$line" file1
done < file2

Output:

Diego A. Maradona: agrep: 1 word matches within 6 errors
Maradona, Diego Armando
Albert Einstein: agrep: 1 word matches within 5 errors
A. Einstein
Louis Pasteur: Louis Pasteur
frava dona: agrep: 2 words match within 4 errors
Maradona, Diego Armando
Fräva Dona

It's a nice sample, because the last three lines already show a problem: frava dona also matches Maradona, Diego Armando.

CodePudding user response：

Suggesting the following trick:

 cat file1.tsv file2.tsv | sort | uniq -d

Explanation

cat file1.tsv file2.tsv concatenates both files, one after the other

sort puts identical lines next to each other

uniq -d prints only the lines that occur more than once
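
Note that uniq -d only reports lines that are byte-for-byte identical, so with the files from the question the BOM, the CRLF endings, stray spaces and letter case would all prevent a match. A minimal sketch of a normalization pass in front of the same trick follows; `normalize` and `common_lines` are hypothetical helper names, only ASCII letters are lowercased, and accents and differing first names are deliberately left unhandled.

```shell
#!/bin/sh
# Raw bytes of the UTF-8 BOM and of a carriage return, built with printf
# so that the sed expressions below stay portable.
BOM=$(printf '\357\273\277')
CR=$(printf '\r')

# Hypothetical helper: strip the BOM, the CR and surrounding whitespace,
# lowercase ASCII letters, and drop duplicates within the file.
normalize() {
    LC_ALL=C sed \
        -e "1s/^$BOM//" \
        -e "s/$CR\$//" \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        "$1" |
    LC_ALL=C tr '[:upper:]' '[:lower:]' |
    LC_ALL=C sort -u
}

# Hypothetical helper: print the normalized lines present in both files.
common_lines() {
    { normalize "$1"; normalize "$2"; } | LC_ALL=C sort | LC_ALL=C uniq -d
}
```

It would be called as `common_lines file1.tsv file2.tsv`. Because each file is deduplicated first, a line is only reported when it occurs in both files, not when it is repeated within one of them.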