Case and accent insensitive finding of similar lines in two files

Time:02-16

Updated input samples, code and error message

I have two unquoted, single-column TSV files (exported from a database) containing a few thousand people's names, and I need to find the names that appear in both files. Both files are UTF-8 encoded, CRLF-terminated, and start with the BOM 0xEF 0xBB 0xBF.

A simple join or comm command could have done the trick, but there are a few differences in the names:

# cat file1.tsv
A.  Einstein
Louis Pasteur  
Diego Armando Maradona
Isaac Newton
 Fräva Dona
D Rüge
Françoise Barré-Sinoussi
# cat file2.tsv
Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava dona
Marie-Louise Von FRANZ
Dimitri Rüge

The expected matches in file2.tsv would be:

Diego Maradona
Albert Einstein
Francoise, BARRE SINOUSSI
Louis Pasteur
frava dona
Dimitri Rüge

I wrote this bash/sed/awk/grep script that dynamically generates a regex for matching the last names:

#!/bin/bash

# U+0300 = 0xCC80 = 52352
# U+033F = 0xCCBF = 52415
# U+0340 = 0xCD80 = 52608
# U+036E = 0xCDAE = 52654

_COMBINING_CHARS_=()

for i in {52352..52415} {52608..52654}
do
    hex=$(printf %X "$i")
    _COMBINING_CHARS_+=( "$(printf '\x'"${hex:0:2}"'\x'"${hex:2:2}")" )
done

_COMBINING_CHARS_ERE_=$(IFS='|'; printf %s "${_COMBINING_CHARS_[*]}")
# Function that removes the BOM, CRLF, and COMBINING characters:
sanitize() {
    LANG=C sed -E \
        -e $'1s/^\xEF\xBB\xBF//' \
        -e $'s/\r$//' \
        -e "s/$_COMBINING_CHARS_ERE_//g" \
    -- "$@"
}
# Function that generates a regex for the _lastname_:
toERE() {
    awk '
        {
            if ( $0 ~ /,/) {
                n = split($0, a, ",");
                $0 = a[n];
            } else {
                $0 = $NF
            }
            sub("^[[:space:]]+","");
            sub("[[:space:]]+$","");
            gsub("[[:space:]-]+"," ");
        }

        {
            ere = ""
            sep = "";
            for ( nf = 1; nf <= NF; nf++ ) {
                ere = ere sep    # separator goes between name parts
                n = split($nf, c, "");
                for ( i = 1; i <= n; i++ ) {
                    ere = ere "[[=" c[i] "=]]"
                }
                sep = "[[:space:]-]+"
            }
            print ere "[[:space:]]*$"
        }
    ' < <(sanitize "$@")
}
grep -E -f <(toERE "$1") <(sanitize "$2")

While the code works in some of my test-cases, the result with the given input is:

./joinlastnames.sh file1.tsv file2.tsv
grep: illegal byte sequence

UTF-8 multibyte characters seem to be the problem, but I can't think of a way to handle them with awk...
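The byte-versus-character mismatch can be reproduced in isolation. The following is a minimal sketch, assuming an awk that counts bytes when forced into the C locale; byte_len is only an illustrative name and is not part of the script above. When a name is split this way, an accented letter arrives as two separate bytes, and wrapping each one in [[=...=]] would plausibly give grep a pattern that is not valid UTF-8:

```shell
#!/bin/sh
# byte_len is a hypothetical helper: it reports how many "characters"
# awk sees in its argument when every byte counts as one character.
byte_len() {
    printf '%s\n' "$1" | LC_ALL=C awk '{ print length($0) }'
}

byte_len 'a'    # plain ASCII letter: 1
byte_len 'ä'    # U+00E4 is encoded as 0xC3 0xA4 in UTF-8: 2
```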


AL: Should I drop bash/awk and switch to another scripting language? The target OS is macOS, so there are a few choices readily available, but I would like to keep it self-contained so that the users (who don't know anything about the shell or programming) don't have to install anything but the script.

CodePudding user response:

How about agrep? From man agrep: "agrep - search a file for a string or regular expression, with approximate matching capabilities". It's not perfect, as we will see:

$ while IFS= read -r line
do 
    echo -n "$line: "
    agrep -B -y  "$line" file1
done < file2

Output:

Diego A. Maradona: agrep: 1 word matches within 6 errors
Maradona, Diego Armando
Albert Einstein: agrep: 1 word matches within 5 errors
A. Einstein
Louis Pasteur: Louis Pasteur
frava dona: agrep: 2 words match within 4 errors
Maradona, Diego Armando
Fräva Dona

It's a nice sample, as we can already see a problem in the last three lines.

CodePudding user response:

Suggesting the following trick:

 cat file1.tsv file2.tsv | sort | uniq -d

Explanation

cat file1.tsv file2.tsv combines both files, one after the other

sort puts identical lines next to each other

uniq -d prints only the lines that have duplicates
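Note that uniq -d only reports lines that are identical byte for byte, so the names would first need to be reduced to a common form. Here is a minimal sketch of that idea, assuming that comparing a folded last-name key is good enough. similar_lines, fold and key are hypothetical names, the last-name rule (text after the last comma, else the last word) is borrowed from the question, and only the precomposed accented letters listed are folded (combining marks are not handled):

```shell
#!/bin/sh
# Sketch only: print the lines of file $2 whose folded last-name key
# also occurs in file $1.
similar_lines() {
    LC_ALL=C awk '
        # Fold a limited set of accented letters to ASCII, then lowercase.
        # In the C locale each alternative is matched as a plain byte pair.
        function fold(s) {
            gsub(/À|Á|Â|Ã|Ä|Å|à|á|â|ã|ä|å/, "a", s)
            gsub(/Ç|ç/, "c", s)
            gsub(/È|É|Ê|Ë|è|é|ê|ë/, "e", s)
            gsub(/Ì|Í|Î|Ï|ì|í|î|ï/, "i", s)
            gsub(/Ñ|ñ/, "n", s)
            gsub(/Ò|Ó|Ô|Õ|Ö|ò|ó|ô|õ|ö/, "o", s)
            gsub(/Ù|Ú|Û|Ü|ù|ú|û|ü/, "u", s)
            return tolower(s)
        }
        # Last name: text after the last comma if there is one,
        # otherwise the last whitespace-separated word.
        function key(line,    parts, n, name) {
            n = split(line, parts, ",")
            name = parts[n]
            sub(/^[ \t]+/, "", name); sub(/[ \t]+$/, "", name)
            if (n == 1) sub(/^.*[ \t]/, "", name)
            gsub(/[ \t-]+/, " ", name)     # "Barre-Sinoussi" ~ "Barre Sinoussi"
            return fold(name)
        }
        # Strip the CR of CRLF and, on the first line of each file, the BOM.
        { sub(/\r$/, ""); if (FNR == 1) sub(/^\357\273\277/, "") }
        NR == FNR { k = key($0); if (k != "") seen[k] = 1; next }
        (key($0) in seen) { print }
    ' "$1" "$2"
}

# Usage: ./script.sh file1.tsv file2.tsv
if [ "$#" -eq 2 ]; then
    similar_lines "$1" "$2"
fi
```

Matching on the last name alone is deliberately loose: two different people sharing a last name would both be reported, so the output is better treated as a list of candidates to review.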
