Merge multiple columns from different files with a partial match via awk-CodePudding

I have two files, A and B with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files keeping file B the same, add a new column based on the same fields between the two files and a partial match between column 1 of file A and column 2 of file B.

By partial match I mean something like this:

File A (column 1)	File B (column 2)	A=B?
A	A?	True
A	Asd	True
B	B	True
C	c	True
C	CA	True
D	A	False

If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;

File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e

File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D

Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\

How can I do this with awk?

CodePudding user response：

An awk script might work depending on the details of your requirements (e.g. of the keys are case sensitive or not).

An awk script like this might work:

function make_key(k) 
{
    # either return k or an uppercase version for case-insensitive keys
    # return k;
    return toupper(k);
}

BEGIN { 
    FS="\\";
}

NR==FNR {
    key=make_key($2 "\\" $1);
    if( key in notes){
        notes[key]=notes[key] ";" $3
    }
    else {
        notes[key]=$3
    }
        
}

NR!=FNR {
    for(k in notes){
         pos=index(make_key($0),k);
         if(pos==1){
            printf "%s%s%s%s\n",  $0, FS, notes[k], FS;
            next;
         }
    }
    print $0 FS;
}

You would use it like this:

awk -f script.awk file_A file_B

In the function make_key you can configure the case sensitiveness by either returning k or an uppercase version.

The NF==FNR block is used during reading the first file (file_A) here the notes are stored under a key made out of the second and first column. Notes are appended if the key is already existing.

In the NF!=FNR block the second file (file_B) is read. Here we print the line of file_b and the matching notes by comparing every key if the line of file_B starts with the key. If no key matches, then only the line of file_B is printed.

The BEGIN block just sets up the field separator.

CodePudding user response：

To process the first file, we could use condition NR==FNR. During the first file processing, I used two hash tables, namely pattern_ht to store column 1 values in lower case and note_ht to store column 3 values. Since date string itself is not unique to serve as key, we will use the concatenation of columns 2 and 3 as key for both hash tables.

When processing the second file, firstly check if date string matches column 1 exactly after performing a split of the key string. If matched, perform the partial matching of column 2 to pattern_ht values after converting to lower case. If it matches and this is the first match, record the third_col value. If it already has another value, append to it with ";". Finally, display accordingly:

awk -F'\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next} 
{third_col="";
for (key in pattern_ht) {
        split(key,date_str,";");
        if (date_str[1] == $1) {
                if (tolower($2) ~ pattern_ht[key]) { 
                        if (length(third_col) == 0) 
                                third_col = note_ht[key] 
                        else 
                                third_col = third_col ";" note_ht[key]
                }
        }
}
if (length(third_col) != 0)
        print $1"\\"$2"\\"third_col"\\"
else
        print $1"\\"$2"\\"
}' fileA fileB

Here is the output:

2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\