Home > Back-end >  Merge multiple columns from different files with a partial match via awk
Merge multiple columns from different files with a partial match via awk

Time:10-24

I have two files, A and B with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files keeping file B the same, add a new column based on the same fields between the two files and a partial match between column 1 of file A and column 2 of file B.

By partial match I mean something like this:

File A (column 1) File B (column 2) A=B?
A A? True
A Asd True
B B True
C c True
C CA True
D A False

If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;

File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e

File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D

Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\

How can I do this with awk?

CodePudding user response:

An awk script might work depending on the details of your requirements (e.g. of the keys are case sensitive or not).

An awk script like this might work:

function make_key(k) 
{
    # either return k or an uppercase version for case-insensitive keys
    # return k;
    return toupper(k);
}

BEGIN { 
    FS="\\";
}

NR==FNR {
    key=make_key($2 "\\" $1);
    if( key in notes){
        notes[key]=notes[key] ";" $3
    }
    else {
        notes[key]=$3
    }
        
}

NR!=FNR {
    for(k in notes){
         pos=index(make_key($0),k);
         if(pos==1){
            printf "%s%s%s%s\n",  $0, FS, notes[k], FS;
            next;
         }
    }
    print $0 FS;
}

You would use it like this:

awk -f script.awk file_A file_B

In the function make_key you can configure the case sensitiveness by either returning k or an uppercase version.

The NF==FNR block is used during reading the first file (file_A) here the notes are stored under a key made out of the second and first column. Notes are appended if the key is already existing.

In the NF!=FNR block the second file (file_B) is read. Here we print the line of file_b and the matching notes by comparing every key if the line of file_B starts with the key. If no key matches, then only the line of file_B is printed.

The BEGIN block just sets up the field separator.

CodePudding user response:

To process the first file, we could use condition NR==FNR. During the first file processing, I used two hash tables, namely pattern_ht to store column 1 values in lower case and note_ht to store column 3 values. Since date string itself is not unique to serve as key, we will use the concatenation of columns 2 and 3 as key for both hash tables.

When processing the second file, firstly check if date string matches column 1 exactly after performing a split of the key string. If matched, perform the partial matching of column 2 to pattern_ht values after converting to lower case. If it matches and this is the first match, record the third_col value. If it already has another value, append to it with ";". Finally, display accordingly:

awk -F'\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next} 
{third_col="";
for (key in pattern_ht) {
        split(key,date_str,";");
        if (date_str[1] == $1) {
                if (tolower($2) ~ pattern_ht[key]) { 
                        if (length(third_col) == 0) 
                                third_col = note_ht[key] 
                        else 
                                third_col = third_col ";" note_ht[key]
                }
        }
}
if (length(third_col) != 0)
        print $1"\\"$2"\\"third_col"\\"
else
        print $1"\\"$2"\\"
}' fileA fileB

Here is the output:

2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
  • Related