I have two files, A and B with the columns separated by \.
Column 2 of file A is exactly the same as column 1 of file B.
I want to merge these two files keeping file B the same, add a new column based on the same fields between the two files and a partial match between column 1 of file A and column 2 of file B.
By partial match I mean something like this:
File A (column 1) | File B (column 2) | A=B? |
---|---|---|
A | A? | True |
A | Asd | True |
B | B | True |
C | c | True |
C | CA | True |
D | A | False |
If there are values with the same column 1 and 2 in file A, they must be added to file B separated by ;
File A
A\2022.10.10\note a
A\2022.10.10\note b
B\2022.10.14\note c
A\2022.10.14\note d
C\2022.10.15\note e
File B
2022.10.10\A?
2022.10.14\B?
2022.10.14\a
2022.10.15\C
2022.10.15\D
Desired output
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\
How can I do this with awk?
CodePudding user response:
An awk script might work depending on the details of your requirements (e.g. of the keys are case sensitive or not).
An awk script like this might work:
function make_key(k)
{
# either return k or an uppercase version for case-insensitive keys
# return k;
return toupper(k);
}
BEGIN {
FS="\\";
}
NR==FNR {
key=make_key($2 "\\" $1);
if( key in notes){
notes[key]=notes[key] ";" $3
}
else {
notes[key]=$3
}
}
NR!=FNR {
for(k in notes){
pos=index(make_key($0),k);
if(pos==1){
printf "%s%s%s%s\n", $0, FS, notes[k], FS;
next;
}
}
print $0 FS;
}
You would use it like this:
awk -f script.awk file_A file_B
In the function make_key you can configure the case sensitiveness by either returning k or an uppercase version.
The NF==FNR
block is used during reading the first file (file_A) here the notes are stored under a key made out of the second and first column. Notes are appended if the key is already existing.
In the NF!=FNR
block the second file (file_B) is read. Here we print the line of file_b and the matching notes by comparing every key if the line of file_B starts with the key. If no key matches, then only the line of file_B is printed.
The BEGIN
block just sets up the field separator.
CodePudding user response:
To process the first file, we could use condition NR==FNR
. During the first file processing, I used two hash tables, namely pattern_ht
to store column 1 values in lower case and note_ht
to store column 3 values. Since date string itself is not unique to serve as key, we will use the concatenation of columns 2 and 3 as key for both hash tables.
When processing the second file, firstly check if date string matches column 1 exactly after performing a split of the key string. If matched, perform the partial matching of column 2 to pattern_ht
values after converting to lower case. If it matches and this is the first match, record the third_col
value. If it already has another value, append to it with ";". Finally, display accordingly:
awk -F'\' 'NR==FNR {pattern_ht[$2";"$3] = tolower($1); note_ht[$2";"$3] = $3; next}
{third_col="";
for (key in pattern_ht) {
split(key,date_str,";");
if (date_str[1] == $1) {
if (tolower($2) ~ pattern_ht[key]) {
if (length(third_col) == 0)
third_col = note_ht[key]
else
third_col = third_col ";" note_ht[key]
}
}
}
if (length(third_col) != 0)
print $1"\\"$2"\\"third_col"\\"
else
print $1"\\"$2"\\"
}' fileA fileB
Here is the output:
2022.10.10\A?\note a;note b\
2022.10.14\B?\note c\
2022.10.14\a\note d\
2022.10.15\C\note e\
2022.10.15\D\