I need to compare two input files (file1.csv and file2.csv) and store the results and a third file (file3.csv).
The conditions are as follows:
- If entry matches file1 and file2, store it on file3
- If entry is found on file1, store it on file3
It looks like this:
input: file1.csv
"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"
input: file2.csv
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"
results: file3.csv
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"switch6"
"switch10"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
I have tried lots of combinations using 'awk'; however I cannot construct the conditionals.
I was wondering if someone could help me construct the conditional(s) to build this results file3.csv.
Assistance is much appreciated.
CodePudding user response:
I would use GNU AWK
for this task following way, let file1.txt
content be
"switch10"
"switch33"
"router51"
"switch6"
"router44"
"router12"
"switch2"
and file2.txt
content be
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
"router44";"DatacenterC - cab2";"Test - Tenant2"
then
awk 'BEGIN{FS=";"}(NR==FNR){arr[$1]=$0}((NR!=FNR)&&($1 in arr)){arr[$1]=$0}END{for(i in arr){print arr[i]}}' file1.txt file2.txt
output
"switch6"
"router12"
"router44";"DatacenterC - cab2";"Test - Tenant2"
"router51";"DatacenterA - cab1";"Prod - Tenant12"
"switch33";"DatacenterB - cab14";"Prod - Tenant4"
"switch10"
"switch2";"DatacenterA - cab3";"Dev - Tenant5"
Disclaimer: I assume that you are able to accept any order of lines in output file. Explanation: I inform GNU AWK
that ;
is used as field separator (FS
) then when processing first file (NR==FNR
) I do fill array arr
by putting whole current line ($0
) under key which is content of first field ($1
). When processing next files (NR!=FNR
) if there is value corresponding to first field of line in array arr
then I update its value to whole current line. After all files are processed I do print
all values from array arr
using for
.
(tested in gawk 4.2.1)
CodePudding user response:
Similar idea as @Daweo but:
- use the file names provided
- hard-code file names in script instead of arguments as they are not interchangeable
- yield result in requested order
You would save the script as task
, chmod 755 task
then run it as ./task > file3.csv
:
#!/usr/bin/env -S awk -F; -f
BEGIN {
while ( getline <"file2.csv" ) {
lookup[$1] = $0
}
while ( "sort -k1.1,1.7 -k1.8n file1.csv" | getline ) {
print ($0 in lookup) ? lookup[$0] : $0
}
}
Note: sort used here is fragile as it relies on fixed size prefix (router or switch) to identify the two parts of the compound key.
Using gawk's PROCINFO["sorted_in"] feature to return keys in a specify order and using a custom compare function str_num_cmp()
makes it a bit more robust:
#!/usr/bin/env -S awk -F; -f
function str_num_to_a(i, a) {
match(i, /"([^0-9]*)([0-9] )"/, a)
}
function str_num_cmp(i1, v1, i2, v2) {
str_num_to_a(i1, a1)
str_num_to_a(i2, a2)
if(a1[1] < a2[1])
return -1
if(a1[1] == a2[1]) {
if(a1[2] < a2[2])
return -1
if(a1[2] == a2[2])
return 0
}
return 1
}
BEGIN {
while ( getline <"file1.csv" ) {
keys[$1] = ""
}
while ( getline <"file2.csv" ) {
lookup[$1] = $0
}
PROCINFO["sorted_in"] = "str_num_cmp"
for (k in keys) {
print (k in lookup) ? lookup[k] : k
}
}