I am meeting some difficulties when using AWK to search a huge csv file (this can be called file1). fortunately, I have a list file (this can be called file2). I can just search the rows that I need basing on the index list file in file2. however, the file1 is not like any other normal file, it's like:
ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID3, P01723;F08734;,ID3_name
ID4, AC0014;AC0114;P01112;,ID4_name
...
IDn, AC0006;,IDn_name
IDm, Ac8007; P01167;,IDm_name
the index file2 like:
AC000112
AC000801
P01112
P01167
the desired output should be:
ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name
if I use
awk -F, 'NR==FNR{a[$1]; next} ($2 in a)' file2 file1
I will get nothing, if I add ";" at the end of each line in file2, I will only get ID2, AC000801;,ID2_name
. and if I change $2 ~ a[$1]
, it still didn't work.
So, I was wondering how to change this command to get the desired result. Thanks!
CodePudding user response:
You could set the field separator to a comma followed by optional spaces [[:space:]]*,[[:space:]]*
Then you can split the second field of file 1 on a semicolon and optional spaces [[:space:]]*;[[:space:]]*
and check if one of those is present in a
awk -F"[[:space:]]*,[[:space:]]*" 'NR==FNR{
a[$1]; next
}
{
split($2, parts, /[[:space:]]*;[[:space:]]*/)
for (i in parts) {
if (parts[i] in a) {
print $0; break;
}
}
}
' file2 file1
Output
ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name
CodePudding user response:
Assumptions:
- the search strings only consist of characters and numbers
One awk
idea where we append word boundary flags to our search patterns and then perform regex comparisons:
awk -F',' '
FNR==NR { regs["\\<" $1 "\\>"]; next }
{ for (regex in regs)
if ($2 ~ regex) { print; next }
}
' file2 file1
This generates:
ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name
CodePudding user response:
With your shown samples, please try following awk
code.
awk -F',|[[:space:]] |;' '
FNR==NR{
for(i=2;i<=NF;i ){
arr[$i]=$0
}
next
}
($0 in arr){
print arr[$0]
}
' file1 file2
Explanation: Adding detailed explanation for above code.
awk -F',|[[:space:]] |;' ' ##Setting field separator as comma, space(s), semi-colon here.
FNR==NR{ ##This condition will be TRUE when file1 is being read.
for(i=2;i<=NF;i ){ ##Using for loop to traverse from 2nd field to till last field.
arr[$i]=$0 ##Creating arr with index of current field, with value of current line.
}
next ##next will skip all further lines from here.
}
($0 in arr){ ##Checking condition if current line is present in arr.
print arr[$0] ##Printing arr with index of $0 here.
}
' file1 file2 ##Mentioning Input_file names here.