awk search content if it contains content in a list file-CodePudding

I am meeting some difficulties when using AWK to search a huge csv file (this can be called file1). fortunately, I have a list file (this can be called file2). I can just search the rows that I need basing on the index list file in file2. however, the file1 is not like any other normal file, it's like:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID3, P01723;F08734;,ID3_name
ID4, AC0014;AC0114;P01112;,ID4_name
...
IDn, AC0006;,IDn_name
IDm, Ac8007; P01167;,IDm_name

the index file2 like:

the desired output should be:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

if I use

awk -F, 'NR==FNR{a[$1]; next} ($2 in a)' file2 file1

I will get nothing, if I add ";" at the end of each line in file2, I will only get ID2, AC000801;,ID2_name. and if I change $2 ~ a[$1], it still didn't work.

So, I was wondering how to change this command to get the desired result. Thanks!

CodePudding user response：

You could set the field separator to a comma followed by optional spaces [[:space:]]*,[[:space:]]*

Then you can split the second field of file 1 on a semicolon and optional spaces [[:space:]]*;[[:space:]]* and check if one of those is present in a

awk -F"[[:space:]]*,[[:space:]]*" 'NR==FNR{
  a[$1]; next
}
{
  split($2, parts, /[[:space:]]*;[[:space:]]*/)
  for (i in parts) {
    if (parts[i] in a) {
      print $0; break;
    }
  } 
}
' file2 file1

Output

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

CodePudding user response：

Assumptions:

the search strings only consist of characters and numbers

One awk idea where we append word boundary flags to our search patterns and then perform regex comparisons:

awk -F',' '
FNR==NR { regs["\\<" $1 "\\>"]; next }
        { for (regex in regs)
              if ($2 ~ regex) { print; next }
        }
' file2 file1

This generates:

ID1, AC000112;AC000634;B0087;P01116;,ID1_name
ID2, AC000801;,ID2_name
ID4, AC0014;AC0114;P01112;,ID4_name
IDm, Ac8007; P01167;,IDm_name

CodePudding user response：

With your shown samples, please try following awk code.

awk -F',|[[:space:]] |;' '
FNR==NR{
  for(i=2;i<=NF;i  ){
    arr[$i]=$0
  }
  next
}
($0 in arr){
  print arr[$0]
}
' file1 file2

Explanation: Adding detailed explanation for above code.

awk -F',|[[:space:]] |;' '  ##Setting field separator as comma, space(s), semi-colon here.
FNR==NR{                    ##This condition will be TRUE when file1 is being read.
  for(i=2;i<=NF;i  ){       ##Using for loop to traverse from 2nd field to till last field.
    arr[$i]=$0              ##Creating arr with index of current field, with value of current line.
  }
  next                      ##next will skip all further lines from here.
}
($0 in arr){                ##Checking condition if current line is present in arr.
  print arr[$0]             ##Printing arr with index of $0 here.
}
' file1 file2               ##Mentioning Input_file names here.