Extracting multiple strings and combining to .csv in bash

Time:07-28

I have a long list of IDs I need to parse. I want to extract three pieces of information from each line and write them to a 3-column CSV. Column 1 is the field between the first two pipes (the XXXX in tr|XXXX|), and column 3 is the text after the second | but before OS=.

Column 2 is conditional. If the line contains 'GN=XXX', it should be XXX. If GN= isn't present, it should be the first section of column 3 (i.e. everything up to the first space).

Input:

>tr|I1WXP1|I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment) OS=uncultured euryarchaeote OX=114243 GN=mcrA PE=4 SV=1
>tr|A0A059VAR9|A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment) OS=Halorubrum sp. Ga66 OX=1480727 GN=atpB PE=3 SV=1
>tr|Q51760|Q51760_9EURY Glutaredoxin-like protein OS=Pyrococcus furiosus OX=2261 PE=1 SV=1

Desired output:

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein

I can get the first two with awk, e.g.:

awk '{split($0,a,"|"); print a[2]}'

But I can't work out the conditional, or how to act on the 'GN=' pattern neatly.

So for example, extracting bold text:

tr|**I1WXP1**|**I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)** OS=uncultured euryarchaeote OX=114243 GN=**mcrA** PE=4 SV=1

Becomes:

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)

Answer:

With your shown samples, please try the following awk code, written and tested in GNU awk. It prints comma-separated values, which you can redirect into a new file.

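awk -F'^>tr\\|| OS=' '
BEGIN{ OFS="," }
NF>=2{
  # $2 is "ID|description"; turn the pipe into a comma
  gsub(/\|/,OFS,$2)
  # capture the gene name after GN= (3-argument match() is GNU awk only)
  match($0,/GN=(\S+)/,arr1)
  if(!arr1[1]){
    # no GN= field: fall back to the first word of the description
    split($2,arr2,"[, ]")
  }
  printf("%s%s\n",$2,arr1[1]?OFS arr1[1]:OFS arr2[2]"")
}
' Input_file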
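
The code above depends on GNU awk and prints the gene name as the last column. Below is a minimal portable sketch, not part of the original answer, that uses only POSIX awk features (split, index, and two-argument match with RSTART/RLENGTH). It emits the columns in the order the question asks for: ID, gene, description. The function name extract_csv is hypothetical, and the sketch assumes every header starts with ">tr|" and that the description contains no "|" or "," characters.

```shell
# Hypothetical helper: reads FASTA-style header lines on stdin, writes CSV on stdout.
extract_csv() {
  awk '
    /^>tr\|/ {
      # Split on "|": parts[2] is the ID, parts[3] is everything after the second pipe.
      split($0, parts, "|")
      id = parts[2]

      # Column 3: the text before " OS=" (the whole remainder if OS= is absent).
      desc = parts[3]
      pos = index(desc, " OS=")
      if (pos) desc = substr(desc, 1, pos - 1)

      # Column 2: the value after GN= if present, otherwise the first word of column 3.
      if (match($0, / GN=[^ ]+/)) {
        gene = substr($0, RSTART + 4, RLENGTH - 4)
      } else {
        gene = desc
        sub(/ .*/, "", gene)
      }

      print id "," gene "," desc
    }
  '
}
```

It could then be used as extract_csv < Input_file > output.csv, where output.csv is a placeholder name. If descriptions may contain commas, the third column would need quoting, which this sketch does not attempt.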