grep/sed/awk: parsing a text file to print multiple rows after a pattern matches and convert them to one row

I want to use grep/awk/sed to parse a text file containing various descriptions of several genes, so that each row of the output holds one gene description.

Right now I want to extract the Automated and Concise descriptions into separate text files, with each row holding the description for a single gene.

Download file

wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz

I have been able to extract the desired text into individual text files using the code below. However, I am unable to output each description as a single row.

awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt

# do the same for the next section, the Automated description

awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt

#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt

Can someone help?

Current text structure for 1 gene description

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions.

Desired text structure for 1 gene description

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.

Thank you, Jose.

CodePudding user response：

A few small changes to OP's current awk code:

awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }                # close out the current printf line of output
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' file

NOTE: OP's grep -v appears to be there because OP's awk prints the closing "Automated description" line before clearing the flag. The script above clears the flag before the print rule runs, so that line is never printed and no grep -v is needed.

For the small sample provided this generates:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide  3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.
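The doubled spaces in the output above come from the trailing space that ends each wrapped line of the sample input. A minimal sketch, assuming wrapped lines may end in spaces or tabs, strips that whitespace before joining. The scratch file and the one-line Automated placeholder are made up for illustration; the two Concise lines are copied from the question's sample.

```shell
# Scratch input: two wrapped lines from the sample in the question (each ends
# in a space) followed by a made-up Automated line that closes the block.
sample=$(mktemp)
printf '%s\n' \
  'Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide ' \
  '3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan ' \
  'Automated description: placeholder text' > "$sample"

joined=$(awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { if (flag) print ""; flag=0 }   # end the joined line
flag                    { sub(/[ \t]+$/, "")             # drop trailing blanks
                          printf "%s%s", pfx, $0; pfx=" " }
END                     { if (flag) print "" }           # input ended mid-block
' "$sample")

printf '%s\n' "$joined"
rm -f "$sample"
```

The `if (flag)` guard also avoids printing an empty row when an Automated line shows up without a preceding Concise block.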

Assumptions:

OP needs to parse the input file twice (for two different blocks of text)
the two different blocks of text do not overlap
there may be multiple blocks of Concise or Automated text in the input file, and all input is to be routed to one of two output files

We could consolidate OP's current 2x awk scripts into one, eg:

awk '
function close_line()    { if (outfile) print "" > outfile }      # close out prior printf line of output?

/Concise description:/   { close_line()
                           outfile="WB283_concise.txt"
                           pfx=""
                         }
/Automated description:/ { close_line()
                           outfile="WB283_automated.txt"
                           pfx=""
                         }
/Gene class description/ { close_line()
                           outfile=""
                         }
outfile                  { printf "%s%s", pfx, $0 > outfile
                           pfx=" "
                         }
END                      { close_line() }
' file
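A quick way to check the consolidated script is to count rows: each output file should end up with one row per labelled block. The sketch below runs the same rules (compacted onto single lines) on a made-up two-gene input in a scratch directory; the input text and the name `sample.txt` are hypothetical.

```shell
# Work in a scratch directory so the output files do not clobber real ones.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-gene input that reuses the section labels from the question.
cat > sample.txt <<'EOF'
Concise description: first gene, line one
line two
Automated description: first gene automated text
Gene class description: not known
Concise description: second gene, single line
Automated description: second gene automated,
wrapped once
Gene class description: not known
EOF

awk '
function close_line()    { if (outfile) print "" > outfile }
/Concise description:/   { close_line(); outfile="WB283_concise.txt";   pfx="" }
/Automated description:/ { close_line(); outfile="WB283_automated.txt"; pfx="" }
/Gene class description/ { close_line(); outfile="" }
outfile                  { printf "%s%s", pfx, $0 > outfile; pfx=" " }
END                      { close_line() }
' sample.txt

# One row per labelled block is expected in each output file.
expected=$(grep -c 'Concise description:' sample.txt)
concise_rows=$(awk 'END { print NR }' WB283_concise.txt)
automated_rows=$(awk 'END { print NR }' WB283_automated.txt)
echo "expected=$expected concise=$concise_rows automated=$automated_rows"
```

If the three numbers differ on the real file, some block is probably missing one of the labels the rules rely on.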

CodePudding user response：

May I suggest a slightly modified solution (not exactly what is asked for, but with potentially useful thoughts):

awk '
/WBGene/              { printf("\n%s: ", $2) }
/Concise description/ { flag = 1; $1=$2="" }
/=/                   { flag = 0 }
/^.* description/     { flag = 0 }
flag                  { printf " %s", $0 }
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt

The idea is to filter out the string "Concise description", since that is what we are looking for in any case. The name of the gene is printed in the first column, as many concise descriptions don't include the name.

Output format is a single line for each gene, starting with its name and a colon, followed by the "pure" concise description.

By the way: if you want to create a second output, with the "Automated description" in each line, change the second awk-line from /Concise description/ to /Automated description/
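Instead of editing the script for each section, the label can be passed in as a variable. This is only a sketch under assumptions taken from the awk code above: gene header lines start with WBGene and carry the gene name in the second field, and a line starting with "=" ends each record. The function name `extract_section` and the sample input are made up for illustration.

```shell
# extract_section LABEL FILE
# Prints one "name: text" row per gene for the section named LABEL.
extract_section() {
  awk -v want="$1 description:" '
  /^WBGene/ { if (seen) printf "\n"          # finish the previous gene row
              seen = 1; flag = 0
              printf "%s:", $2; next }       # assumed: gene name is field 2
  /^=/      { flag = 0; next }               # assumed: "=" separates records
  /^[A-Za-z ]+ description:/ {
              flag = (index($0, want) == 1)  # only the requested section
              if (!flag) next
              $0 = substr($0, length(want) + 1) }   # drop the label itself
  flag      { gsub(/^[ \t]+|[ \t]+$/, "")
              if ($0 != "") printf " %s", $0 }
  END       { if (seen) printf "\n" }
  ' "$2"
}

# Hypothetical two-gene input in the assumed layout.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WBGeneX1 gene-1 SEQ1
Concise description: first line
second line
Automated description: automated text
Gene class description: not known
=
WBGeneX2 gene-2 SEQ2
Concise description: only line
Automated description: more automated
text
Gene class description: not known
=
EOF

concise=$(extract_section Concise "$sample")
automated=$(extract_section Automated "$sample")
printf '%s\n' "$concise" "$automated"
rm -f "$sample"
```

Anchoring the patterns at the start of the line keeps an "=" or the word "description" inside the running text from ending a block early, although a wrapped line that itself begins with a label-like phrase would still be misread.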