I want to use grep/awk/sed to parse a text file containing various descriptions of several genes, so that each row represents one gene description.
Right now I want to extract the Automated and Concise descriptions into separate text files, with each row holding the description for a single gene.
Download the file:
wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
I have been able to extract the desired text into individual text files using the code below. However, I am unable to output each description as a single row.
awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt
#do this for the next section automated description
awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt
#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt
Can someone help?
Current text structure for one gene description:
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Desired text structure for one gene description:
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
Thank you, Jose.
CodePudding user response:
A few small changes to OP's current awk code:
awk '
/Concise description:/ { flag=1; pfx="" }
/Automated description/ { flag=0; print "" } # close out current printf line of output
flag { printf "%s%s",pfx,$0; pfx=" " } # assuming appended lines are separated by a single space
' file
NOTE: I'm not sure I understand OP's current use of grep -v, since we don't have a sample set of input that demonstrates the need for it.
For the small sample provided this generates:
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
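On the grep -v question: one likely reason is rule order. In OP's one-liner the bare flag rule runs before the end marker clears the flag, so the end-marker line itself is printed and has to be filtered out afterwards. Clearing the flag first, as in the script above, avoids that. A minimal sketch on a made-up three-line sample (sample.txt, its contents, and the variable names op_order and fixed_order are hypothetical):

```shell
# Hypothetical sample; the real file contains more line types.
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  'Concise description: first line' \
  'second line' \
  'Automated description: other text' > sample.txt

# OP's rule order: print first, then clear the flag on the end marker.
# The "Automated description" line is still printed.
op_order=$(awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' sample.txt)

# Clear the flag before the print rule: the end-marker line is skipped.
fixed_order=$(awk '/Concise description:/{flag=1} /Automated description/{flag=0} flag' sample.txt)

printf '%s\n---\n%s\n' "$op_order" "$fixed_order"
```

With the second ordering no grep -v pass should be needed for this sample.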
Assumptions:
- OP needs to parse the input file twice (for two different blocks of text)
- the two different blocks of text do not overlap
- there may be multiple blocks of Concise or Automated text in the input file, and each block is to be routed to one of two output files
We could consolidate OP's current 2x awk scripts into one, e.g.:
awk '
function close_line() { if (outfile) print "" > outfile } # close out prior printf line of output?
/Concise description:/ { close_line()
outfile="WB283_concise.txt"
pfx=""
}
/Automated description:/ { close_line()
outfile="WB283_automated.txt"
pfx=""
}
/Gene class description/ { close_line()
outfile=""
}
outfile { printf "%s%s", pfx, $0 > outfile
pfx=" "
}
END { close_line() }
' file
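A quick way to sanity-check the consolidated script is to run it on a tiny sample and compare line counts. The sketch below uses a made-up two-gene sample; the record layout (ID line, three description blocks, "=" separator line), the gene names, and the file name sample.txt are all assumptions for illustration, not taken from the real file:

```shell
# Made-up two-gene sample; layout and names are assumptions.
cd "$(mktemp -d)" || exit 1
cat > sample.txt <<'EOF'
WBGene99999991 gene-1 SEQ.1
Concise description: first gene,
second line.
Automated description: auto text one,
continued.
Gene class description: class one
=
WBGene99999992 gene-2 SEQ.2
Concise description: second gene.
Automated description: auto text two.
Gene class description: class two
=
EOF

# Same logic as the consolidated script above, condensed.
awk '
function close_line() { if (outfile) print "" > outfile }
/Concise description:/   { close_line(); outfile="WB283_concise.txt";   pfx="" }
/Automated description:/ { close_line(); outfile="WB283_automated.txt"; pfx="" }
/Gene class description/ { close_line(); outfile="" }
outfile                  { printf "%s%s", pfx, $0 > outfile; pfx=" " }
END                      { close_line() }
' sample.txt

# One output line per description block: the three counts should agree.
grep -c 'Concise description:' sample.txt
grep -c '' WB283_concise.txt
grep -c '' WB283_automated.txt
```

If the counts differ on the real file, that would suggest a block that does not follow the assumed Concise / Automated / Gene class order.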
CodePudding user response:
May I suggest a slightly modified solution (not exactly what is asked for, but with potentially useful thoughts):
awk '
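/WBGene/ { printf("\n%s: ", $2) }            # gene ID line: start a new output line with the gene name
/Concise description/ { flag = 1; $1=$2="" } # start collecting; blank the two label words
/=/ { flag = 0 }                             # a line containing "=": stop collecting
/^.* description/ { flag = 0 }               # any other "... description" label: stop collecting
                                             # (the Concise line no longer matches, its label was blanked above)
flag { printf " %s", $0 }                    # append the current line to the output line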
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt
The idea is to filter out the string "Concise description", as this is what we are looking for in any case. The name of the gene is printed in the first column, as many concise descriptions don't include the name.
The output format is a single line for each gene, starting with its name and a colon, followed by the "pure" concise description.
By the way: if you want to create a second output, with the "Automated description" in each line, change the pattern on the second line of the awk script from /Concise description/ to /Automated description/.
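Instead of editing the script for each section, the label could be passed in with -v. This is only a sketch of that idea, not the script above: the variable name want, the anchored /^WBGene/ and $0 == "=" tests, and the made-up sample (layout, gene names, sample.txt) are all assumptions.

```shell
# Made-up two-gene sample; layout and names are assumptions.
cd "$(mktemp -d)" || exit 1
cat > sample.txt <<'EOF'
WBGene99999991 gene-1 SEQ.1
Concise description: first gene,
second line.
Automated description: auto text one,
continued.
Gene class description: class one
=
WBGene99999992 gene-2 SEQ.2
Concise description: second gene.
Automated description: auto text two.
Gene class description: class two
=
EOF

want="Automated description:"   # or "Concise description:"
out=$(awk -v want="$want" '
  # ID line: finish the previous gene line, then print "name:".
  /^WBGene/ { if (n++) print ""; printf "%s:", $2; flag = 0; next }
  # Separator line: stop collecting.
  $0 == "=" { flag = 0; next }
  # Any description label: collect only if it is the requested one,
  # and drop the label itself from the line.
  /^[A-Za-z ]+ description:/ {
      flag = (index($0, want) == 1)
      if (flag) $0 = substr($0, length(want) + 1)
  }
  flag { sub(/^ +/, ""); printf " %s", $0 }
  END  { if (n) print "" }      # terminate the last line
' sample.txt)
printf '%s\n' "$out"
```

Unlike the original, this variant avoids the leading blank line and ends the last line with a newline; whether the stricter separator test fits the real file would need checking.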