I'm new to bash/awk. Could you help me solve this problem? I want to write a small script that copies each miRNA name to the left of every sequence below it, until a new name is found. The file is in CSV format. Thanks.
input file:
Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,
I want to convert it to this:
Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC
Organism: hsa,let-7a-2-3p,Totals:
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT
Any help with awk/bash code to do this conversion?
CodePudding user response:
With awk:
awk 'BEGIN{FS=OFS=","}
{
if($1!=""){org=$1; next}
if(NF==2) {foo=$2; next}
if(NF==4) {print org, foo, $3}
}' file
This awk code processes the file line by line. Before any input is read, the BEGIN block sets the field separator (FS) and the output field separator (OFS) to a comma.
For each line, it checks if the first field is not empty. If it is not, it assigns the value of the first field to the variable "org" and skips to the next line. If the number of fields (NF) is 2, it assigns the second field to the variable "foo" and skips to the next line. Finally, if the number of fields is 4, it prints the value of "org", "foo", and the third field, separated by commas.
Output:
Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC
Organism: hsa,let-7a-2-3p,Totals:
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
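A possible variation, as a minimal sketch only: the NF tests above rely on every data line ending with a trailing comma (so that it has four fields). Assuming some rows might lack that comma, one could test which field is non-empty instead. The function name fill_mirna and the variable mirna are made up for illustration; the sample rows are taken from the question.

```shell
#!/usr/bin/env bash
# fill_mirna: hypothetical helper, reads the CSV on stdin and writes
# "organism,miRNA,data" rows on stdout.
fill_mirna() {
  awk 'BEGIN { FS = OFS = "," }
       $1 != "" { org = $1; next }        # organism line: remember it
       $2 != "" { mirna = $2; next }      # miRNA name line: remember it
       $3 != "" { print org, mirna, $3 }  # data line: emit the full row
  '
}

# Usage with the input from the question
fill_mirna <<'EOF'
Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,
EOF
```

Because the rules look at field contents rather than at NF, a data row such as ",,CTATACAATCTACTGTC" (no trailing comma) would be handled the same way.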
CodePudding user response:
Slightly rephrasing the OP's request:
- Find the first non-empty field in each line
- Fill the empty columns till that field with value of previous line
While this is possible in bash, awk has simpler syntax for such tasks and is usually much faster.
Solution using a #! line to execute awk; run it with awk -f filename instead if you prefer.
#! /usr/bin/awk -f
BEGIN {
    OFS = FS = ","   # comma delimited input/output
    np = 0           # Number of elements in previous line
}
{
    for (i = 1; i <= NF; i++) {
        # Check if first non-empty - break loop
        if ($i != "" || i > np) break
        # Copy values from previous line
        $i = p[i]
    }
    print
    # Update p/np with current data, for next record processing
    for (j = i; j <= NF; j++) p[j] = $j
    np = NF
}
Notes:
- p[i] stores the value of field #i in the previous record
- np stores the number of fields in the previous record, i.e. the number of valid items in p.
- Tested under Ubuntu and Windows. Should work on other Linux systems as well.
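To see how p and np behave, here is a small sketch that wraps the same logic in a shell function (fill_down is a hypothetical name) and feeds it the first three lines of the question's input. Note that this approach prints every record, so the partially filled rows and the trailing comma appear in the output too. The final awk -F, '$3 != ""' filter is my own addition, assuming only rows with a non-empty third field are wanted.

```shell
#!/usr/bin/env bash
# fill_down: hypothetical wrapper around the fill-down script above.
fill_down() {
  awk '
    BEGIN { OFS = FS = ","; np = 0 }
    {
      for (i = 1; i <= NF; i++) {
        if ($i != "" || i > np) break   # first non-empty field found
        $i = p[i]                       # inherit value from previous record
      }
      print
      for (j = i; j <= NF; j++) p[j] = $j   # remember fields from i onward
      np = NF
    }'
}

# All records, including the partially filled ones
printf '%s\n' 'Organism: hsa,' ',let-7a-2-3p' ',,CTGTACAGCCTCCTAGCTTTCC,' | fill_down

# Assumed extra step: keep only rows whose third field is non-empty
printf '%s\n' 'Organism: hsa,' ',let-7a-2-3p' ',,CTGTACAGCCTCCTAGCTTTCC,' |
  fill_down | awk -F, '$3 != ""'
```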
CodePudding user response:
Easy enough in raw bash -
while IFS=, read -r org rna dat _
do  [[ -n "$org" ]] && organism="$org" && continue   # organism line
    [[ -n "$rna" ]] && miRNA="$rna" && continue      # miRNA name line
    printf "%s,%s,%s\n" "$organism" "$miRNA" "$dat"  # data line
done < file.in
or even
while IFS=, read -r org rna dat _; do
  case "$org,$rna,$dat" in
    ?*,,) organism="$org" ;;
    ,?*,) miRNA="$rna" ;;
    ,,?*) printf "%s,%s,%s\n" "$organism" "$miRNA" "$dat" ;;
  esac
done < file.in
Output from either -
Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC
Organism: hsa,let-7a-2-3p,Totals:
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT
Organism: foo,abc-1x-8-8r,lalaladeda
Organism: foo,abc-1x-8-8r,Totals: (should something go here?)
Organism: foo,and-2a-3p,DADADADADADADADAD
Organism: foo,and-2a-3p,CTATACAATCTACTGTCT
Still, awk will likely be a lot faster and more efficient, especially on files of any size.