bash/awk code to convert CSV table format

I'm new to bash/awk. Could you help me solve this problem? I want to write a small script that copies each miRNA name to the left of every sequence that follows it, until a new name is found. The file is in CSV format. Thanks.

input file:

Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,

I want to convert it to this:

Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC
Organism: hsa,let-7a-2-3p,Totals: 
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT

any help?

CodePudding user response：

With awk:

awk 'BEGIN{FS=OFS=","}
     {
       if($1!=""){org=$1; next}
       if(NF==2) {foo=$2; next}
       if(NF==4) {print org, foo, $3}
     }' file

This awk script processes the file line by line. In the BEGIN block, which runs before any input is read, it sets both the input field separator (FS) and the output field separator (OFS) to a comma.

For each line, it first checks whether the first field is non-empty. If so, it stores that field in the variable "org" and skips to the next line. Otherwise, if the line has two fields (NF==2), it stores the second field in the variable "foo" and skips to the next line. Finally, if the line has four fields (NF==4), it prints "org", "foo", and the third field, separated by commas.

Output:

Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC
Organism: hsa,let-7a-2-3p,Totals: 
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
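
To see why the order of the tests in the script above matters, it can help to print NF for every line of the sample input. This is only a diagnostic sketch: the variable names sample and nf_report are made up for illustration, and the data is the sample from the question.

```shell
# Sample input from the question, held in a variable for the demo.
sample='Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,'

# Report the field count and the first field of every line.
nf_report=$(printf '%s\n' "$sample" |
    awk -F, '{ printf "NF=%d first=[%s] line=%s\n", NF, $1, $0 }')

printf '%s\n' "$nf_report"
```

Both the organism line and the miRNA name lines split into two fields, so NF alone cannot tell them apart. That is why the script tests $1 first and only then looks at NF.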

CodePudding user response：

Slightly rephrasing the OP's ask as:

Find the first non-empty field in each line.
Fill the empty columns before that field with the values from the previous line.

While this is possible in bash, awk has simpler syntax for such tasks and is much faster.

The solution below uses a #! line to execute awk; run it with awk -f filename instead if you prefer.

#! /usr/bin/awk -f
BEGIN {
    OFS = FS = ","   # comma delimited input/output
    np = 0           # Number of fields in previous line
}
{
    for (i = 1; i <= NF; i++) {
        # Stop at the first non-empty field, or once past the previous line's width
        if ( $i != "" || i > np ) break
        # Copy value from previous line
        $i = p[i]
    }
    print
    # Update p/np with current data, for next record processing
    for (j = i; j <= NF; j++) p[j] = $j
    np = NF
}

Notes:

p[i] stores the value of field #i from the previous record.
np stores the current number of items in p.
Tested under Ubuntu and Windows. Should work on other Linux systems as well.
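
Note that this fill-down approach prints every input line, including the organism and miRNA name lines, and keeps the trailing comma on the sequence lines. A rough sketch of one way to get from there to the exact output asked for in the question is to add a second pass that keeps only the four-field lines. The second awk call and the variable names sample, filled and wanted are assumptions added for illustration, not part of the script above.

```shell
# Sample input from the question.
sample='Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,'

# Pass 1: the fill-down logic from the script, inlined as a one-shot program.
filled=$(printf '%s\n' "$sample" | awk '
BEGIN { OFS = FS = ","; np = 0 }
{
    for (i = 1; i <= NF; i++) {
        if ($i != "" || i > np) break   # first non-empty field, or past previous width
        $i = p[i]                       # inherit the value from the previous record
    }
    print
    for (j = i; j <= NF; j++) p[j] = $j
    np = NF
}')

# Pass 2 (hypothetical post-filter): keep only the four-field sequence lines
# and drop their empty last field.
wanted=$(printf '%s\n' "$filled" |
    awk 'BEGIN { FS = OFS = "," } NF == 4 { print $1, $2, $3 }')

printf '%s\n' "$wanted"
```

On the sample data, the first pass yields seven lines and the second pass reduces them to the four sequence lines. The filter relies on sequence lines being the only ones with four fields, which holds for the sample but may not hold for other files.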

CodePudding user response：

Easy enough in raw bash -

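# -r keeps backslashes literal; _ swallows any trailing fields
while IFS=, read -r org rna dat _
do [[ -n "$org" ]] && organism="$org" && continue   # organism header line
   [[ -n "$rna" ]] && miRNA="$rna"    && continue   # miRNA name line
   printf "%s,%s,%s\n" "$organism" "$miRNA" "$dat"  # sequence line
done < file.in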

or even

while IFS=, read -r org rna dat _; do
  case "$org,$rna,$dat" in
  ?*,,) organism="$org"                                 ;;
  ,?*,) miRNA="$rna"                                    ;;
  ,,?*) printf "%s,%s,%s\n" "$organism" "$miRNA" "$dat" ;;
  esac
done < file.in

Output from either (the last four lines come from extra test records, not from the sample in the question) -

Organism: hsa,let-7a-2-3p,CTGTACAGCCTCCTAGCTTTCC
Organism: hsa,let-7a-2-3p,Totals:
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTC
Organism: hsa,mir-7a-3p,CTATACAATCTACTGTCT
Organism: foo,abc-1x-8-8r,lalaladeda
Organism: foo,abc-1x-8-8r,Totals: (should something go here?)
Organism: foo,and-2a-3p,DADADADADADADADAD
Organism: foo,and-2a-3p,CTATACAATCTACTGTCT

Still, awk will likely be a lot faster and more efficient, especially on files of any size.
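
If you want to check that the two approaches agree before picking one, a small sketch like the following compares their output on the sample from the question. The variable names sample, awk_out and sh_out are made up for illustration, and the loop uses rest in place of _ so that it also runs under a plain POSIX sh.

```shell
# Sample input from the question.
sample='Organism: hsa,
,let-7a-2-3p
,,CTGTACAGCCTCCTAGCTTTCC,
,,Totals: ,
,mir-7a-3p
,,CTATACAATCTACTGTC,
,,CTATACAATCTACTGTCT,'

# awk version: remember organism and miRNA name, print on four-field lines.
awk_out=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = OFS = "," }
    $1 != ""  { org = $1; next }
    NF == 2   { name = $2; next }
    NF == 4   { print org, name, $3 }')

# Shell version: the case-based read loop.
sh_out=$(printf '%s\n' "$sample" | while IFS=, read -r org rna dat rest; do
    case "$org,$rna,$dat" in
    ?*,,) organism="$org" ;;
    ,?*,) miRNA="$rna" ;;
    ,,?*) printf '%s,%s,%s\n' "$organism" "$miRNA" "$dat" ;;
    esac
done)

# Report whether the two outputs are identical.
if [ "$awk_out" = "$sh_out" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi
```

This only confirms agreement on this one sample. Inputs with blank lines, quoted commas, or a different column layout could still make the two versions behave differently.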