Joining columns based on number of fields

Time:03-15

I have a large workflow that gets tripped up by uncharacterized chromosomes - a process produces a count matrix that has n fields for canonical chromosomes, but lines with uncharacterized chromosomes have n+1 or n+2 fields. This is a headache for using read.table() downstream.

My approach is to first identify what n is, and use this to isolate the n+1 and n+2 lines containing these uncharacterized chromosomes:

awk -v nf="$canon" 'NF!=nf{print}{}' matrix.txt | head
chr22   KI270733v1  random  123189  123362      6   4   8   0   0   10
chrUn   GL000220v1  105951  106963  -   0   0   0   0   10  0 

The goal is for these lines to match the number of fields n, by joining the 1st and 2nd columns where a line has n+1 fields, and the 1st, 2nd and 3rd columns where it has n+2 fields, to produce:

chrUn-GL000220v1    105951  106963  -   0   0   0   0   10  0
chr22-KI270733v1-random 123189  123362      6   4   8   0   0   10

Attempt

I could subset the matrix and split it into 3 files, one each for NF==n, NF==n+1 and NF==n+2, and join the columns:

awk -v n="$canon" 'NF==n{print}{}' matrix.txt | head

chr1    15534236    15536814    -   0   10  0   0   0   3

(^ no action needed)

awk -v n="$canon" 'NF==n+1{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2,$3,$4,$5,$6,$7,$8,$9,$10}' | head

chrUn-GL000220v1    105992  107309  -   0   0   0   0   0   4

and

awk -v n="$canon" 'NF==n+2{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2"-"$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' | head

chr22-KI270733v1-random 123189  123362      6   4   8   0   0   10

Unfortunately, this solution is not dynamic - I have to specify the range of columns. The workflow could contain any number of columns after the first four detailing Chr, Start, Stop, Strand.


Hopefully I have defined the problem well. Any suggestions would be greatly appreciated.

CodePudding user response:

Try:

awk -v n=13 '{ for (i = 2; i <= NF - n + 1; ++i) { $1 = $1"-"$i; $i=""; } } 1'

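Here n is the number of fields in a canonical line. The loop accumulates the extra fields into $1 and blanks each merged field with $i="". The blanked fields stay in the record as empty strings, so the output has doubled separators where they were.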

You could also move the values to the left with if (NF != n) for (i = 2; i < NF; ++i) $i=$(i + (NF-n)) and then set NF=n.
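Putting the two ideas together (accumulate into $1, then shift left and set NF), a minimal sketch might look like the following. The function name join_extra_fields is hypothetical, the sample lines are made up for illustration, and it assumes whitespace-separated input with no empty fields and an awk that rebuilds the record when NF is assigned (gawk and mawk do).

```shell
# Hypothetical helper: fold surplus leading fields into $1 so that
# every line ends up with exactly n fields.
# Usage: join_extra_fields N < matrix.txt
join_extra_fields() {
  awk -v n="$1" 'BEGIN { OFS = "\t" }
  NF > n {
    extra = NF - n                     # how many surplus name fields this line has
    for (i = 2; i <= extra + 1; ++i)   # append each surplus field to $1
      $1 = $1 "-" $i
    for (i = 2; i <= n; ++i)           # shift the remaining fields left
      $i = $(i + extra)
    NF = n                             # truncate the leftover tail
  }
  { $1 = $1; print }                   # rebuild each line so it is joined with OFS
  '
}

# Example with n = 6 (illustrative data only)
printf 'chr1 15534236 15536814 - 0 10\nchrUn GL000220v1 105951 106963 - 0 0\n' |
  join_extra_fields 6
```

Because the number of surplus fields is computed per line from NF - n, nothing here depends on how many count columns follow the first four, and unlike blanking with $i="" no empty fields are left behind.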
