I have a large workflow that gets tripped up by uncharacterized chromosomes: a process produces a count matrix that has n fields for canonical chromosomes, but n+1 or n+2 fields for lines with uncharacterized chromosomes. This is a headache when using read.table() downstream.
My approach is to first identify what n is, and then use it to isolate the n+1 and n+2 lines containing these uncharacterized chromosomes:
awk -v nf="$canon" 'NF!=nf{print}{}' matrix.txt | head
chr22 KI270733v1 random 123189 123362 6 4 8 0 0 10
chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0
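How $canon is obtained is not shown here. As a minimal sketch, assuming every non-blank line has at least n fields (so n is simply the smallest field count in the file), it could be derived as follows. The name canon_fields is a hypothetical helper, not part of the workflow.

```shell
# canon_fields (hypothetical helper): print the smallest non-zero field count
# found in the given file(s), or on stdin when no file is given.
# Assumption: canonical lines are the shortest ones, as described above.
canon_fields() {
  awk 'NF > 0 && (min == 0 || NF < min) { min = NF }  # skip blank lines, track minimum
       END { print min + 0 }' "$@"
}

# Demo on one canonical line and one uncharacterized line from this post.
printf '%s\n' \
  'chr1 15534236 15536814 - 0 10 0 0 0 3' \
  'chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0' | canon_fields   # prints 10
```

In the workflow this would be used as canon=$(canon_fields matrix.txt).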
The goal is for these lines to have n fields as well, by joining the 1st and 2nd columns where NF is n+1, and the 1st, 2nd and 3rd columns where NF is n+2, to produce:
chrUn-GL000220v1 105951 106963 - 0 0 0 0 10 0
chr22-KI270733v1-random 123189 123362 6 4 8 0 0 10
Attempt
I could subset the matrix and split it into 3 files, one each for NF==n, NF==n+1 and NF==n+2, and join the columns:
awk -v n="$canon" 'NF==n{print}{}' matrix.txt | head
chr1 15534236 15536814 - 0 10 0 0 0 3
(^ no action needed)
awk -v n="$canon" 'NF==n+1{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2,$3,$4,$5,$6,$7,$8,$9,$10}' | head
chrUn-GL000220v1 105992 107309 - 0 0 0 0 0 4
and
awk -v n="$canon" 'NF==n+2{print}{}' matrix.txt | awk -v OFS="\t" '{print $1"-"$2"-"$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' | head
chr22-KI270733v1-random 123189 123362 6 4 8 0 0 10
Unfortunately, this solution is not dynamic - I have to specify the range of columns. The workflow could contain any number of columns after the first four detailing Chr, Start, Stop, Strand.
Hopefully I have defined the problem well; any suggestions would be greatly appreciated.
CodePudding user response:
Try:
awk -v n=13 '{ for (i = 2; i <= NF - n + 1; i++) { $1 = $1"-"$i; $i=""; } } 1'
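This accumulates each extra field into $1 and then blanks it with $i="". The blanked fields are still present in the record, so the output has repeated separators where they were.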
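The blanked fields in the one-liner above stay in the record as empty slots. A minimal sketch of the same join that also collapses those slots is shown below; it assumes whitespace-separated input and uses the common re-split idiom ($0 = $0 followed by $1 = $1). The name join_extra is a hypothetical helper.

```shell
# join_extra N (hypothetical helper): read a matrix on stdin and merge the
# surplus leading fields into $1 so that every line ends up with N fields.
# Assumption: fields are whitespace-separated and contain no embedded blanks.
join_extra() {
  awk -v n="$1" '
    NF > n {
      for (i = 2; i <= NF - n + 1; i++) {
        $1 = $1 "-" $i      # append the surplus field to the chromosome name
        $i = ""             # blank it; the empty slot is removed below
      }
      $0 = $0               # re-split the record, dropping the empty slots
      $1 = $1               # rebuild the record with single OFS separators
    }
    { print }               # canonical lines pass through unchanged
  '
}

# Demo on lines from this post, where n is 10.
printf '%s\n' \
  'chr1 15534236 15536814 - 0 10 0 0 0 3' \
  'chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0' | join_extra 10
```

On a real file this would be called as join_extra "$canon" < matrix.txt.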
Instead of blanking the fields, you could also accumulate into $1 as before, then shift the remaining values to the left and set NF=n:
if (NF != n) { for (i = 2; i <= n; i++) $i = $(i + (NF-n)); NF = n }
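Put together, the shift-left variant could look like the sketch below. It assumes an awk that truncates the record when NF is lowered (gawk and mawk do this; some older awks may not), and shift_left is a hypothetical helper name.

```shell
# shift_left N (hypothetical helper): read a matrix on stdin, merge surplus
# leading fields into $1, shift the rest left, and truncate to N fields.
# Assumption: lowering NF drops the trailing fields (true for gawk and mawk).
shift_left() {
  awk -v n="$1" '
    NF > n {
      k = NF - n                                   # number of surplus fields
      for (i = 2; i <= k + 1; i++) $1 = $1 "-" $i  # accumulate names into $1
      for (i = 2; i <= n; i++) $i = $(i + k)       # move values k slots left
      NF = n                                       # drop the leftover tail
    }
    { print }                                      # other lines are untouched
  '
}

# Demo on lines from this post, where n is 10.
printf '%s\n' \
  'chr1 15534236 15536814 - 0 10 0 0 0 3' \
  'chrUn GL000220v1 105951 106963 - 0 0 0 0 10 0' | shift_left 10
```

Copying in ascending order is safe here because each value is read from a higher-numbered field than the one being written.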