I want to extract lines to create a separate file for each chromosome. My source file looks like this:
I could run awk '{OFS="\t"} {split($1, a, "r")} a[2]==1 {print $0}' MISA.bed > chr_1.bed
for each chromosome, but doing this manually takes a lot of time.
I tried the loop below. It created the correct list of files, but they were all empty:
for i in {1..22}; do awk '{OFS="\t"} $1=="chr${i}" {print $0}' MISA.bed > chr_${i}.bed; done
CodePudding user response:
How about:
awk -F '\t' '{F=sprintf("chr_%s.bed",$1); print >> F}' MISA.bed
CodePudding user response:
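Bash variables and awk variables are separate and distinct. In your bash loop, the single quotes around the awk program stop the shell from expanding ${i}, so awk compares $1 with the literal string chr${i}, nothing matches, and every output file is empty. It is possible to send i to awk as a named variable using the -v var="value" switch. In your case, though, there is no need, because the first part of your output filename is data field 1 ($1) within awk.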
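For reference, here is a minimal sketch of the loop repaired with -v. The scratch directory and the three-line sample file are assumptions made only so the snippet runs on its own, and n is a hypothetical name for the awk variable.

```shell
# Work in a scratch directory so no existing files are touched (assumption for this sketch).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Small tab-separated sample standing in for MISA.bed.
printf 'chr1\t10286\t10321\t6\tAACCCC\nchr2\t10186\t10221\t2\tTC\nchr1\t10386\t10421\t2\tCT\n' > MISA.bed

# Pass the shell loop counter into awk as the awk variable n,
# then compare field 1 with the string "chr" followed by n.
for i in 1 2; do
    awk -v n="$i" '$1 == ("chr" n)' MISA.bed > "chr_${i}.bed"
done
```

With real data the loop would run over all chromosome numbers (1 to 22 in the question) instead of just 1 and 2. It still reads the input once per chromosome, which is why the single-pass approach below is simpler.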
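The only other thing to know is that output from print can be redirected to a file from within awk using >, much as in the shell. Unlike the shell, awk truncates the file only when it first opens it during a run, and later prints to the same filename are appended, so >> is not needed for appending.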
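A small sketch of that redirection behaviour, assuming a scratch directory and a hypothetical file name out.txt:

```shell
# Work in a scratch directory (assumption for this sketch).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# out.txt already holds a stale line before awk starts.
echo 'stale' > out.txt

# Three prints go to the same file with ">". awk truncates the file once,
# when it first opens it, and the later prints are appended.
printf 'a\nb\nc\n' | awk '{ print $0 > "out.txt" }'

cat out.txt
```

The stale line is gone and all three input lines are present, which is the behaviour the one-liner in this answer relies on.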
The field separator does not need to be reset for your example.
This simple awk procedure writes each line of the input file (MISA.bed) to a matching output .bed file named after the value in field 1 of the data (chr1.bed, chr2.bed, etc.):
awk '{print $0 > $1".bed"}' MISA.bed
Most modern versions of awk should handle this. Some versions might require the concatenated filename to be parenthesized (print $0 > ($1".bed")) or pre-formed (fn=$1".bed"; print $0 > fn).
As your files contain chromosome data, you will process 20-some output files, which should not pose any problems (they remain open until awk finishes). If an error relating to "too many files" is encountered, a close(fn) statement could be added after the print in the awk block. In that case the redirection must be changed to >>, because a file reopened with > after close() is truncated again. This will almost certainly not be needed.
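If the open-file limit ever did become a problem, the close() variant might look like the sketch below. The scratch directory and sample data are assumptions for illustration. The rm is there because >> would otherwise append to files left over from an earlier run.

```shell
# Work in a scratch directory (assumption for this sketch).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Small tab-separated sample standing in for MISA.bed.
printf 'chr1\t1\t2\nchr2\t3\t4\nchr1\t5\t6\n' > MISA.bed

# Remove leftovers first: ">>" appends to whatever already exists.
rm -f chr*.bed

awk '{
    fn = $1 ".bed"      # pre-form the output filename
    print $0 >> fn      # append, because the file is reopened after each close
    close(fn)           # release the file handle straight away
}' MISA.bed
```

Opening and closing a file for every line is slower than keeping the files open, so this is only worth doing when the number of distinct output files is very large.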
test
test file MISA.bed
chr1 10286 10321 6 AACCCC
chr1 10386 10421 2 CT
chr1 10486 10521 2 TC
chr2 10186 10221 2 TC
chr2 10286 10321 2 AT
chr2 10486 10521 6 CCAACC
chr3 10486 10521 2 TC
chr3 10586 10621 2 AT
chr3 10686 10721 6 CCAACC
command:
awk '{print $0 > $1".bed"}' MISA.bed
file checks:
> ls *.bed
> chr1.bed chr2.bed chr3.bed MISA.bed
> cat chr1.bed
chr1 10286 10321 6 AACCCC
chr1 10386 10421 2 CT
chr1 10486 10521 2 TC
> cat chr2.bed
chr2 10186 10221 2 TC
chr2 10286 10321 2 AT
chr2 10486 10521 6 CCAACC
> cat chr3.bed
chr3 10486 10521 2 TC
chr3 10586 10621 2 AT
chr3 10686 10721 6 CCAACC
Tested using GNU Awk 5.1.0, API: 3.0, on a Raspberry Pi 400.