Create multiple files based on variable in BED file

I want to extract lines from a BED file into a separate file for each chromosome. The chromosome name (chr1, chr2, and so on) is in the first column of my source file, MISA.bed.

I could run awk '{OFS="\t"} {split($1, a, "r")} a[2]==1 {print $0}' MISA.bed > chr_1.bed for each chromosome, but doing that manually takes too much time.

I tried the following loop. It created the correct list of files, but they were all empty:

for i in {1..22}; do awk '{OFS="\t"} $1=="chr${i}" {print $0}' MISA.bed > chr_${i}.bed; done

CodePudding user response：

how about:

awk -F '\t' '{F=sprintf("chr_%s.bed",$1); print >> F}' MISA.bed

CodePudding user response：

Bash variables and awk variables are separate and distinct. For your bash loop, it is possible to send i as a named variable to awk using the -v var="value" switch. But, in your case, there is no need as the first part of your output filename is data field 1 ($1) within awk.
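For reference, here is a minimal sketch of the -v approach applied to the loop from the question. The original loop produced empty files because, inside single quotes, "chr${i}" reaches awk as the literal text chr${i}, which never matches field 1. The scratch directory, the three-line sample data and the variable name chrom are assumptions made only for illustration.

```shell
# Work in a scratch directory so no real data is touched (hypothetical setup).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Small tab-separated sample standing in for MISA.bed.
printf 'chr1\t10286\t10321\t6\tAACCCC\nchr2\t10186\t10221\t2\tTC\nchr1\t10386\t10421\t2\tCT\n' > MISA.bed

# -v assigns the shell value "chr1", "chr2", ... to the awk variable chrom
# before the awk program starts. A pattern without an action prints the line.
for i in 1 2; do
    awk -v chrom="chr${i}" '$1 == chrom' MISA.bed > "chr_${i}.bed"
done
```

This still reads MISA.bed once per chromosome, which is why the single-pass approach is preferable for large files.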

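The only other thing to know is that print output can be redirected to a file from within awk using >, much as in the shell. Unlike the shell, awk truncates the file only when it first opens it; later prints to the same file during the same run are appended, so >> is not needed.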
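A small sketch of how > behaves inside awk, assuming a scratch directory and a hypothetical output file out.txt: the stale content is discarded when awk first opens the file, and every later print lands after the previous one.

```shell
# Scratch directory and file names are illustrative only.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Pre-existing content that should disappear once awk opens the file with >.
printf 'stale line from an earlier run\n' > out.txt

# The first print opens and truncates out.txt; the following prints reuse
# the already-open file, so all three input lines are kept.
printf 'a\nb\nc\n' | awk '{ print $0 > "out.txt" }'

cat out.txt
```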

The field separator does not need to be reset for your example.

This simple awk command writes each line of the input file (MISA.bed) to an output .bed file named after the value in field 1 of that line (chr1.bed, chr2.bed, etc.).

awk '{print $0 > $1".bed"}' MISA.bed

Most modern versions of awk should handle this. Some versions might require the concatenated filename to be pre-formed (fn=$1".bed"; print $0 > fn).

As your files contain chromosome data, you will process 20-some output files, which should not pose any problems (they remain open until awk finishes). If an error relating to "too many files" is encountered, a close(fn) statement could be added at the end of the awk block (this will almost certainly not be needed). In that case the redirection should use >>, because a closed file that is reopened with > is truncated again.
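If closing files does turn out to be necessary, a sketch along these lines could be used. It assumes a scratch directory and a shortened three-column sample; fn is the pre-formed filename variable mentioned earlier.

```shell
# Scratch directory and sample data are illustrative only.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'chr1\t1\t2\nchr2\t3\t4\nchr1\t5\t6\n' > MISA.bed

awk '{
    fn = $1 ".bed"    # build the output name from field 1
    print $0 >> fn    # append, since the file is reopened for later lines
    close(fn)         # release the file right away
}' MISA.bed
```

Opening and closing a file for every line is slower than keeping it open, and because >> appends, output files left over from an earlier run should be removed before running it again.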

test

test file MISA.bed

chr1    10286   10321   6   AACCCC
chr1    10386   10421   2   CT
chr1    10486   10521   2   TC
chr2    10186   10221   2   TC
chr2    10286   10321   2   AT
chr2    10486   10521   6   CCAACC
chr3    10486   10521   2   TC
chr3    10586   10621   2   AT
chr3    10686   10721   6   CCAACC

command:

awk '{print $0 > $1".bed"}' MISA.bed

file checks:

> ls *.bed
chr1.bed  chr2.bed  chr3.bed  MISA.bed

> cat chr1.bed 
chr1    10286   10321   6   AACCCC
chr1    10386   10421   2   CT
chr1    10486   10521   2   TC

> cat chr2.bed 
chr2    10186   10221   2   TC
chr2    10286   10321   2   AT
chr2    10486   10521   6   CCAACC

> cat chr3.bed 
chr3    10486   10521   2   TC
chr3    10586   10621   2   AT
chr3    10686   10721   6   CCAACC

Tested using GNU Awk 5.1.0, API: 3.0, on a Raspberry Pi 400.