How can I randomly delete a set number of characters from a file?

Time:06-13

I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.

So let's say I have a fasta file like this...

>GCA_900186885_1_000000000001
ATGCAAACATTTGTAAAAAACTTAATCGAT

I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:

>GCA_900186885_1_000000000001
ATGAAACATTGTAAAAACTTAATCGAT
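
A minimal sketch of that operation in shell, assuming the FASTA data arrives on standard input, header lines start with `>`, and every character on a non-header line counts as a nucleotide. The function name `delete_random_nt` and its optional seed argument are hypothetical, introduced only for illustration. It uses selection sampling in awk: each position is visited once and dropped with probability (deletions still needed) / (positions still left), so exactly the requested number of nucleotides is removed.

```shell
# Hypothetical helper: delete exactly $1 randomly chosen nucleotides from
# FASTA data on stdin. $2 is an optional RNG seed (assumption, for repeatability).
delete_random_nt() {
    awk -v k="$1" -v seed="${2:-}" '
        BEGIN {
            if (seed != "")
                srand(seed)
            else
                srand()
        }
        {
            # Keep every line in memory and count the sequence characters.
            line[NR] = $0
            if ($0 !~ /^>/) total += length($0)
        }
        END {
            need = k + 0        # deletions still to make
            left = total + 0    # sequence positions not yet visited
            if (need > left) {
                print "cannot delete " need " of " left " nucleotides" > "/dev/stderr"
                exit 1
            }
            for (i = 1; i <= NR; i++) {
                if (line[i] ~ /^>/) { print line[i]; continue }  # header: unchanged
                out = ""
                n = length(line[i])
                for (j = 1; j <= n; j++) {
                    if (rand() * left < need)
                        need--                              # drop this nucleotide
                    else
                        out = out substr(line[i], j, 1)     # keep this nucleotide
                    left--
                }
                print out
            }
        }
    '
}

# Example: remove 3 of the 30 nucleotides from the record above, with seed 1.
printf '>GCA_900186885_1_000000000001\nATGCAAACATTTGTAAAAAACTTAATCGAT\n' |
    delete_random_nt 3 1
```

The deleted positions are random, so the output differs between seeds. With no seed, awk's `srand()` falls back to the time of day, so two runs within the same second may produce identical output. The whole input is held in memory, which should be acceptable for a file of a few million nucleotides.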

The above is a simple example that could easily be done by hand, but my real fasta file has 2132142 nucleotides. From it I want to generate three new fasta files with 1918928, 1812321, and 1705714 nucleotides, representing 10%, 15% and 20% reductions.
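
The target sizes follow from simple integer arithmetic. As a small sketch, the hypothetical helper `keep_count` below (not part of the original post) rounds the number of nucleotides to remove to the nearest integer and subtracts it from the total.

```shell
# Hypothetical helper: print how many nucleotides remain after removing
# $2 percent of $1, rounding the number removed to the nearest integer.
keep_count() {
    t=$1                               # total nucleotides
    p=$2                               # percentage to remove
    d=$(( (t * p + 50) / 100 ))        # nucleotides to delete, rounded
    echo $(( t - d ))
}

for pct in 10 15 20; do
    echo "$pct% reduction keeps $(keep_count 2132142 "$pct") nucleotides"
done
# 10% reduction keeps 1918928 nucleotides
# 15% reduction keeps 1812321 nucleotides
# 20% reduction keeps 1705714 nucleotides
```

Any step that samples positions to delete needs the number removed (213214, 319821, or 426428 here), not the number kept.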

I have searched forums like Stack Overflow and Biostars for related questions, but have not found anything useful.

I tried adapting another user's suggestion for randomly deleting lines from a file, as follows, but it didn't work.

filename=/Users/home/DETECTION/GCA_900186885.1_48903_D01_genomic_reformatted.fa
number=1918928

NT_count="$(grep -v ">" $filename | grep -E -o "G|C|T|A|N" | wc -l)"
NT_nums_to_delete="$(shuf -i "1-$NT_count" -n "$number")"
sed_script="$(printf '           