I have a tab-delimited file that contains blocks of text I don't want. How can I remove those blocks without removing the tab-delimited contents below them? I tried sed commands, but then realized that the block appears multiple times across the file. Is there a way to delete every such block in the file? An example of my file is shown below:
Using Custom Genome
Processing chr1
Reading input files...
24896 total sequences read
1 motifs loaded
Finding instances of 1 motif(s)
|0% 50% 100%|
=================================================================================
chr1 19569 19578 3-TGTAAACA,BestGuess:FOXO6 7.527650
chr1 24247 24256 3-TGTAAACA,BestGuess:FOXO6 6.917285
chr1 31424 31433 3-TGTAAACA,BestGuess:FOXO6 6.917285 -
chr1 32811 32820 3-TGTAAACA,BestGuess:FOXO6 6.757443 -
chr1 33201 33210 3-TGTAAACA,BestGuess:FOXO6 9.114025 -
chr1 39608 39617 3-TGTAAACA,BestGuess:FOXO6 6.037806 -
chr1 42262 42271 3-TGTAAACA,BestGuess:FOXO6 8.442267 -
Essentially, everything above the row of = characters needs to be removed (including that row itself). You may be wondering why I don't just remove the first 12 lines; the problem is that the block of text appears again many thousands of lines further down, 22 times in total. The next occurrence of the block of text is shown below:
chr1 248936049 248936058 3-TGTAAACA,BestGuess:FOXO6 7.978484
chr1 248937454 248937463 3-TGTAAACA,BestGuess:FOXO6 7.065060
chr1 248943583 248943592 3-TGTAAACA,BestGuess:FOXO6 8.232793 -
Reading input files...
13380 total sequences read
1 motifs loaded
Finding instances of 1 motif(s)
|0% 50% 100%|
=================================================================================
chr10 19750 19759 3-TGTAAACA,BestGuess:FOXO6 6.680601
Is there any way to search for these blocks of text and remove them? I can't process the file further while they are interspersed with the data. Please let me know if there's a solution; I've tried many sed commands, but quite a few times they deleted everything (likely the files are backed up).
CodePudding user response:
Maybe you could search for lines that have at least three tabs, e.g. grep -P "\t.*\t.*\t" file.tsv (the -P option of GNU grep is needed here, because plain grep does not treat \t as a tab).
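A minimal sketch of the three-tab idea, assuming a POSIX shell. It builds a literal tab with printf so the pattern does not depend on how grep interprets \t. The variable names are made up for illustration, and the data rows are copied from the question.

```shell
# A literal tab character, built portably.
tab=$(printf '\t')

# Small sample: two log lines followed by two data rows from the question.
sample=$(printf '%s\n' \
  'Reading input files...' \
  '=================' \
  "chr1${tab}31424${tab}31433${tab}3-TGTAAACA,BestGuess:FOXO6${tab}6.917285${tab}-" \
  "chr1${tab}32811${tab}32820${tab}3-TGTAAACA,BestGuess:FOXO6${tab}6.757443${tab}-")

# Keep only lines that contain at least three tab characters.
kept=$(printf '%s\n' "$sample" | grep "${tab}.*${tab}.*${tab}")

printf '%s\n' "$kept"
```

Because the log lines contain no tabs at all, only the two data rows should survive the filter.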
CodePudding user response:
Assuming:
- The desired lines start with chr.
- The desired tab-delimited contents have 6 fields.
then would you please try:
awk -F'\t' '/^chr/ && NF == 6' input_file
If my assumption is incorrect, please let me know.
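As a rough sketch of how this could be tried safely, the snippet below runs the same awk filter on a small sample and writes the result to a new file, so the original is never modified. The paths (tmpdir, input_file, filtered.tsv) are hypothetical; the rows are taken from the question.

```shell
# Hypothetical paths; a temporary directory keeps the demo away from real data.
tmpdir=$(mktemp -d)
input_file="$tmpdir/input_file"
output_file="$tmpdir/filtered.tsv"

# Build a small sample: four log lines from one block plus two data rows.
{
  printf '%s\n' 'Reading input files...' '13380 total sequences read' \
    '|0% 50% 100%|' '================='
  printf 'chr1\t31424\t31433\t%s\t6.917285\t-\n' '3-TGTAAACA,BestGuess:FOXO6'
  printf 'chr1\t32811\t32820\t%s\t6.757443\t-\n' '3-TGTAAACA,BestGuess:FOXO6'
} > "$input_file"

# Keep rows that start with "chr" and have exactly 6 tab-separated fields.
# Redirecting to a new file leaves the input untouched.
awk -F'\t' '/^chr/ && NF == 6' "$input_file" > "$output_file"

cat "$output_file"
```

Note that NF == 6 drops any row with a different field count. Some rows, as displayed in the question, appear to have only five fields; if that reflects the real file, a looser test such as NF >= 5 may be needed.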
CodePudding user response:
We don't know the desired output, but perhaps awk can help. This works only if the first field always follows this pattern: it begins with chr, followed by one or more digits at the end:
awk '$1 ~ /^chr[0-9]+$/' file
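A small sketch of the first-field test, assuming a POSIX shell and awk. A digits-only pattern would skip names such as chrX, so a second, looser pattern is shown as an assumption about what the file might contain. The chrX row is invented for illustration; the other rows come from the question.

```shell
tab=$(printf '\t')

# Two log lines, two rows from the question, and one hypothetical chrX row.
sample=$(printf '%s\n' \
  'Processing chr1' \
  '1 motifs loaded' \
  "chr1${tab}42262${tab}42271${tab}3-TGTAAACA,BestGuess:FOXO6${tab}8.442267${tab}-" \
  "chr10${tab}19750${tab}19759${tab}3-TGTAAACA,BestGuess:FOXO6${tab}6.680601" \
  "chrX${tab}100${tab}109${tab}3-TGTAAACA,BestGuess:FOXO6${tab}7.000000${tab}-")

# First field must be "chr" followed only by digits.
numeric_only=$(printf '%s\n' "$sample" | awk '$1 ~ /^chr[0-9]+$/')

# Looser variant (assumption): also accept chrX, chrY and chrM.
with_named=$(printf '%s\n' "$sample" | awk '$1 ~ /^chr([0-9]+|X|Y|M)$/')

printf '%s\n' "$numeric_only"
printf '%s\n' "$with_named"
```

The line "Processing chr1" is rejected by both patterns because its first field is "Processing", not a chromosome name.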