How to remove a block of text from a TAB-delimited file that appears more than once on linux?

Time:04-07

I have a TAB-delimited file that contains blocks of text I don't want. How can I selectively remove each block of text but keep the tab-delimited contents below it? I tried sed commands, but realized that the block of text appears multiple times across the file. Is there a way to delete all of these blocks of text in the file? An example of my file is shown below:

        Using Custom Genome

        Processing chr1

        Reading input files...
        24896 total sequences read
        1 motifs loaded
        Finding instances of 1 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================
chr1    19569   19578   3-TGTAAACA,BestGuess:FOXO6      7.527650         
chr1    24247   24256   3-TGTAAACA,BestGuess:FOXO6      6.917285         
chr1    31424   31433   3-TGTAAACA,BestGuess:FOXO6      6.917285        -
chr1    32811   32820   3-TGTAAACA,BestGuess:FOXO6      6.757443        -
chr1    33201   33210   3-TGTAAACA,BestGuess:FOXO6      9.114025        -
chr1    39608   39617   3-TGTAAACA,BestGuess:FOXO6      6.037806        -
chr1    42262   42271   3-TGTAAACA,BestGuess:FOXO6      8.442267        -

Essentially, everything above the dashed lines needs to be removed (including the dashed lines). You may be thinking, why not just remove the first 12 lines? The problem is that the block of text appears again many thousands of lines below, and is interspersed a total of 22 times. The next occurrence of the block of text is shown below:


chr1    248936049       248936058       3-TGTAAACA,BestGuess:FOXO6      7.978484         
chr1    248937454       248937463       3-TGTAAACA,BestGuess:FOXO6      7.065060         
chr1    248943583       248943592       3-TGTAAACA,BestGuess:FOXO6      8.232793        -

        Reading input files...
        13380 total sequences read
        1 motifs loaded
        Finding instances of 1 motif(s)
        |0%                                    50%                                  100%|
        =================================================================================
chr10   19750   19759   3-TGTAAACA,BestGuess:FOXO6      6.680601         

Is there any way to search for these blocks of text and remove them? I can't process the file further while these blocks of text are interspersed. Please let me know if there's a solution, because I've tried many sed commands and quite a few times they deleted everything (the files are likely backed up).

CodePudding user response:

Maybe you could search for lines that contain at least three tabs, e.g. grep -P "\t.*\t.*\t" file.tsv (the -P option, available in GNU grep, is what makes \t match a tab character).
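
In many grep implementations, \t inside an ordinary pattern is not treated as a tab, so a more portable variant is to pass a literal tab character to grep. A minimal sketch, assuming that none of the unwanted lines contains three or more tabs; the file names sample.tsv and filtered.tsv and the sample rows are made up for illustration:

```shell
# Hypothetical sample: log lines carry at most one tab, data rows carry five.
{
  printf '\tReading input files...\n'
  printf '\t=====\n'
  printf 'chr1\t19569\t19578\t3-TGTAAACA,BestGuess:FOXO6\t7.527650\t\n'
  printf '\n'
  printf 'chr10\t19750\t19759\t3-TGTAAACA,BestGuess:FOXO6\t6.680601\t-\n'
} > sample.tsv

# Store a literal tab in a variable so the pattern does not rely on \t support.
tab=$(printf '\t')

# Keep only lines that contain at least three tabs; write them to a new file.
grep "${tab}.*${tab}.*${tab}" sample.tsv > filtered.tsv
```

Because the result goes to a separate file, the original stays untouched if the pattern turns out to be too broad or too narrow.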

CodePudding user response:

Assuming:

  • The desired lines start with chr.
  • The desired tab-delimited contents have 6 fields.

then would you please try:

awk -F'\t' '/^chr/ && NF == 6' input_file

If my assumption is incorrect, please let me know.
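
Since the question mentions sed wiping the whole file, it may be safer to write the filtered rows to a new file and compare counts before replacing anything. A rough sketch under the same two assumptions as above; sample_in.tsv and sample_out.tsv are hypothetical names, and the sample rows only imitate the layout shown in the question:

```shell
# Hypothetical sample imitating the question's layout.
{
  printf '\tReading input files...\n'
  printf '\t24896 total sequences read\n'
  printf '\t=====\n'
  printf 'chr1\t19569\t19578\t3-TGTAAACA,BestGuess:FOXO6\t7.527650\t\n'
  printf 'chr1\t31424\t31433\t3-TGTAAACA,BestGuess:FOXO6\t6.917285\t-\n'
  printf '\n'
  printf '\tReading input files...\n'
  printf 'chr10\t19750\t19759\t3-TGTAAACA,BestGuess:FOXO6\t6.680601\t\n'
} > sample_in.tsv

# Write the kept rows to a new file so the original is never modified.
awk -F'\t' '/^chr/ && NF == 6' sample_in.tsv > sample_out.tsv

# Sanity check: the number of lines starting with "chr" in the input
# should equal the number of lines that survived the filter.
expected=$(grep -c '^chr' sample_in.tsv)
kept=$(awk 'END { print NR }' sample_out.tsv)
echo "expected=$expected kept=$kept"
```

If the two numbers differ, some data rows do not have exactly 6 fields and the NF == 6 condition needs adjusting.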

CodePudding user response:

We don't know the desired output, but perhaps awk can help here, provided the first field always follows this pattern: it begins with the characters chr, followed by one or more digits up to the end of the field:

awk '$1 ~ /^chr[0-9]+$/' file
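
One caveat: a digits-only pattern would also drop rows for names such as chrX or chrY, if the file contains them. The excerpt in the question only shows chr1 and chr10, so this is an assumption. A sketch that widens the pattern; the file name sample_xy.tsv, the letters X, Y and M, and the chrX row are illustrative guesses, not taken from the question:

```shell
# Hypothetical sample with one non-numeric chromosome name.
{
  printf '\tProcessing chrX\n'
  printf 'chr1\t19569\t19578\t3-TGTAAACA,BestGuess:FOXO6\t7.527650\t\n'
  printf 'chrX\t19750\t19759\t3-TGTAAACA,BestGuess:FOXO6\t6.680601\t-\n'
} > sample_xy.tsv

# Accept chr followed by digits, or by one of the letters X, Y, M (assumed set).
awk -F'\t' '$1 ~ /^chr([0-9]+|X|Y|M)$/' sample_xy.tsv > sample_xy_out.tsv
```

The log line mentioning chrX is still dropped, because its first tab-separated field is empty.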