I'm trying to delete every line in a file that contains a certain pattern, together with the line directly above it. The pattern is 'Query ' (with a trailing space). The file looks something like this:
1. Query= ENST00000641267.1
2. Query= ENST00000641448.1
3. Query= MSTRG.3294.1
4. Query= ENST00000435134.2
5. Query= ENST00000503142.1
6. Query= ENST00000503142.1
7. Query 8 THSLRYFRLGVSDPIHGVPEFISVGYVDSHPITTYDSVTQQKEPRAPWMAENLVPDHWER 187
8. Query 188 YTQLLKGWQQMFRVELKRQQRHYNHSGSHTYQRMIGCELLEDGSTTGFLQYAYDGQNFLI 367
9. Query 368 FNKDTLS*LAVDNVAHTIKRAREANQHELQYQKNWLEEECIA*LKRFLEYGKDTQQ 535
10. Query= ENST00000612670.1
11. Query 1 MVFTQAPAEIMGHLRICSLLARQCLAEFLGVFVLMLLTQGAVAQAVTSGETKGNFFTMFL 180
12. Query 181 AGSLAVTIAIYVGGNVSG 234
13. Query= MSTRG.3309.1
So lines 6 to 12 should be deleted, while all other lines should be preserved. I've tried the following to remove the line before the pattern, but I can't get it to work:
tac | sed '/Query /'I, 1 d' | tac file.txt > newfile.txt
It just outputs the '>' sign. Can anyone help with this?
Desired output is:
1. Query= ENST00000641267.1
2. Query= ENST00000641448.1
3. Query= MSTRG.3294.1
4. Query= ENST00000435134.2
5. Query= ENST00000503142.1
13. Query= MSTRG.3309.1
Thanks!
CodePudding user response:
This might work for you (GNU sed):
sed '$!N;/\n.*Query /D;/Query /!P;D' file
Append the next line (unless the current line is the last line).
If the appended line contains 'Query ', delete the first line and go again.
If the first line of the two-line window contains 'Query ', don't print it.
Otherwise print the first of the two lines, delete it and go again.
N.B. The appending of the next line is dependent on the current line not being the last, as the default behaviour of sed is to print the pattern space if the N command is called to read past the end of the file. This allows the last line to be treated properly, i.e. if the last line contains 'Query ', it will be deleted.
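To watch the two-line window at work, here is a minimal sketch that runs the same script, split over several lines with comments, against a made-up five-line sample. The sample lines, the temporary file and the result variable are assumptions for illustration only, and GNU sed is assumed as in the answer.

```shell
# Made-up sample: two headers, a two-line "Query " block, then one more header.
sample=$(mktemp)
printf '%s\n' 'Query= A' 'Query= B' 'Query 1 MVFT 60' 'Query 61 AGSL 80' 'Query= C' > "$sample"

# The one-liner from above, one command per line (GNU sed assumed).
result=$(sed '
  # Append the next line unless this is the last line.
  $!N
  # Second line matches: drop the first line and restart without reading.
  /\n.*Query /D
  # Nothing in the window matches: print the first line.
  /Query /!P
  # Drop the first line and restart with what is left.
  D
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On this sample only Query= A and Query= C should survive: Query= B is removed because it sits directly above the matching block.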
CodePudding user response:
$ tac file | awk '/Query /{c=2} !(c&&c--)' | tac
1. Query= ENST00000641267.1
2. Query= ENST00000641448.1
3. Query= MSTRG.3294.1
4. Query= ENST00000435134.2
5. Query= ENST00000503142.1
13. Query= MSTRG.3309.1
See "Printing with sed or awk a line following a matching pattern" for more info.
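The answer gives no explanation, so here is one hedged reading of the idiom: once tac has reversed the file, the line above a match becomes the line below it, and c counts how many lines are still to be suppressed. The sketch below spells the counter out with a longer, hypothetical variable name (skip) on a made-up sample. It is meant to behave like the one-liner, not to replace it, and it assumes tac from GNU coreutils.

```shell
# Made-up sample: two headers, a two-line "Query " block, then one more header.
sample=$(mktemp)
printf '%s\n' 'Query= A' 'Query= B' 'Query 1 MVFT 60' 'Query 61 AGSL 80' 'Query= C' > "$sample"

result=$(tac "$sample" | awk '
  # A match suppresses two lines: itself and the next one in reversed order,
  # which is the line above it in the original file.
  /Query / { skip = 2 }
  # While the counter is running, count down and print nothing.
  skip > 0 { skip--; next }
  # Everything else passes through.
  { print }
' | tac)

printf '%s\n' "$result"
rm -f "$sample"
```

Because a new match resets the counter to 2, a run of consecutive matching lines is removed along with the single non-matching line above the run.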
CodePudding user response:
I would use GNU AWK in the following way. Let the content of file.txt be
1. Query= ENST00000641267.1
2. Query= ENST00000641448.1
3. Query= MSTRG.3294.1
4. Query= ENST00000435134.2
5. Query= ENST00000503142.1
6. Query= ENST00000503142.1
7. Query 8 THSLRYFRLGVSDPIHGVPEFISVGYVDSHPITTYDSVTQQKEPRAPWMAENLVPDHWER 187
8. Query 188 YTQLLKGWQQMFRVELKRQQRHYNHSGSHTYQRMIGCELLEDGSTTGFLQYAYDGQNFLI 367
9. Query 368 FNKDTLS*LAVDNVAHTIKRAREANQHELQYQKNWLEEECIA*LKRFLEYGKDTQQ 535
10. Query= ENST00000612670.1
11. Query 1 MVFTQAPAEIMGHLRICSLLARQCLAEFLGVFVLMLLTQGAVAQAVTSGETKGNFFTMFL 180
12. Query 181 AGSLAVTIAIYVGGNVSG 234
13. Query= MSTRG.3309.1
then
awk 'NR>1&&!/Query /&&prev!~/Query /{print prev}{prev=$0}END{if(prev!~/Query /){print prev}}' file.txt
output
1. Query= ENST00000641267.1
2. Query= ENST00000641448.1
3. Query= MSTRG.3294.1
4. Query= ENST00000435134.2
5. Query= ENST00000503142.1
13. Query= MSTRG.3309.1
Explanation: I use the prev variable to store the previous line. If the current line does not match 'Query ' and the previous line does not match 'Query ' either, I print the previous line. Because printing always lags one line behind, the last line has to be handled separately, for which I use END.
(tested in GNU Awk 5.0.1)
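For readers who find the one-liner dense, the same logic can be laid out one rule per line. This is a sketch on a made-up sample, not a tested replacement. The NR > 0 guard in END is an addition that is not in the original command: it only avoids printing an empty line when the input is empty.

```shell
# Made-up sample: two headers, a two-line "Query " block, then one more header.
sample=$(mktemp)
printf '%s\n' 'Query= A' 'Query= B' 'Query 1 MVFT 60' 'Query 61 AGSL 80' 'Query= C' > "$sample"

result=$(awk '
  # Print the held-back line only if neither it nor the current line matches.
  NR > 1 && !/Query / && prev !~ /Query / { print prev }
  # Hold the current line back until its successor has been seen.
  { prev = $0 }
  # The last line has no successor, so only its own content decides.
  # The NR > 0 guard is an addition for empty input (not in the original).
  END { if (NR > 0 && prev !~ /Query /) print prev }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Unlike the tac-based approach, this reads the file once in its original order and keeps only a single line in memory.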