I need to delete any paragraphs/blocks of text that begin with a time in HH:mm format and end with a specific string from a large number of files in a folder. Each paragraph that needs deleting ends with the string #file. Between each paragraph is a blank line. Is it possible to delete everything between these two? Sample file as follows:
00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No

19:30
-Paragraph two.
-Can be multiple lines.
-Delete this paragraph. #file

13.30
-Paragraph three. Delete this. #file
So ideally what would be left is:
00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No
The paragraphs won't ever be the first paragraph of the document, but they could be the last.
I'm no expert so I've been trying things I found online with no luck. Thanks for any help you can give me!
CodePudding user response:
Here is an alternative. Not 100% clear what the requirements for the first paragraph are, so I've printed it unconditionally.
#!/usr/bin/perl
use warnings;
use strict;

# set paragraph mode
local $/ = "";

# always print first paragraph
print scalar <DATA>;

while (<DATA>)
{
    print
        unless /^\d\d:\d\d.*#file\s*$/s;
}

__DATA__
01:01
keep this text even though it has #file

another paragraph with no data

01:00
delete this #file

07:057
delete this as well #file

keep this paragraph

02:33
keep me
Running it produces this output:
01:01
keep this text even though it has #file

another paragraph with no data

keep this paragraph

02:33
keep me
CodePudding user response:
Here's a solution in Perl. As you haven't shown us what you've tried, I'm not going to explain how it works. But it's written as a Unix filter.
#!/usr/bin/perl
use strict;
use warnings;

my $buffer;

while (<>) {
    if (/^\d\d:\d\d\s*$/) {
        end_of_para($buffer);
        $buffer = $_;
    } else {
        $buffer .= $_;
    }
}
end_of_para($buffer);

sub end_of_para {
    my ($para) = @_;
    if ($para and $para !~ /#file\s*\z/) {
        print $para;
    }
}
Update: You've changed the sample input file. That makes things far simpler.
#!/usr/bin/perl
use strict;
use warnings;

local $/ = '';

while (<>) {
    print unless /#file\s*\z/;
}
CodePudding user response:
I would harness GNU AWK for this task in the following way. Let the content of file.txt be
00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No

19:30
-Paragraph two.
-Can be multiple lines.
-Delete this paragraph. #file

13.30
-Paragraph three. Delete this. #file
then
awk 'BEGIN{RS=""}!/#file/' file.txt
gives output
00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No
Explanation: I set RS (record separator) to the empty string, which triggers paragraph mode, so records are separated by one or more blank lines. Then I select the records which do not (!) contain #file. If more than one paragraph is kept, there will be no blank line between them in the output; if you want one, replace RS="" with RS=ORS="\n\n".
(tested in GNU Awk 5.0.1)
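The filter above drops every paragraph that contains #file, whether or not it starts with a time. If the time check from the question matters, a minimal sketch along the same lines could look like the following. The function name keep_paragraphs is hypothetical, and accepting a dot as well as a colon in the time is an assumption made only because the sample contains 13.30.

```shell
# keep_paragraphs FILE (hypothetical helper name): print every paragraph of
# FILE except those that start with a time and end with #file.
keep_paragraphs() {
  awk -v RS= -v ORS='\n\n' '
    # In paragraph mode each record is one whole paragraph.
    # Assumption: the time may use ":" or "." as its separator (see 13.30).
    !(/^[0-9][0-9][.:][0-9][0-9]/ && /#file[ \t]*$/)
  ' "$1"
}
```

Called as keep_paragraphs file.txt, this should keep a paragraph that ends with #file but has no leading time, which the shorter one-liner would remove.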
CodePudding user response:
Using any awk in any shell on every Unix box:
$ awk -v RS= -v ORS='\n\n' '!/#file$/' file
00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No
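The question mentions a large number of files in a folder, while the command above reads one file and writes to standard output. A rough sketch of running it over a whole folder follows; the function name clean_folder and the .txt extension are assumptions, so adjust the glob to your files and try it on a copy first.

```shell
# clean_folder DIR (hypothetical helper name): rewrite every *.txt file in
# DIR with its #file paragraphs removed.
# Assumption: the files end in .txt; change the glob if yours do not.
clean_folder() {
  for f in "$1"/*.txt; do
    [ -f "$f" ] || continue          # the glob matched nothing
    tmp=$(mktemp) || return 1
    # Filter into a temporary file first; overwrite only if awk succeeded.
    if awk -v RS= -v ORS='\n\n' '!/#file$/' "$f" > "$tmp"; then
      cp "$tmp" "$f"                 # cp keeps the original file permissions
    fi
    rm -f "$tmp"
  done
}
```

Because ORS is two newlines, each rewritten file ends with one extra blank line after its last paragraph.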
CodePudding user response:
This might work for you (GNU sed):
sed -En '/^[0-9]{2}:[0-9]{2}/{:a;$!{N;/\n$/!ba};/#file\n?$/!p}' file
Turn on extended regexps (-E) and turn off implicit printing (-n).
Gather up lines starting with HH:MM and ending in a blank line or end of file.
If the last string is not #file, print the result.
Repeat.
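For readers who find the one-liner dense, here is the same program laid out one command per line with comments, as a sketch of the steps just described. The wrapper name keep_timed_paragraphs is hypothetical; the sed commands themselves are unchanged.

```shell
# keep_timed_paragraphs FILE (hypothetical helper name): the sed program
# from above, spread over several lines.
keep_timed_paragraphs() {
  sed -En '
    # Act only on a line that starts with a time such as 19:30.
    /^[0-9]{2}:[0-9]{2}/{
      :a
      # Unless this is the last input line, append the next line and
      # loop until the gathered text ends with an empty line.
      $!{
        N
        /\n$/!ba
      }
      # Print the gathered paragraph unless it ends with #file.
      /#file\n?$/!p
    }
  ' "$1"
}
```

Since -n turns off implicit printing, a paragraph that does not begin with an HH:MM line is never printed. That is why the 13.30 paragraph in the sample also disappears.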