Delete all paragraphs that begin with time (HH:mm) and end with specific string from multiple files

Time:01-24

I need to delete any paragraphs/blocks of text that begin with a time in HH:mm format and end with a specific string from a large number of files in a folder. Each paragraph that needs deleting ends with the string #file. Between each paragraph is a blank line. Is it possible to delete everything between these two? Sample file as follows:

00:00  
-Paragraph one. Can be multiple  
lines. Paragraph one. Don't delete  
this paragraph. #No

19:30  
-Paragraph two.  
-Can be multiple lines.  
-Delete this paragraph. #file

13.30  
-Paragraph three. Delete this. #file

So ideally what would be left is:

00:00  
-Paragraph one. Can be multiple  
lines. Paragraph one. Don't delete  
this paragraph. #No

The paragraphs won't ever be the first paragraph of the document, but they could be the last.

I'm no expert so I've been trying things I found online with no luck. Thanks for any help you can give me!

CodePudding user response:

Here is an alternative. Not 100% clear what the requirements for the first paragraph are, so I've printed it unconditionally.

#!/usr/bin/perl

use warnings;
use strict;

# set paragraph mode
local $/ = "";

# always print first paragraph
print scalar <DATA> ;

while (<DATA>)
{
    print 
        unless /^\d\d:\d\d.*#file\s*$/s;
}

__DATA__
01:01
keep this text even though it has #file

another paragraph with no data

01:00
delete this #file

07:057
delete this as well #file

keep this paragraph

02:33
keep me

Running this produces the following output:

01:01
keep this text even though it has #file

another paragraph with no data

keep this paragraph

02:33
keep me

CodePudding user response:

Here's a solution in Perl. As you haven't shown us what you've tried, I'm not going to explain how it works. But it's written as a Unix filter.

#!/usr/bin/perl

use strict;
use warnings;

my $buffer;

while (<>) {
  if (/^\d\d:\d\d\s*$/) {
    end_of_para($buffer);

    $buffer = $_;
  } else {
    $buffer .= $_;
  }
}

end_of_para($buffer);

sub end_of_para {
  my ($para) = @_;

  if ($para and $para !~ /#file\s*\z/) {
    print $para;
  }
}

Update: You've changed the sample input file. That makes things far simpler.

#!/usr/bin/perl

use strict;
use warnings;

local $/ = '';

while (<>) {
  print unless /#file\s*\z/;
}

CodePudding user response:

I would use GNU AWK for this task in the following way. Let the content of file.txt be

00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No

19:30
-Paragraph two.
-Can be multiple lines.
-Delete this paragraph. #file

13.30
-Paragraph three. Delete this. #file

then

awk 'BEGIN{RS=""}!/#file/' file.txt

gives output

00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No

Explanation: I set RS (the record separator) to the empty string, which triggers paragraph mode, so records are separated by one or more blank lines. Then I select the records that do not (!) contain #file. Note that this matches #file anywhere in the paragraph, not only at its end. If more than one paragraph is kept, there will be no blank line between them in the output; if you want one, also set ORS, i.e. replace RS="" with RS="";ORS="\n\n".

(tested in GNU Awk 5.0.1)

CodePudding user response:

Using any awk in any shell on every Unix box:

$ awk -v RS= -v ORS='\n\n' '!/#file$/' file
00:00
-Paragraph one. Can be multiple
lines. Paragraph one. Don't delete
this paragraph. #No
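
The awk commands above test only for #file. If the HH:mm prefix from the question should be required as well, both conditions can be combined. This is a sketch under that assumption; the sample text is invented to show the two cases that behave differently (a #file in the middle of a timed paragraph, and an untimed paragraph ending in #file).

```shell
#!/bin/sh
# Throwaway sample file; its contents are invented for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
00:00
-Mentions #file in the middle, keep.

19:30
-Delete this. #file

No time here, keep. #file
EOF

# RS= selects paragraph mode; ORS puts a blank line after each kept paragraph.
# A paragraph is dropped only if it starts with HH:mm AND ends with #file
# (optionally followed by spaces or tabs).
result=$(awk -v RS= -v ORS='\n\n' \
  '!(/^[0-9][0-9]:[0-9][0-9]/ && /#file[ \t]*$/)' "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

With this sample only the 19:30 paragraph is removed; the other two are printed with a blank line between them.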

CodePudding user response:

This might work for you (GNU sed):

sed -En '/^[0-9]{2}:[0-9]{2}/{:a;$!{N;/\n$/!ba};/#file\n?$/!p}' file

Turn on extended regexps (-E) and turn off implicit printing (-n).

Starting at a line that begins with HH:MM, gather lines up to the next blank line or the end of the file. If the gathered paragraph does not end with #file, print it.

Repeat.
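
To see those steps in action, here is a small run of the same command against an invented sample (GNU sed assumed, as in the answer). One thing the demo makes visible: because implicit printing is off and only paragraphs starting with HH:MM are ever printed, a paragraph without a leading time is dropped as well. That matches the question's data, but is worth knowing.

```shell
#!/bin/sh
# Throwaway sample file; its contents are invented for this demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
00:00
-Keep. #No

19:30
-Delete. #file

Untimed paragraph.

08:00
-Last one, keep.
EOF

# Same command as above: gather each HH:MM paragraph, print it unless it ends in #file.
out=$(sed -En '/^[0-9]{2}:[0-9]{2}/{:a;$!{N;/\n$/!ba};/#file\n?$/!p}' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

The 00:00 and 08:00 paragraphs are printed; the 19:30 paragraph and the untimed one are not.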
