I have "help texts" as popup macros on some of our Confluence sites, and I have to export the sites as HTML (so attachments are included) without these help popups. My idea is to run a Perl script after the export to remove all help texts, which are rendered with the HTML tag <aui-inline-dialog></aui-inline-dialog>. So I need to remove all these blocks, including the content between the tags, from all HTML files in a folder.
So far I have put together these few lines of code:
$dirname = '.';
opendir(DIR, $dirname) or die "Could not open $dirname\n";
while ($filename = readdir(DIR)) {
$perl -ne "print unless /^<aui-inline-dialog>/ .. /^<aui-inline-dialog>$/" "$filename\n";
}
closedir(DIR);
Sadly, I get the following error:
C:\Confluence-space-export-081756-489.html\DATS>perl Perl_Script_for_Removal.pl
syntax error at Perl_Script_for_Removal.pl line 5, near "-ne"
Execution of Perl_Script_for_Removal.pl aborted due to compilation errors.
I have never used Perl before and just can't figure out how to proceed. I read something about a -i switch for perl that edits files in place, but I don't know how to use it correctly. I'm on Windows 10 with the Strawberry Perl command line tool. Any suggestions?
Thanks and greetings from Germany
So I took the advice and solved this problem with a Python script instead of Perl. Here is the Python script, and thank you all for the input:
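import os
from bs4 import BeautifulSoup

directory = './'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        # Read and parse first: opening with 'w' right away would truncate
        # the file before it could be read.
        # Assumption: the exported files are UTF-8 encoded.
        with open(fname, encoding='utf-8') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Remove every <aui-inline-dialog> element together with its content.
        for tag in soup.find_all('aui-inline-dialog'):
            tag.decompose()
        # Write the cleaned document back to the same file.
        with open(fname, 'w', encoding='utf-8') as f:
            f.write(str(soup))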
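To confirm that the export is clean after running the script, a quick check could count any leftover dialogs per file. This is only a sketch using the standard library; the names TagCounter, count_tags and leftover_report are made up for illustration, and it assumes the exported files are UTF-8 encoded.

```python
from html.parser import HTMLParser
from pathlib import Path


class TagCounter(HTMLParser):
    """Counts opening tags with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == self.tag_name:
            self.count += 1


def count_tags(html, tag_name="aui-inline-dialog"):
    """Return how many opening tags named tag_name occur in the HTML string."""
    counter = TagCounter(tag_name)
    counter.feed(html)
    counter.close()
    return counter.count


def leftover_report(directory="."):
    """Map each .html file name in directory to the number of dialogs left."""
    report = {}
    for path in sorted(Path(directory).glob("*.html")):
        # Assumption: the export is UTF-8 encoded.
        report[path.name] = count_tags(path.read_text(encoding="utf-8"))
    return report


if __name__ == "__main__":
    for name, count in leftover_report().items():
        print(f"{name}: {count}")
```

Run from the export folder, every file should report 0 once the removal script has done its job.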
CodePudding user response:
You cannot switch to shell in the middle of a Perl script. Keep writing Perl:
my $dirname = '.';
opendir my $dir, $dirname or die "Could not open $dirname: $!\n";
while (my $filename = readdir $dir) {
    # Skip '.', '..' and anything else that is not an HTML file.
    next unless $filename =~ /\.html$/;
    open my $file, '<', "$dirname/$filename" or die "Could not open $filename: $!\n";
    while (<$file>) {
        print unless /^<aui-inline-dialog>/ .. /^<aui-inline-dialog>$/;
    }
}
closedir $dir;
Also, doesn't the closing tag start with </?
Moreover, using regexes to parse HTML/XML is a bad idea. Use a parser instead.
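To illustrate why, here is a rough Python sketch comparing the two approaches. strip_by_lines mimics the line-based range match (with the closing tag corrected to </aui-inline-dialog>), and DialogStripper is a hypothetical helper built on the standard library's html.parser. It is not a complete HTML serializer (processing instructions, for example, are dropped), so treat it as a demonstration rather than a finished tool.

```python
import re
from html.parser import HTMLParser

TAG = "aui-inline-dialog"


def strip_by_lines(html):
    """Drop lines from one starting with the bare opening tag through one
    consisting of the closing tag, like the line-based range match."""
    kept, skipping = [], False
    for line in html.splitlines():
        if not skipping and re.match(rf"<{TAG}>", line):
            skipping = True
        if not skipping:
            kept.append(line)
        elif re.match(rf"</{TAG}>$", line):
            skipping = False
    return "\n".join(kept)


class DialogStripper(HTMLParser):
    """Re-emits the document, skipping everything inside dialog elements."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # greater than 0 while inside a dialog element

    def handle_starttag(self, tag, attrs):
        if tag == TAG:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == TAG:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing dialog is dropped.
        if self.depth == 0 and tag != TAG:
            self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if self.depth == 0:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if self.depth == 0:
            self.out.append(f"<!{decl}>")


def strip_with_parser(html):
    parser = DialogStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    # The dialog has attributes and sits in the middle of a line.
    sample = ('<p>Intro <aui-inline-dialog id="help-1">'
              '<b>Help</b> text</aui-inline-dialog>end</p>')
    print(strip_by_lines(sample))     # dialog is still there
    print(strip_with_parser(sample))  # dialog and its content are gone
```

The line-based filter only works when each tag sits alone at the start of its own line and carries no attributes, which an HTML export does not guarantee. The parser does not care about line breaks or attributes.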
CodePudding user response:
Here's a safe way to do it in Perl, using modules (XML::Twig and Path::Tiny) included in the Strawberry Perl distribution. It's basically the equivalent of the Python version using BeautifulSoup.
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
use Path::Tiny;
use XML::Twig;
my $dir = path(".");
for my $file ($dir->children(qr/\.html$/)) {
    say "Processing $file";
    XML::Twig->new(
        twig_roots => { 'aui-inline-dialog' => sub { $_->delete } },
        twig_print_outside_roots => 1
    )->parsefile_html_inplace($file);
}