Using xmlstarlet on non-XML-compliant documents (XHTML)

Time:10-04

I have non-XML-compliant documents (XHTML pages) with improperly closed img, br, and hr tags. I need to close these tags properly with '/>'. I tried xmlstarlet, which does the job but alters the XML declaration header. The original code is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en" lang="en">
    <head>
        <title> </title>
        <link rel="stylesheet" type="text/css" href="style.css" />
    </head>
<body>

If I run the command xmlstarlet fo --recover --html file.xhtml, the output is incorrect: it has two declaration lines:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html>
<?xml version="1.0" encoding="UTF-8" standalone="no"??>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en" lang="en">
    <head>
        <title> </title>
        <link rel="stylesheet" type="text/css" href="style.css"/>
    </head>
<body>

If I run xmlstarlet fo --omit-decl --recover --html file.xhtml, the output is also incorrect, because the XML declaration needs to be the first line:

<!DOCTYPE html>
<?xml version="1.0" encoding="UTF-8" standalone="no"??>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xml:lang="en" lang="en">
    <head>
        <title> </title>
        <link rel="stylesheet" type="text/css" href="style.css"/>
    </head>
<body>

So I need to do post-processing to swap the first and second lines. What bash command can help here? Please specify the command syntax for batch-processing files and editing them in place. P.S. Why does xmlstarlet put two question marks at the end of the declaration ("no"??>)?

CodePudding user response:

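I suggest appending | sed -n '1{h;d};2{p;g};p' to the xmlstarlet fo --omit-decl command. The sed script holds line 1 without printing it, prints line 2 followed by the held line 1, and passes all remaining lines through unchanged.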
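
The question also asks for batch processing and in-place editing. A minimal sketch of one way to do that with the sed expression above, assuming xmlstarlet is installed and the input always has at least two lines. The function names are hypothetical, not part of any tool:

```shell
# Hypothetical helper names; only the sed expression comes from the answer above.

# Swap lines 1 and 2 of stdin and pass the rest through.
# Assumes at least two input lines (a one-line input would produce no output).
swap_first_two_lines() {
  sed -n '1{h;d};2{p;g};p'
}

# Reformat one file via a temporary file, then overwrite the original.
# The body runs in a subshell so the variables stay local to the function.
fix_xhtml_in_place() (
  file=$1
  tmp=$(mktemp) || exit 1
  xmlstarlet fo --omit-decl --recover --html "$file" | swap_first_two_lines > "$tmp"
  if [ -s "$tmp" ]; then
    # Overwrite only when some output was produced; cat keeps the
    # original file's permissions, unlike mv from a mktemp file.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    exit 1
  fi
)

# Usage (assumes the files to fix match *.xhtml):
#   for f in *.xhtml; do fix_xhtml_in_place "$f"; done
```

Going through a temporary file avoids redirecting the output onto the same file that xmlstarlet is still reading, which would truncate it.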

CodePudding user response:

This might work for you (GNU sed):

sed -zE 's/(.*)\n(.*)/\2\n\1/m' file

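Slurp the file into memory (-z makes sed treat the whole file as a single record) and swap the contents of lines 1 and 2.

N.B. With the m flag, . does not match a newline, so each .* matches the contents of a single line.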
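
For in-place editing of several files, the same expression can be combined with GNU sed's -i option, which edits each file separately. This is a sketch that assumes GNU sed and files that already contain the xmlstarlet output; the wrapper name is hypothetical:

```shell
# Hypothetical wrapper; requires GNU sed, since -z, -i without a suffix,
# and the m flag are GNU extensions.
swap_first_two_lines_in_place() {
  sed -i -zE 's/(.*)\n(.*)/\2\n\1/m' "$@"
}

# Usage (assumes the files to fix match *.xhtml):
#   swap_first_two_lines_in_place *.xhtml
```

Because the substitution has no g flag, only the first pair of lines in each file is swapped.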
