What is the best way to get a target substring through regex in a bash script?

I am writing a script to automate extracting data from a large number of text files. My current problem is getting a target id from .html files, as in the example below:

 \ \ <body id="some_id" >

The script should get "some_id" and check that it is valid (an ID is not allowed to start with a number). If it is not valid, the script should fix the id in the .html file and in other related files (toc.ncx etc.). The main command I use is sed, but I think my method is cumbersome. The script is below:

#!/bin/bash
for var in ./*
do
        if [[ $var =~ .*.html ]]
        then
                if grep -q -E '<body id="[0-9]+' $var
                then
                        ID="$(sed -n -E 's/\ \ <body id="[0-9]+(.*?)"\ .*/\1/gp' $var)"
                        echo $ID
                        sed -i -E 's/<body\ id="([0-9]+)/<body id="id\1/g' $var
                        sed -i -E "s/$ID/id$ID/g" ./../toc.ncx
                        echo $var
                fi
        fi
done

In other words, I don't know the ID in each html file in advance, but I know the rule for IDs. For example:

\ \ <body id="123char" >

"123char" is invalid because an ID is not allowed to start with a number, so I need to fix the ID by giving it a letter prefix, for example "idchar". The html then becomes:

\ \ <body id="idchar" >

At the same time, I need to update the id in the other files (change "123char" to "idchar").

PS: as shown above, this script is aimed at fixing .epub files that can't pass the epub validator. Many e-books converted from mobi to epub have this bug (calibre, convertio, etc.).

CodePudding user response：

Parsing HTML with regex is not easy, and regex is not the right tool for the job.
You can use pup, which is an HTML parser.

input

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <h1>Here is h1 tag</h1>
</body>
</html>

test

pup 'h1 text{}' < index.html

output

Here is h1 tag

If for any reason you prefer to use regex, Perl is much more suitable than bash. Given this input:

sample 1

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body id="some_id" >
    <h1>this is h1 tag</h1>
</body>
</html>

with this perl one-liner

perl -lne '/<body\s+id="\K[^"]+/ && print $&' index.html

the output would be:

some_id

sample 2

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body id="some_id" >
    <h1 id="number-1">this is h1 tag</h1>
    <h1 id="number-2">this is h1 tag</h1>
    <h1 id="number-3">this is h1 tag</h1>
</body>
</html>

Perl one-liner

perl -lne '/<h1\s+id="\K[^"]+/ && print $&' index.html

output

number-1
number-2
number-3

If you prefer grep, you can use the -P option (GNU grep) to enable PCRE (Perl Compatible Regular Expressions):

grep -oP '<h1\s+id="\K[^"]+' index.html

# output
number-1
number-2
number-3
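
When the goal is only to find which files need fixing, the id itself does not have to be captured, so plain ERE is enough and -P is not required. A minimal sketch, assuming the <body> tag and its id attribute sit on one line; list_invalid_body_ids is a hypothetical helper name, not something from the question:

```bash
#!/bin/bash
# Hypothetical helper: list the HTML files in a directory whose <body> id
# starts with a digit. Assumes <body ... id="..."> is on a single line.
list_invalid_body_ids() {
    local dir=${1:-.}
    # -l prints only the names of matching files; 2>/dev/null hides the
    # error grep prints when the glob matches no file.
    grep -lE '<body[[:space:]]+id="[0-9]' "$dir"/*.html 2>/dev/null
}
```

Calling list_invalid_body_ids . prints one file name per line, which can then be fed to the extraction commands above.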

Using a bash function to get value of an id for a tag:

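#!/bin/bash

# Print the id attribute of the given tag in the given file.
# Status messages go to stderr; only the id goes to stdout.
function match_html_id(){
    {
        local tag=$1;
        local regex="<${tag}\s+id=\"\K[^\"]+";
        local filename="$2";
        local result='';

        if grep -P "$regex" "$filename" > /dev/null 2>&1; then
            # Quote both arguments so spaces in them are preserved
            result=$(grep -oP "$regex" "$filename");
            echo 'match found';
        else 
            echo 'match not found';
        fi
    } >&2;

    echo "$result";
}

declare -r r=$(match_html_id body index.html);
echo r: "'$r'"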

Output for sample 1 or sample 2 on the body tag:

match found
r: 'some_id'
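
Once the id is in a variable, the validity check from the question (an id must not start with a number) can be done with bash's own =~ operator. The following is only a sketch: fix_id is a hypothetical helper, and swapping the leading digits for the prefix "id" follows the "123char" to "idchar" example in the question.

```bash
#!/bin/bash
# Hypothetical helper: given an id, print a valid replacement.
# Leading digits are swapped for the prefix "id" (123char -> idchar);
# an id that does not start with a digit is printed unchanged.
fix_id() {
    local id=$1
    local re='^[0-9]+'
    if [[ $id =~ $re ]]; then
        # BASH_REMATCH[0] holds the run of leading digits; strip it.
        printf 'id%s\n' "${id#"${BASH_REMATCH[0]}"}"
    else
        printf '%s\n' "$id"
    fi
}

fix_id 123char   # prints idchar
fix_id some_id   # prints some_id
```

Note that an id made only of digits would become just "id", so two such ids in one book would collide; that case needs a different naming rule.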

CodePudding user response：

This has been repeated here countless times already; it's a really bad idea to parse/edit HTML with regex! An HTML parser like xidel would be better suited. In fact, with its integrated EXPath File module one single call could be all you need:

$ xidel -se '
  for $x in file:list(.,false(),"*.html")
  where matches(doc($x)//body/@id,"^\d")
  return
  file:write(
    $x,
    x:replace-nodes(
      doc($x)//body/@id,
      function($x){attribute {name($x)} {replace($x,"^\d+","id")}}
    ),
    {"method":"html","indent":true()}
  )
'

file:list(.,false(),"*.html") returns all HTML-files in the current dir.
matches(doc($x)//body/@id,"^\d") restricts that to only those HTML-files whose body id attribute value starts with a digit.
x:replace-nodes( [...] ) replaces the leading digits of that value with the string "id".
file:write( [...] ) replaces the original HTML-file.
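
The call above rewrites only the HTML files, while the question also asks for matching updates in toc.ncx. A minimal bash sketch for that step, assuming the old and new ids are already known (for example, collected before the rewrite), that toc.ncx refers to ids as file.html#ID followed by a double quote, and that ids contain only letters, digits, "_" and "-". update_toc_refs is a hypothetical helper name, and GNU sed is assumed for -i:

```bash
#!/bin/bash
# Hypothetical helper: rewrite references to an old id in toc.ncx.
# Assumes GNU sed (-i), references of the form file.html#ID", and ids
# free of regex metacharacters.
update_toc_refs() {
    local old=$1 new=$2 toc=$3
    # Anchor on "#" and the closing quote so that "123char" does not
    # also rewrite a longer id such as "123charlie".
    sed -i -E "s/#${old}\"/#${new}\"/g" "$toc"
}
```

For example, update_toc_refs 123char idchar ./toc.ncx changes src="ch1.html#123char" to src="ch1.html#idchar" and leaves other ids alone.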