I am writing a script to automate processing of a large number of text files. Currently, my problem is getting a target id from .html files, for example:
  <body id="some_id" >
My script should extract "some_id" and check that it is valid (an ID is not allowed to start with a number); otherwise it should fix the id in the .html file and in other related files (toc.ncx etc.). The main command I use is sed (but I think my method is cumbersome). The script is below:
#!/bin/bash
for var in ./*
do
    if [[ $var =~ .*.html ]]
    then
        if grep -q -E '<body id="[0-9]+' $var
        then
            ID="$(sed -n -E 's/\ \ <body id="[0-9]+(.*?)"\ .*/\1/gp' $var)"
            echo $ID
            sed -i -E 's/<body\ id="([0-9]+)/<body id="id\1/g' $var
            sed -i -E "s/$ID/id$ID/g" ./../toc.ncx
            echo $var
        fi
    fi
done
In other words, I don't know the ID in each HTML file in advance, but I know the rule for IDs. For example:
  <body id="123char" >
"123char" is invalid because an ID is not allowed to start with a number, so I need to fix the ID by replacing the leading digits with a prefix, giving "idchar". The HTML then becomes:
  <body id="idchar" >
At the same time, I need to update the id in the other files (change "123char" to "idchar").
PS: as shown above, this script is aimed at fixing .epub files that can't pass the EPUB validator; many e-books converted from mobi to epub have this bug (calibre, convertio, etc.).
CodePudding user response:
Parsing HTML with regex is not easy, and regex is not the right tool for the job.
You can use pup, which is an HTML parser.
input
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<h1>Here is h1 tag</h1>
</body>
</html>
test
pup 'h1 text{}' < index.html
output
Here is h1 tag
If for any reason you prefer to use regex, Perl is much more suitable than bash. Given this as input:
sample 1
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body id="some_id" >
<h1>this is h1 tag</h1>
</body>
</html>
with this perl one-liner
perl -lne '/<body\s+id="\K[^"]+/ && print $&' index.html
the output would be:
some_id
sample 2
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body id="some_id" >
<h1 id="number-1">this is h1 tag</h1>
<h1 id="number-2">this is h1 tag</h1>
<h1 id="number-3">this is h1 tag</h1>
</body>
</html>
Perl one-liner
perl -lne '/<h1\s+id="\K[^"]+/ && print $&' index.html
output
number-1
number-2
number-3
And if you prefer to use grep, you can use the -P option to apply PCRE (Perl Compatible Regular Expressions):
grep -oP '<h1\s+id="\K[^"]+' index.html
output
number-1
number-2
number-3
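Once an id has been extracted with any of the commands above, the rule from the question (an id must not start with a number; "123char" should become "idchar") can be checked and applied in plain bash. The following is only a minimal sketch; the function name fix_id is hypothetical and not part of any tool mentioned here.

```shell
#!/bin/bash
# Hypothetical helper: if an id starts with digits, replace those
# leading digits with the prefix "id"; otherwise return it unchanged.
fix_id() {
    local id=$1
    if [[ $id =~ ^[0-9]+(.*)$ ]]; then
        # BASH_REMATCH[1] holds whatever follows the leading digits.
        printf 'id%s\n' "${BASH_REMATCH[1]}"
    else
        printf '%s\n' "$id"
    fi
}

fix_id 123char    # prints idchar
fix_id some_id    # prints some_id
```

Note that an id consisting only of digits would be reduced to the bare prefix "id", which may collide with other ids, so a real script would probably need an extra uniqueness check.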
Using a bash function to get value of an id for a tag:
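#!/bin/bash
# Print the id value(s) matched for <tag id="..."> in a file.
# Status messages go to stderr, so only the id is captured from stdout.
function match_html_id(){
    local tag=$1
    local regex="<${tag}\s+id=\"\K[^\"]+"
    local filename=$2
    local result=''
    {
        if grep -qP "$regex" "$filename" 2>/dev/null; then
            # Quote the pattern and file name to avoid word splitting and globbing.
            result=$(grep -oP "$regex" "$filename")
            echo 'match found'
        else
            echo 'match not found'
        fi
    } >&2
    echo "$result"
}
declare -r r=$(match_html_id body index.html)
echo r: "'$r'"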
output for sample 2 or 1 on body tag
match found
r: 'some_id'
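To go from extracting the id to the fix the question asks for, the extraction can be combined with an in-place rewrite of the HTML file and of a related file such as toc.ncx. The following is a rough sketch under several assumptions: GNU grep (for -P) and GNU sed (for -i) are available, ids contain only letters, digits, "_" and "-" (so they are safe inside a sed pattern), and the related file refers to the id as a fragment like file.html#123char". The function name fix_body_id is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: fix a <body id="..."> that starts with digits in
# one HTML file and apply the same rename in a related file (e.g. toc.ncx).
fix_body_id() {
    local html=$1 related=$2
    local old new
    # Extract the body id; take the first match only.
    old=$(grep -oP '<body\s+id="\K[^"]+' "$html" | head -n 1)
    # Nothing to do unless the id starts with a digit.
    [[ $old =~ ^[0-9]+(.*)$ ]] || return 0
    new="id${BASH_REMATCH[1]}"
    # Replace the id only where it appears as a complete quoted value.
    sed -i "s/\"$old\"/\"$new\"/g" "$html"
    # Assumption: related files reference the id as a URL fragment.
    sed -i "s/#$old\"/#$new\"/g" "$related"
    echo "$html: $old -> $new"
}
```

Inside the loop from the question it could be called as fix_body_id "$var" ./../toc.ncx. Matching the whole quoted value, rather than the bare string, avoids rewriting unrelated text that merely contains the same characters.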
CodePudding user response:
This has been repeated here countless times already; it's a really bad idea to parse/edit HTML with regex! An HTML parser like xidel would be better suited. In fact, with its integrated EXPath File module one single call could be all you need:
$ xidel -se '
for $x in file:list(.,false(),"*.html")
where matches(doc($x)//body/@id,"^\d")
return
file:write(
$x,
x:replace-nodes(
doc($x)//body/@id,
function($x){attribute {name($x)} {replace($x,"^\d+","id")}}
),
{"method":"html","indent":true()}
)
'
file:list(.,false(),"*.html") returns all HTML files in the current dir.
matches(doc($x)//body/@id,"^\d") restricts that to only those HTML files whose body id attribute value starts with a number.
x:replace-nodes( [...] ) replaces the leading number of that value with the string "id".
file:write( [...] ) replaces the original HTML file.
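Because file:write overwrites the original files, it may be worth previewing which files would be touched before running it. The following bash sketch approximates the same filter (HTML files in a directory whose body id starts with a digit) with a sed regex, so it is only a rough check and not a real HTML parse; the function name list_bad_body_ids is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list HTML files in a directory whose <body> id
# starts with a digit, printing "file<TAB>id" for each one.
list_bad_body_ids() {
    local dir=${1:-.}
    local f id
    for f in "$dir"/*.html; do
        # Skip the unexpanded glob when no .html files exist.
        [[ -e $f ]] || continue
        # Capture an id attribute on the body tag that begins with a digit.
        id=$(sed -n -E 's/.*<body[^>]*[[:space:]]id="([0-9][^"]*)".*/\1/p' "$f" | head -n 1)
        if [[ -n $id ]]; then
            printf '%s\t%s\n' "$f" "$id"
        fi
    done
}
```

Running list_bad_body_ids . prints one line per offending file; the same ids are the ones that would then need updating in toc.ncx and other related files.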