I have a file with the following text:
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
I want to do the following:
- Whenever a line starts with <b>a:</b> <a class='a', I want to copy the text between the > symbol after <a class='a' and </a>; it must be stored in a[1];
- Similarly, whenever a line starts with <b>b:</b> <a class='b', I want to copy the text between the > symbol after <a class='b' and </a>; it must be stored in b[1];
- Whenever a line contains <div class='start'>, I want to create the variable t whose value starts with the text that occurs between <div class='start'> and the end of this line, then set flag to 1;
- If the value of flag is already 1 and the current line does not start with <br><br><b>end</b>, I want to append the current line to the current value of the variable t (using the space symbol as separator);
- If the value of flag is already 1 and the current line starts with <br><br><b>end</b>, I want to concatenate the three current values of a[1], b[1] and t (using ; as separator) and print the result to the output file, then set flag to 0, then clear the variable t.
I used the following code (for gawk 4.0.1):
gawk 'BEGIN {flag = 0; t = ""; }
{
    if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
    {
        match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a);
    };
    if ($0 ~ /^<b>b:<\/b> <a class=\x27b\x27/ )
    {
        match($0, /^<b>b:<\/b> <a class=\x27b\x27 href=\x27\/b\/[0-9]{1,}\x27>(.*)<\/a>/, b);
    };
    if ($0 ~ /<div class=\x27start\x27>/ )
    {
        match($0, /^.*<div class=\x27start\x27>(.*)$/, s);
        t = s[1];
        flag = 1;
    };
    if (flag == 1){
        if ($0 ~ /^<br><br><b>end<\/b>/)
        {
            str = a[1] ";" b[1] ";" t;
            print(str) > "output.txt";
            flag = 0; str = ""; t = "";
        }
        else {
            t = t " " $0
        }
    }
}' input.txt
I was expecting the following output:
a1;b2;123 <br>ghij. <br>klmn
But the output is:
;;123 <b>d:</b> "ef"<br><br><div class='start'>123 <br>ghij. <br>klmn
Why are a[1] and b[1] empty? Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output? How can I fix the code to obtain the expected output?
CodePudding user response:
Demonstrating that gawk's regexes don't behave like Perl's:
perl:
$ echo aaaab | perl -nE '/a*(a+b)/ && say $1'
ab
$ echo aaaab | perl -nE '/a*?(a+b)/ && say $1'
aaaab
a*? matched the shortest sequence of zero or more a's, and the greedy a+ consumed the rest.
gawk:
$ echo aaaab | gawk 'match($0, /a*(a+b)/, m) {print m[1]}'
ab
$ echo aaaab | gawk 'match($0, /a*?(a+b)/, m) {print m[1]}'
ab
Not the same behaviour: in gawk, a*? is still greedy.
CodePudding user response:
Why (...) a[1] (...) empty?
The match function returns 0 if no match was found, which makes it easy to check whether that is the case. I took the part that fills the a array and altered it a bit:
{
    if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
    {
        print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}>(.*)<\/a>/, a);
    }
}
then ran it against
<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>
and got the output
2 0
so the condition in the if worked as expected, since the line with <b>a:</b> ... is the 2nd line; however, no match was found. This means your regular expression is wrong. On examination, it is missing one single quote (the \x27 before the closing >); it should be
/^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/
With that quote in place,
{
    if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
    {
        print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a);
        print a[1];
    }
}
gives the output
2 1
a1
(tested in gawk 4.2.1)