How to extract a fragment of text from file?

I have a file with the following text:

<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>

I want to do the following:

Whenever a line starts with <b>a:</b> <a class='a', I want to copy the text between the > symbol that closes that <a class='a' ...> tag and the following </a>; it must be stored in a[1];
Similarly, whenever a line starts with <b>b:</b> <a class='b', I want to copy the text between the > symbol that closes that <a class='b' ...> tag and the following </a>; it must be stored in b[1];
Whenever a line contains <div class='start'>, I want to create the variable t whose value starts with the text that occurs between <div class='start'> and the end of this line, then set flag to 1;
If the value of flag is already 1 and the current line does not start with <br><br><b>end</b>, I want to append the current line to the current value of the variable t (using a space as the separator);
If the value of flag is already 1 and the current line starts with <br><br><b>end</b>, I want to concatenate the current values of a[1], b[1] and t (using ; as the separator) and print the result to the output file, then set flag to 0, then clear the variable t.

I used the following code (for gawk 4.0.1):

gawk 'BEGIN {flag = 0; t = ""; } 
{  
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a);
};
if ($0 ~ /^<b>b:<\/b> <a class=\x27b\x27/ )
{
match($0, /^<b>b:<\/b> <a class=\x27b\x27 href=\x27\/b\/[0-9]{1,}\x27>(.*)<\/a>/, b);
};
if ($0 ~ /<div class=\x27start\x27>/ )
{
match($0, /^.*<div class=\x27start\x27>(.*)$/, s);
t = s[1];
flag = 1;
};
if (flag == 1){
    if ($0 ~ /^<br><br><b>end<\/b>/) 
    {
    str = a[1] ";" b[1] ";" t;
    print(str) > "output.txt";
    flag = 0; str = ""; t = "";
    } 
    else {
    t = t " " $0
    }
}
}' input.txt

I was expecting the following output:

a1;b2;123 <br>ghij. <br>klmn

But the output is:

;;123 <b>d:</b> "ef"<br><br><div class='start'>123 <br>ghij. <br>klmn

Why are a[1] and b[1] empty? Why does <b>d:</b> "ef"<br><br><div class='start'> occur in the output? How can I fix the code to obtain the expected output?

CodePudding user response：

A demonstration that gawk's regexes do not behave like Perl's: gawk has no non-greedy quantifiers.

perl:

$ echo aaaab | perl -nE '/a*(a+b)/ && say $1'
ab
$ echo aaaab | perl -nE '/a*?(a+b)/ && say $1'
aaaab

a*? matched the shortest sequence of zero or more a's, and the greedy a+ consumed the rest.

gawk:

$ echo aaaab | gawk 'match($0, /a*(a+b)/, m) {print m[1]}'
ab
$ echo aaaab | gawk 'match($0, /a*?(a+b)/, m) {print m[1]}'
ab

Not the same behaviour: in gawk, a*? is still greedy.

CodePudding user response：

Why(...)a[1](...)empty?

The match function returns 0 when no match is found, which makes it easy to check whether that is the case. I took the part of your code that fills the array a and altered it slightly to print the return value:

{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}>(.*)<\/a>/, a);
}
}

then ran it against this input:

<div>
<b>a:</b> <a class='a' href='/a/1'>a1</a><br>
<b>b:</b> <a class='b' href='/b/2'>b2</a><br>
<b>c:</b> <a class='c' href='/c/3/'>c3</a><br>
<b>d:</b> "ef"<br><br><div class='start'>123
<br>ghij.
<br>klmn
<br><br><b>end</b>
</div>
</div>

and got the output

2 0

So the condition in the if worked as expected (the a: line is the 2nd line), but match found nothing. This means your regular expression is wrong: on closer inspection, it is missing one single quote (\x27) before the >. It should be

/^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/

then

{
if ($0 ~ /^<b>a:<\/b> <a class=\x27a\x27/ )
{
print NR, match($0, /^<b>a:<\/b> <a class=\x27a\x27 href=\x27\/a\/[0-9]{1,}\x27>(.*)<\/a>/, a);
print a[1];
}
}

gives the output

2 1
a1

(tested in gawk 4.2.1)