Excluding leading and trailing white space in captured group with ? for greediness?

Would you please explain why this code does not generate what I was expecting? Why is the leading white space inside the spans not excluded from the captured text while the trailing white space is excluded, giving { leading space} instead of {leading space}? Or how do I exclude both the leading and trailing white space from the captured group?

set v {<span class='wj'> leading space </span> <span class='wj'> leading space </span> }
set vlist [regexp -all -inline -- {<span class='wj'>[[:space:]]*?(.+?)[[:space:]]*?</span>} $v]
# Result: {<span class='wj'> leading space </span>} { leading space} {<span class='wj'> leading space </span>} { leading space}
# Expectation/Goal: {<span class='wj'> leading space </span>} {leading space} {<span class='wj'> leading space </span>} {leading space}

If there is only one span, it works without the ?s after [[:space:]]*. For multiple spans, if +? is used instead of *? for the leading space, it works, unless there is no leading space, in which case it does not match at all; and I'm not certain all instances will have a leading space. Thus, I assume it has to do with the greediness of *, but I don't understand it.

Thank you.

set v {<span class='wj'> leading space  </span>}
set vlist [regexp -all -inline -- {<span class='wj'>[[:space:]]*(.+?)[[:space:]]*</span>} $v]
# {<span class='wj'> leading space  </span>} {leading space}

CodePudding user response：

Mixed greediness REs are deep voodoo due to the way Tcl's automata-theoretic RE engine works. I don't really understand it, and I've read that source code! (I think it's something to do with a particular automaton only being able to operate in either greedy or non-greedy mode, but I could be wrong.)
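For what it's worth, the re_syntax manual page does state one rule that seems to account for the behaviour in the question: a branch takes its preference (longest or shortest match) from the first quantified atom in it that has one. The sketch below reuses the question's sample string to show that rule in action; the variable names greedyFirst and lazyFirst are just illustrative.

```tcl
set v {<span class='wj'> leading space </span> <span class='wj'> leading space </span> }

# First quantifier is greedy (\s*), so the RE as a whole prefers the
# longest match: it should run on to the last </span> in the string.
set greedyFirst [lindex [regexp -inline -- {<span class='wj'>\s*(.+?)\s*</span>} $v] 0]

# First quantifier is non-greedy (\s*?), so the RE as a whole prefers the
# shortest match: it should stop at the first </span>.
set lazyFirst [lindex [regexp -inline -- {<span class='wj'>\s*?(.+?)\s*?</span>} $v] 0]

puts "greedy first: $greedyFirst"
puts "lazy first:   $lazyFirst"
```

This is why dropping the ?s only appears to work when there is a single span: with one span, the longest and shortest overall matches coincide.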

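The trick to making things actually work is to keep to one mode and force things in other ways. I've replaced [[:space:]] with its shorthand form, \s, and \S is the complemented form ([^[:space:]], but who wants to write all that?). Starting the capture with \S means the group cannot begin on a space, so the leading \s*? has to consume all of the leading whitespace even though it is non-greedy.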

set vlist [regexp -all -inline -- {<span class='wj'>\s*?(\S.*?)\s*?</span>} $v]

With your sample input, that sets vlist to:

{<span class='wj'> leading space </span>} {leading space} {<span class='wj'> leading space </span>} {leading space}

which is the thing you wanted.
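Since the question was unsure whether every span has a leading space, here is a small check of the same pattern against a few variations. The proc name spanTexts and the sample strings are made up for illustration; only the pattern comes from the answer.

```tcl
# Hypothetical helper: return just the captured text of each span.
proc spanTexts {html} {
    set texts {}
    # -all -inline yields pairs: whole match, then the captured group.
    foreach {whole inner} [regexp -all -inline -- {<span class='wj'>\s*?(\S.*?)\s*?</span>} $html] {
        lappend texts $inner
    }
    return $texts
}

# Illustrative inputs: padding on both sides, on neither, and on one side only.
set samples {
    {<span class='wj'> leading space </span>}
    {<span class='wj'>no padding</span>}
    {<span class='wj'>   both sides   </span> <span class='wj'>trailing only  </span>}
}
foreach s $samples {
    puts [spanTexts $s]
}
```

One caveat: a span containing only whitespace is not handled by this pattern, because \S then has to match the < of the closing tag and the match can run on to a later </span>.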

Do not do general parsing of HTML or XML with regular expressions. Use a proper library designed for the task. But scraping stuff out of a particular page for a few days is fine; the insane churn on the web won't matter then.
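For that kind of short-lived scraping, another way to stay in a single greediness mode is to capture everything between the tags and trim it afterwards, rather than asking the RE to do the trimming. This is only a sketch using the question's sample data, not something from the original answer; the variable name texts is arbitrary.

```tcl
set v {<span class='wj'> leading space </span> <span class='wj'> leading space </span> }

set texts {}
# The only quantifier is non-greedy, so each match stops at the nearest </span>.
foreach {whole inner} [regexp -all -inline -- {<span class='wj'>(.*?)</span>} $v] {
    # Strip the surrounding whitespace outside the RE.
    lappend texts [string trim $inner]
}
puts $texts
```

With this approach an empty or whitespace-only span simply produces an empty string instead of upsetting the match.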

If you were to use tDOM, for example, you'd do this sort of parsing with:

package require tdom

set doc [dom parse -html $inputDocument]; # NB: WHOLE document, not fragment
foreach item [$doc selectNodes {//span[@class='wj']}] {  # Use XPath to get the bit you want (// searches the whole tree)
    # You have to manually trim leading and trailing whitespace
    puts [string trim [$item asText]]
}