Could you please explain why this code does not produce what I expected? Why is the leading white space inside each span kept in the captured text while the trailing white space is excluded, giving { leading space} instead of {leading space}? And how do I exclude both the leading and trailing white space from the captured group?
set v {<span class='wj'> leading space </span> <span class='wj'> leading space </span> }
set vlist [regexp -all -inline -- {<span class='wj'>[[:space:]]*?(.*?)[[:space:]]*?</span>} $v]
# Result: {<span class='wj'> leading space </span>} { leading space} {<span class='wj'> leading space </span>} { leading space}
# Expectation/Goal: {<span class='wj'> leading space </span>} {leading space} {<span class='wj'> leading space </span>} {leading space}
If there is only one span, it works without the ? after each [[:space:]]*. For multiple spans, it works if ? is used instead of *? for the leading space, unless there is no leading space, in which case nothing matches at all; and I'm not certain every instance will have a leading space. So I assume it has to do with the greediness of *, but I don't understand it.
Thank you.
set v {<span class='wj'> leading space </span>}
set vlist [regexp -all -inline -- {<span class='wj'>[[:space:]]*(.*?)[[:space:]]*</span>} $v]
# {<span class='wj'> leading space </span>} {leading space}
Answer:
Mixed greediness REs are deep voodoo due to the way Tcl's automata-theoretic RE engine works. I don't really understand it, and I've read that source code! (I think it's something to do with a particular automaton only being able to operate in either greedy or non-greedy mode, but I could be wrong.)
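The trick to making things actually work is to keep to one mode and force things in other ways. I've replaced [[:space:]] with its shorthand form, \s, and \S is the complemented form ([^[:space:]], but who wants to write all that?). Starting the capture with \S forces it to begin at the first non-space character, so any leading white space has to be consumed by the \s*? in front of it.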
set vlist [regexp -all -inline -- {<span class='wj'>\s*?(\S.*?)\s*?</span>} $v]
With your sample input, that sets vlist to:
{<span class='wj'> leading space </span>} {leading space} {<span class='wj'> leading space </span>} {leading space}
which is the thing you wanted.
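To address the worry about spans that might not have a leading space, here is a small sketch that runs the same pattern over a made-up input where the white space varies. The sample string and the variable names (sample, re, texts) are assumptions for illustration only, not part of the original question.

```tcl
# Hypothetical sample: no padding, padding on both sides, trailing padding only
set sample {<span class='wj'>no padding</span> <span class='wj'>  both sides  </span> <span class='wj'>trailing only </span>}

# Same pattern as above: every quantifier is non-greedy, and \S pins the
# start of the capture to the first non-space character
set re {<span class='wj'>\s*?(\S.*?)\s*?</span>}

# -all -inline returns a flat list: full match, capture, full match, capture, ...
set texts {}
foreach {whole text} [regexp -all -inline -- $re $sample] {
    lappend texts $text
}
puts $texts
# Expected: {no padding} {both sides} {trailing only}
```

One caveat with this approach: because the capture must start with a non-space character, a span that is empty or contains only white space will not be matched on its own, and \S may instead match the < of its closing tag and run on to a later </span>.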
Do not do general parsing of HTML or XML with regular expressions. Use a proper library designed for the task. But scraping stuff out of a particular page for a few days is fine; the insane churn on the web won't matter then.
If you were to use tDOM, for example, you'd do this sort of parsing with:
package require tdom
set doc [dom parse -html $inputDocument]; # NB: WHOLE document, not fragment
# Use XPath to get the bits you want; the leading // searches the whole
# tree rather than only the children of the document root
foreach item [$doc selectNodes {//span[@class='wj']}] {
    # You have to manually trim leading and trailing whitespace
    puts [string trim [$item asText]]
}
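For the quick-scraping case, the same trim-afterwards idea can also be applied to the regexp result, which avoids depending on how the white-space quantifiers divide up the match. This is only a sketch using the sample input from the question; the variable names trimmed, whole and inner are illustrative.

```tcl
set v {<span class='wj'> leading space </span> <span class='wj'> leading space </span> }

# Capture everything inside each span with a single non-greedy quantifier
set trimmed {}
foreach {whole inner} [regexp -all -inline -- {<span class='wj'>(.*?)</span>} $v] {
    # Strip leading and trailing white space after the fact
    lappend trimmed [string trim $inner]
}
puts $trimmed
# Expected: {leading space} {leading space}
```

With only one quantifier in the pattern there is no question of mixed greediness, and string trim handles spans with or without surrounding white space in the same way.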