Why is XPath contains(text(),'substring') not working as expected?

Time:11-11

Let's say I have a piece of HTML like this:

<a>Ask Question<other/>more text</a>

I can match it with this XPath:

//a[text() = 'Ask Question']

Or...

//a[text() = 'more text']

Or I can use dot to match the whole thing:

//a[. = 'Ask Questionmore text']

This post describes the difference between . (dot) and text(): in short, the former returns a single element, whereas the latter returns a list of text nodes. But this is where it gets a bit weird to me. While text() can be used with = to match any of the nodes in that list, this is not the case when it comes to the XPath function contains(). If I do this:

//a[contains(text(), 'Ask Question')]

...I get the following error:

Error: Required cardinality of first argument of contains() is one or zero

How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?

CodePudding user response:

For this markup,

<a>Ask Question<other/>more text</a>

notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").

Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:

  1. contains(x,y) expects x to be a string, but text() matches two text nodes.
  2. In XPath 1.0, the rule for converting multiple nodes to a string is this:

     A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned. [Emphasis added]

  3. In XPath 2.0, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.

In your case...

  • XPath 1.0 would treat contains(text(),'Ask Question') as

    contains('Ask Question','Ask Question')
    

    which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0, because only the first text node ("Ask Question") is converted to a string and tested. Without knowing (1)-(3) above, this can be counter-intuitive.

  • XPath 2.0 would treat it as an error.
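
The XPath 1.0 conversion rule can be sketched in Python. This is only an emulation using the standard library's xml.dom.minidom, not a real XPath engine; the helper names text_children and contains_text_xpath1 are made up for illustration.

```python
from xml.dom import minidom

MARKUP = "<a>Ask Question<other/>more text</a>"

def text_children(element):
    """Return the immediate text node children as strings, in document order."""
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]

def contains_text_xpath1(element, needle):
    """Emulate contains(text(), needle) under the XPath 1.0 rule.

    Only the first text node in document order is converted to a string;
    an empty node-set converts to the empty string.
    """
    texts = text_children(element)
    first = texts[0] if texts else ""
    return needle in first

a = minidom.parseString(MARKUP).documentElement

print(text_children(a))                         # ['Ask Question', 'more text']
print(contains_text_xpath1(a, "Ask Question"))  # True
print(contains_text_xpath1(a, "more text"))     # False: second text node is ignored
```

The second text node never takes part in the comparison, which is why the 'more text' test fails even though that text is plainly inside the a element.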

Better alternatives

  • If the goal is to find all a elements whose string value contains the substring "Ask Question":

    //a[contains(.,'Ask Question')]
    

    This is the most common requirement.

  • If the goal is to find all a elements with an immediate text node child equal to "Ask Question":

    //a[text()='Ask Question']
    

    This can be useful when you wish to exclude strings from descendant elements of a, such as when you want this a,

    <a>Ask Question<other/>more text</a>
    

    but not this a:

    <a>more text before <not>Ask Question</not> more text after</a>
    


CodePudding user response:

The reason for this is that the contains function doesn't accept a node-set as input; it only accepts a string. (Well, this may be engine-dependent, because it works in Python's lxml module. According to the XPath 1.0 specification, it should convert the first node in the set to a string and act on that. See also "XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode".)

//a[text() = 'Ask Question'] matches any a element that has a text node child equal to Ask Question.

//a[text() = 'more text'] matches any a element that has a text node child equal to more text.

So both of these expressions match the same a element.

You can re-work your query to //a[text()[contains(., 'Ask Question')]] so that the contains function only acts on a single text node at a time.
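
To compare the expressions discussed in this thread, here is a rough Python sketch that emulates each predicate with the standard library's xml.dom.minidom. It is not a real XPath engine, and the helper names are hypothetical; each one mirrors the XPath predicate named in its comment.

```python
from xml.dom import minidom

def string_value(node):
    """XPath string-value: concatenation of all descendant text nodes."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(child) for child in node.childNodes)

def text_children(element):
    """Immediate text node children as strings, in document order."""
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]

def dot_contains(element, needle):
    # Emulates //a[contains(., needle)]
    return needle in string_value(element)

def any_text_equals(element, value):
    # Emulates //a[text() = value]: true if ANY text node child equals value
    return value in text_children(element)

def any_text_contains(element, needle):
    # Emulates //a[text()[contains(., needle)]]: each text node is tested on its own
    return any(needle in t for t in text_children(element))

first = minidom.parseString(
    "<a>Ask Question<other/>more text</a>").documentElement
second = minidom.parseString(
    "<a>more text before <not>Ask Question</not> more text after</a>").documentElement

for name, el in (("first", first), ("second", second)):
    print(name,
          dot_contains(el, "Ask Question"),
          any_text_equals(el, "Ask Question"),
          any_text_contains(el, "Ask Question"))
```

Under this emulation, the dot form matches both a elements, while the two text() forms match only the first one, because in the second element "Ask Question" lives inside the not child rather than in a text node directly under a.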
