lxml xpath exponential performance behavior-CodePudding

I'm trying to use xpath to query a large html with multiple tables and only extract a few tables that contain a specific pattern in one of the cells. I'm running into time related challenges.

I've tried to minimize by issue as much as possible.

code setup: - creates 10 (300x15) tables with random values between 0-100

import pandas as pd
import numpy as np

dataframes = [pd.DataFrame(np.random.randint(0,100, (300, 15)), columns=[f"Col-{i}" for i in range(15)]) for k in range(10)]
html_strings = [df.to_html() for df in dataframes]
combined_html = '\n'.join(html_strings)
source_html = f'<html><body>{combined_html}</body></html>'

code execution: I want to extract all tables that have the value "80" in them (in this case it will be all 10 tables)

from lxml import etree
root = etree.fromstring(source_html.encode())

PAT = '80' # this should result in returning all 10 tables as 80 will definitely be there in all of them (pandas index)

# method 1: query directly using xpath - this takes a long time to run - and this seems to exhibit exponential time behavior
xpath_query = "//table//*[text() = '{PAT}']/ancestor::table"
tables = root.xpath(xpath_query)

# method 2: this runs in under a second. first get all the tables and then run the same xpath expression within the table context
all_tables = root.xpath('//table')
table_xpath_individual = ".//*[text() = '{PAT}']/ancestor::table"
selected_tables = [table for table in all_tables if table.xpath(table_xpath_individual)]

method 1 takes 40-50s to finish method 2 takes <1s

I'm not sure whether it's the xpath expression in method 1 that's problematic or it's an lxml issue here. I've switched to using method 2 for now - but unsure whether it's a 100% equivalent behavior

CodePudding user response：

I don't know if this is relevant (but I suspect so). You can simplify these XPaths by eliminating the //* step and the trailing /ancestor::table step.

//table[descendant::text() = '{PAT}']

Note that in your problematic XPath, for each table you will find every descendant element whose text is 80 (there might be many within a table) and for each one, return all of that element's table ancestors (again, because in theory there might be more than one if you had a table containing a table, the XPath processor is going to have to laboriously traverse all those ancestor paths). That will return a potentially large number of results which the XPath processor will then have to deduplicate, so that it doesn't return multiple instances of any given table (an XPath 1.0 nodeset is guaranteed not to contain duplicates).