Filtering tables from read_html based on regex attr

Time:11-18

I know how to filter based on exact id with the help of the attrs parameter.

tables = pd.read_html(url, attrs={"id": "box-CHI-game-basic"})

I don't know the exact id in advance, but I do know its structure. I could capture the id with a regex:

re.search(".+-game-basic", "box-CHI-game-basic")

It doesn't work if you just add the regex as the value of the attr.

The match parameter of read_html accepts a regex, but it searches the whole text of each table. I would like to narrow it down to the id.

CodePudding user response:

I don't think you can do this with pd.read_html alone.

The match parameter will match:

The set of tables containing text matching this regex or string

As for attrs, what you're attempting won't work because:

attrs is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly

So I guess you'd have to find the tables with bs4 first, for example:

soup.find_all("table", id=re.compile(".+-game-basic"))

Then pass each matching table (as a string, e.g. str(table)) to pd.read_html for further parsing.
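Putting the two steps together, here is a minimal sketch of how this might look. The inline HTML, the table contents, and the names html, id_pattern, and frames are made up for illustration; on a real page you would fetch the HTML yourself (for example with requests) and feed it to BeautifulSoup. It assumes bs4 is installed along with a parser that pd.read_html can use, such as lxml.

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page; the ids mimic the question's pattern.
html = """
<table id="box-CHI-game-basic">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table>
<table id="box-CHI-game-advanced">
  <tr><th>Player</th><th>TS</th></tr>
  <tr><td>A</td><td>0.5</td></tr>
</table>
<table id="box-NYK-game-basic">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>B</td><td>7</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup accepts a compiled regex as an attribute filter.
# The trailing $ keeps ids that merely contain "-game-basic" from matching.
id_pattern = re.compile(r".+-game-basic$")
matching_tables = soup.find_all("table", id=id_pattern)

# Hand each matching <table> back to pandas as an HTML string.
# StringIO avoids the deprecation warning for literal HTML in newer pandas.
frames = {
    table["id"]: pd.read_html(StringIO(str(table)))[0]
    for table in matching_tables
}

for table_id, frame in frames.items():
    print(table_id)
    print(frame)
```

Keying the result by id keeps track of which DataFrame came from which table, which is useful when the id itself carries information such as the team code.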
