I know how to filter based on exact id with the help of the attrs parameter.
tables = pd.read_html(url, attrs={"id": "box-CHI-game-basic"})
I don't know the exact ID in advance, but I know its structure. I could capture the id with a regex:
re.search(".+-game-basic", "box-CHI-game-basic")
It doesn't work if you just pass the regex as the value in attrs.
The match parameter of read_html accepts a regex, but it searches the whole text of each table; I would like to narrow it down to the id.
CodePudding user response:
I don't think you can do this with pandas' read_html alone.
The match parameter will match: "The set of tables containing text matching this regex or string".
As for attrs, what you're attempting won't work because attrs "is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly." In other words, the values are passed along as given, and you found that a regex written as the value does not select the table.
So, I guess you'd have to resort to bs4 first, for example:
soup.find_all("table", id=re.compile(".+-game-basic"))
And then, pass the table to pandas for further parsing.
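Putting the two steps together, here is a minimal sketch of that approach. It assumes bs4, pandas and an HTML parser that read_html can use (such as lxml) are installed. The inline html string and its table contents are made up for illustration and stand in for the page you would download from url; the trailing $ in the pattern is my addition so that only ids ending in -game-basic match.

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (e.g. requests.get(url).text).
html = """
<table id="box-CHI-game-basic">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table>
<table id="box-CHI-game-advanced">
  <tr><th>Player</th><th>TS%</th></tr>
  <tr><td>A</td><td>0.5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup accepts a compiled regex as an attribute filter and
# keeps the tags whose id the pattern finds a match in.
id_pattern = re.compile(r".+-game-basic$")
matching_tables = soup.find_all("table", id=id_pattern)

# Hand each matching <table> back to pandas as an HTML string.
frames = {
    table["id"]: pd.read_html(StringIO(str(table)))[0]
    for table in matching_tables
}

for table_id, frame in frames.items():
    print(table_id)
    print(frame)
```

Keying the result by id keeps track of which team each table belongs to when several ids match the pattern.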