I'm trying to scrape a specific table from a page that contains multiple tables. The URL I'm using includes the fragment for the subsection where the table is located.
So far I have tried scraping all the tables and selecting the one I need manually:
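import requests
import pandas as pd
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
table_class = "toccolours"
tables = soup.find_all('table', table_class)  # find all tables with this class
# and pick the right one by its position on the page
df = pd.read_html(str(tables[15]))[0]  # read_html returns a list of DataFrames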
Is it possible to use the fragment in the URL, #Strikeforce_Challengers:_Britt_vs._Sayers, to scrape only the table in that section?
CodePudding user response:
You are on the right track. Simply split() the url once by #, split the last element of the result by _, and join() the pieces with spaces so they can be used in the CSS selector with :-soup-contains():
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
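The string handling packed into that one-liner can also be written out step by step, which makes it easier to check what text ends up inside the selector. This is only a sketch: heading_text_from_url is a made-up helper name, and it uses urllib.parse instead of split() and join(), assuming the fragment is the heading text with spaces replaced by underscores.

```python
from urllib.parse import urlsplit, unquote

def heading_text_from_url(url: str) -> str:
    """Turn the URL fragment into the visible heading text (hypothetical helper)."""
    fragment = urlsplit(url).fragment          # everything after the '#'
    # decode any percent-escapes, then swap underscores back to spaces
    return unquote(fragment).replace("_", " ")

url = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
heading = heading_text_from_url(url)

# same selector shape as in the answer: the h2 containing the heading text,
# followed by a later sibling with class "toccolours"
selector = f'h2:-soup-contains("{heading}") ~ .toccolours'
print(selector)
```

The resulting selector string can be passed to soup.select_one() exactly as in the line above.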
Example
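import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(url)
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(response.content, 'html.parser')
# build the heading text from the fragment, then take the first .toccolours sibling after that h2
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
# wrap the markup in StringIO; recent pandas versions deprecate passing a literal HTML string
pd.read_html(StringIO(str(table)))[0]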
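A possible variation, in case matching on the heading text turns out to be brittle: a URL fragment normally points at an element whose id equals the fragment, so the section can be looked up by id and the next matching table taken from there. The sketch below runs against a made-up miniature page rather than the real article; table_for_fragment is a hypothetical helper, and it assumes the target page carries such an id and that the wanted table is the first .toccolours table after it.

```python
from urllib.parse import urlsplit
from bs4 import BeautifulSoup

# Invented stand-in for a page with two sections, each followed by a table.
html = """
<h2 id="Event_A">Event A</h2>
<table class="toccolours"><tr><td>first</td></tr></table>
<h2 id="Event_B">Event B</h2>
<table class="toccolours"><tr><td>second</td></tr></table>
"""

def table_for_fragment(soup, url, table_class="toccolours"):
    """Return the first table with table_class after the element the fragment points to."""
    fragment = urlsplit(url).fragment
    if not fragment:
        return None                      # URL has no '#...' part
    anchor = soup.find(id=fragment)      # element whose id equals the fragment
    if anchor is None:
        return None                      # no such id on this page
    # find_next walks forward in document order, so nesting does not matter
    return anchor.find_next("table", class_=table_class)

soup = BeautifulSoup(html, "html.parser")
table = table_for_fragment(soup, "https://example.org/page#Event_B")
print(table.get_text(strip=True))
```

Because find_next() searches forward through the whole document instead of only among siblings, this keeps working if the heading is wrapped in another element.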