How to scrape table in specific subsection of a page?

Time:01-10

I'm trying to scrape a specific table from a page containing multiple tables. The url I'm using includes the subsection where the table is located.

So far I have tried scraping all the tables and selecting the one I need manually:

import requests
import pandas as pd
from bs4 import BeautifulSoup

wikiurl = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
table_class = "toccolours"
table = soup.find_all('table', class_=table_class)  # find all tables with this class
# and pick the right one by position
df = pd.read_html(str(table[15]))[0]  # read_html returns a list of DataFrames

Is it possible to use the fragment in the URL (#Strikeforce_Challengers:_Britt_vs._Sayers) to scrape only the table in this section?

CodePudding user response:

You are on the right track. split() the URL on # and take the last element (the fragment), split() that on _, and join() the pieces with spaces so the result matches the visible heading text. Then use that text in a CSS selector with :-soup-contains():

table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')
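
The string handling inside that selector can be checked on its own, without any network access. Below is a minimal sketch using only the standard library. The helper name fragment_to_heading is made up for illustration, and the unquote step is an assumption added for fragments that contain percent-encoded characters:

```python
from urllib.parse import urlsplit, unquote


def fragment_to_heading(url: str) -> str:
    """Turn the #fragment of a URL into the heading text it refers to (hypothetical helper)."""
    fragment = urlsplit(url).fragment  # text after '#', or '' if there is none
    # Assumption: underscores in the fragment stand for spaces in the heading.
    return unquote(fragment).replace("_", " ")


url = "https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers"
heading = fragment_to_heading(url)
print(heading)  # Strikeforce Challengers: Britt vs. Sayers

# The same selector as above, built from the extracted heading text.
selector = f'h2:-soup-contains("{heading}") ~ .toccolours'
```

Building the heading text in a separate step makes it easy to print and compare against the page before the selector is run.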

Example

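import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2011_in_Strikeforce#Strikeforce_Challengers:_Britt_vs._Sayers'
response = requests.get(url)
# name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(response.content, 'html.parser')

# heading text comes from the URL fragment; "~" selects a later sibling with class "toccolours"
table = soup.select_one(f'h2:-soup-contains("{" ".join(url.split("#")[-1].split("_"))}") ~ .toccolours')

# read_html returns a list of DataFrames; recent pandas versions expect a file-like object
pd.read_html(StringIO(str(table)))[0]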
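
An alternative that may be worth trying if the sibling selector finds nothing: look up the element whose id equals the URL fragment and take the next matching table after it. This is only a sketch under assumptions. It assumes the page gives the section heading an id identical to the fragment, and the HTML and URL below are invented stand-ins for illustration, not the real Wikipedia markup:

```python
from urllib.parse import urlsplit

from bs4 import BeautifulSoup

# Made-up sample page: the second heading is wrapped in a <div>,
# so its table is not a sibling of the <h2> itself.
html = """
<h2 id="First_event">First event</h2>
<table class="toccolours"><tr><td>first</td></tr></table>
<div class="mw-heading"><h2 id="Second_event">Second event</h2></div>
<p>Some text</p>
<table class="toccolours"><tr><td>second</td></tr></table>
"""
url = "https://example.org/wiki/Page#Second_event"  # hypothetical URL

soup = BeautifulSoup(html, "html.parser")

# Find the element whose id matches the fragment (None if it is absent).
anchor = soup.find(id=urlsplit(url).fragment)

# find_next walks forward in document order, so nesting does not matter.
table = anchor.find_next("table", class_="toccolours") if anchor else None

print(table.td.get_text() if table else "no table found")
```

Because find_next() follows document order rather than sibling relationships, this variant does not depend on the table sharing a parent with the heading. The resulting table can be passed to pd.read_html in the same way as in the example above.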