Hi, I am new to Stack Overflow.
I am trying to scrape the table that appears under the heading "Import VAT and excise" on this website for the commodity code "1704906500". I know for sure that the table will fall under "Import VAT and excise". I have several commodity codes and will be looping through all of them. The problem is that I am not able to point to the table under "Import VAT and excise" for scraping.
Please advise.
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

comCode = "1704906500"
url = "https://www.trade-tariff.service.gov.uk/commodities/" + comCode + "?currency=GBP#import"
url_request = requests.get(url).text
soup = BeautifulSoup(url_request, "lxml")

for header in soup.find_all('h3', text=re.compile('Import VAT and excise')):
    nextNode = header
    while True:
        # Walk the siblings that follow the heading until the next h3
        nextNode = nextNode.nextSibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            if nextNode.name == "h3":
                break
            print(nextNode)
            #comm_table = pd.read_html(nextNode.text, attrs = {"table class":"small-table measures govuk-table"} )
Answer:
You could use .find_next('table') based on the selection of your heading:
soup.find('h3', text=re.compile('Import VAT and excise')).find_next('table')
Or, as an alternative, with CSS selectors:
soup.select_one('h3:-soup-contains("Import VAT and excise")').find_next('table')
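To illustrate why find_next can work where a sibling walk does not, here is a minimal offline sketch. The HTML is made up for the illustration and is only an assumption about how such a page might be nested; the real page may differ. It uses string=, the newer name for the text= argument, and shows that find_next searches everything after the heading in document order, not just its siblings.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML (an assumption, not the real page): each heading is wrapped
# in a div, so the table is NOT a sibling of the h3.
html = """
<div><h3>Customs duties</h3></div>
<table><tr><th>Measure</th></tr><tr><td>Third country duty</td></tr></table>
<div><h3>Import VAT and excise</h3></div>
<table><tr><th>Measure</th></tr><tr><td>VAT standard rate</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h3", string=re.compile("Import VAT and excise"))

# A sibling search finds nothing, because the h3 is alone inside its div.
sibling_table = heading.find_next_sibling("table")

# find_next walks forward through the whole document after the heading.
table = heading.find_next("table")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(sibling_table)
print(rows)
```

If the heading and the table happen to be direct siblings on the real page, both approaches would find the table; find_next simply does not depend on that.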
Example
Iterate over a list of commodity codes and concatenate all the tables into one DataFrame:
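from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

comCode = ["1704906500"]
data = []

for c in comCode:
    url = f'https://www.trade-tariff.service.gov.uk/commodities/{c}?currency=GBP#import'
    # Name the parser explicitly so results do not depend on what is installed
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.select_one('h3:-soup-contains("Import VAT and excise")').find_next('table')
    # Newer pandas versions deprecate passing a literal HTML string, so wrap it in StringIO
    data.append(pd.read_html(StringIO(str(table)))[0])

pd.concat(data)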
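When looping over many commodity codes, some pages might not contain the heading at all, in which case select_one or find returns None and the chained .find_next call raises an AttributeError. A possible way to guard against that is sketched below. The helper names vat_table_rows and collect are hypothetical, and the pages dictionary holds made-up HTML in place of real responses so the sketch runs offline.

```python
import re
from bs4 import BeautifulSoup

HEADING = re.compile("Import VAT and excise")


def vat_table_rows(html):
    """Return the rows of the first table after the heading, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", string=HEADING)
    table = heading.find_next("table") if heading else None
    if table is None:
        return None
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


def collect(pages):
    """pages maps commodity code -> HTML; returns (rows_by_code, missing_codes)."""
    found, missing = {}, []
    for code, html in pages.items():
        rows = vat_table_rows(html)
        if rows is None:
            missing.append(code)  # remember the code instead of crashing
        else:
            found[code] = rows
    return found, missing


# Made-up pages standing in for requests.get(url).text (assumption).
pages = {
    "1704906500": "<h3>Import VAT and excise</h3>"
                  "<table><tr><td>VAT</td><td>rate</td></tr></table>",
    "0000000000": "<h3>Customs duties</h3><p>No VAT section here</p>",
}
found, missing = collect(pages)
print(found)
print(missing)
```

In a real run, the values in pages would come from requests.get for each code, and the list of missing codes could be reviewed or retried afterwards.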