Scrape multi-tables on websitepages-CodePudding

I have been having serious problems for the last 2 months because I can't manage to scrape the different tables of the following page in a correct way: (the goal is to have the data in a CSV file) [page]((4-chlorophenoxy)acetic acid 3) [Site]

I tried in no-code (webscraper, octoparse and others...) or with Python (pandas, Beautifulsoup...) but it doesn't give me anything usable in CSV. Does anyone have a solution to help me?

  link = "https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm"
    f = urllib.request.urlopen(link)
    html_doc= f.read()



   soup = BeautifulSoup(html_doc)
    #print(soup)
    pages = soup.find_all("a")
    filtred_pages = []
    for p in pages :
     if(p.has_attr('href') and p['href'].startswith("Report")):
       filtred_pages.append(p)

    #pages = list(filter(lambda el: el['href'].startswith("Report"), pages))
    print(filtred_pages)

    result = []
    for page in filtred_pages[:100]:
    f = urllib.request.urlopen('http://sitem.herts.ac.uk/aeru/ppdb/en/' page['href'])
    html_doc= f.read()
    soup = BeautifulSoup(html_doc)
    #print(soup)
    titreNode = soup.find_all("td", attrs={"class" : "title"})[0].text
      trs = soup.select('table.report_data tr')
      rowDict = {}
      for tr in trs:
    if len(tr.select('td.row_header')) == 0:
      continue
    if len(tr.select('td.data1')) == 0:
      continue
    tdTitre = tr.select('td.row_header')[0].text
    tdValue = tr.select('td.data1')[0].text.replace('&nbsp','').rstrip('\n').strip()
    rowDict[tdTitre] = tdValue
  result.append(rowDict)
       #print(result)
       df = pd.DataFrame(result)
       df.to_csv('file.csv')enter code here

CodePudding user response：

There is an awful lot of tables on each page. Here is a way to look at what tables are being located by pandas, and if any of them usable, that particular df can be saved as csv or excel file:

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)

url = 'https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm'

r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
links = [(x.get_text(strip=True), 'https://sitem.herts.ac.uk/aeru/ppdb/en/'   x.get('href')) for x in soup.select('a[target="_top"]') if 'Reports' in x.get('href')]
df = pd.DataFrame(links, columns = ['Title', 'Url'])
print(df.shape)
for url in df.Url[:2]:
    r = s.get(url)
    dfs = pd.read_html(r.text)
    for df in dfs:
        print(df)
        print('another table:')
    print(f'done {url} _______________')

This will return:

(1904, 2)
0   1
0   PPDB: Pesticide Properties DataBase NaN
another table:
0   1   2   3   4
0   Top Environmental Fate  Ecotoxicology   Human Health    Translations
another table:
0
0   Home
1   A to Z: All
2   A to Z: Insecticides A to Z: Herbicides A to Z: Fungicides A to Z: Other product constituents
3   Search
4   Support information
5   Edit history
6   Purchasing and licensing
another table:
0   1   2
0   (4-chlorophenoxy)acetic acid    Last updated: 04/03/2022    NaN
1   Last updated: 04/03/2022    NaN NaN
another table:
0
0   4-CPA is a plant growth regulator. It is highly soluble in water and is quite volatile. Little is known about its persistence in soil and aquatic systems. It is moderately toxic to mammals and has a high potential for bioaccumulation. It is also a recognised irritant. It is moderately toxic to birds, honeybees and most aquatic organisms. However, it has a relatively low toxicity to fish.
another table:
Environmental fate  Ecotoxicity Human health
0   Environmental fateHigh alert:Drainflow: Mobile Warning:Significant data are missing EcotoxicityModerate alert:Daphnia acute ecotoxicity: Moderate Warning:Significant data are missing  Human healthModerate alert:Mammals acute toxicity: Moderate; Possible Carcinogen; Possible Reproduction/development effects
another table:
0   1
0   GENERAL INFORMATION NaN
another table:
0   1
0   Description A plant growth regulator used fruit setting and thining
1   Example pests controlled    Growth - in terms of thinging and fruit set
2   Example applications    Tomatoes; Grapes
3   Efficacy & activity -
4   Availability status Current
5   Introduction & key dates    Current
another table:
0   1
0   UK COPR regulatory status   Not approved
1   Date COPR inclusion expires Expired
2   UK LERAP status No UK approval for use
another table:
0   1   2   3   4   5   6   7   8
0   EC Regulation 1107/2009 status  Not approved    NaN NaN NaN NaN NaN NaN NaN
1   Dossier rapporteur/co-rapporteur    Not applicable  NaN NaN NaN NaN NaN NaN NaN
2   Date EC 1107/2009 inclusion expires Expired NaN NaN NaN NaN NaN NaN NaN
3   EU Candidate for substitution (CfS) Not applicable  NaN NaN NaN NaN NaN NaN NaN
4   Listed in EU database   Yes NaN NaN NaN NaN NaN NaN NaN
5   Approved for use (✓) under EC 1107/2009 in the following EU Member States   ATAustria BEBelgium BGBulgaria CYCyprus CZCzech Republic DEGermany DKDenmark EEEstonia ELGreece &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp ESSpain FIFinland FRFrance HRCroatia HUHungary IEIreland ITItaly LTLithuania LULuxembourg &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp LVLatvia MTMalta NLNetherlands PLPoland PTPortugal RORomania SESweden SISlovenia SKSlovakia &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp &nbsp NaN NaN NaN NaN NaN NaN NaN
6   ATAustria   BEBelgium   BGBulgaria  CYCyprus    CZCzech Republic    DEGermany   DKDenmark   EEEstonia   ELGreece
7   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp
8   ESSpain FIFinland   FRFrance    HRCroatia   HUHungary   IEIreland   ITItaly LTLithuania LULuxembourg
9   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp
10  LVLatvia    MTMalta NLNetherlands   PLPoland    PTPortugal  RORomania   SESweden    SISlovenia  SKSlovakia
11  &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp
another table:
ATAustria   BEBelgium   BGBulgaria  CYCyprus    CZCzech Republic    DEGermany   DKDenmark   EEEstonia   ELGreece
0   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp
1   ESSpain FIFinland   FRFrance    HRCroatia   HUHungary   IEIreland   ITItaly LTLithuania LULuxembourg
2   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp
3   LVLatvia    MTMalta NLNetherlands   PLPoland    PTPortugal  RORomania   SESweden    SISlovenia  SKSlovakia
4   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp   &nbsp
another table:
0   1
0   Also used in    -
another table:
0   1
0   Isomerism   None
1   Chemical formula    C₈H₇ClO₃
2   Canonical SMILES    C1=CC(=CC=C1OCC(=O)O)Cl
3   Isomeric SMILES No data
4   International Chemical Identifier key (InChIKey)    SODPIMGUZLOIPE-UHFFFAOYSA-N
5   International Chemical Identifier (InChI)   InChI=1S/C8H7ClO3/c9-6-1-3-7(4-2-6)12-5-8(10)11/h1-4H,5H2,(H,10,11)
6   2D structure diagram/image available?   Yes
another table:
0   1
0   Cambridge Crystallographic Data Centre diagrams NaN
another table:
Common Name Relationship    Link
0   Common Name Relationship    Link
1   (4-chlorophenoxy)acetic acid    -   NaN
another table:
0   1
0   Pesticide type  Plant growth regulator, Herbicide
1   Substance groups    Auxin PGR
2   Minimum active substance purity -
3   Known relevant impurities   -
4   Substance origin    Synthetic
5   Mode of action  Auxin-transport inhibitor. Absorbed by tissue and translocated to roots
6   CAS RN  122-88-3
7   EC number   204-581-3
8   CIPAC number    291
9   US EPA chemical code    019401
10  PubChem CID 26229
11  Molecular mass  186.59
12  PIN (Preferred Identification Name) (4-chlorophenoxy)acetic acid
13  IUPAC name  4-chlorophenoxyacetic acid
14  CAS name    (4-chlorophenoxy)acetic acid
15  Other status information    -
16  Relevant Environmental Water Quality Standards  -
17  Herbicide Resistance Classification (HRAC)  P
18  Herbicide Resistance Classification (WSSA)  19
19  Insecticide Resistance Classification (IRAC)    Not applicable
20  Fungicide Resistance Classification (FRAC)  Not applicable
21  Examples of recorded resistance -
22  Physical state  Off-white crystalline solid
another table:
Property    Value
0   Property    Value
1   Example manufacturers & suppliers of products using this active now or historically A. H. MarksHockley International Ltd.
2   Example products using this active  PoltomatMarks 4-CPA
3   Formulation and application details Available in a range of formulations including aerosols, liquids and tablets
another table:
0   1
0   ENVIRONMENTAL FATE  NaN
another table:
Property    Property.1  Value   Source; quality score; and other information    Interpretation
0   Property    Property    Value   Source; quality score; and other information    Interpretation
1   Solubility - In water at 20 °C (mg l⁻¹) Solubility - In water at 20 °C (mg l⁻¹) 957 CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source   High
2   Solubility - In organic solvents at 20 °C (mg l⁻¹)  Solubility - In organic solvents at 20 °C (mg l⁻¹)  -   -   -
3   Melting point (°C)  Melting point (°C)  157 CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source   -
4   Boiling point (°C)  Boiling point (°C)  Decomposes before boiling   V3 V = ChemID online databases / IPCS INCHEM (ChemID; IPCS INCHEM )3 = Unverified data of known source  -
5   Degradation point (°C)  Degradation point (°C)  -   -   -
6   Flashpoint (°C) Flashpoint (°C) -   -   -
7   Octanol-water partition coefficient at pH 7, 20 °C  P   1.78 X 1002 Calculated  -
8   Octanol-water partition coefficient at pH 7, 20 °C  Log P   2.25    CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source   Low
9   Fat solubility of residues  Solubility  -   -   -
10  Fat solubility of residues  Data type   -   -   -
11  Density (g ml⁻¹)    Density (g ml⁻¹)    1.52    L3 L = Pesticide manuals and hard copy reference books / other sources3 = Unverified data of known source   -
12  Dissociation constant pKa) at 25 °C Dissociation constant pKa) at 25 °C 3.01    DW4 DW = Don Wauchope personal database for Pka data: Wauchope, R. D. and Edwards, J. Dissociation constants for pesticide active ingredients: a database and comparison with predicted values. MS in preparation4 = Verified data  -
13  Dissociation constant pKa) at 25 °C Dissociation constant pKa) at 25 °C Weak acid   Weak acid   Weak acid
14  Vapour pressure at 20 °C (mPa)  Vapour pressure at 20 °C (mPa)  2.40 X 10-02    L3 L = Pesticide manuals and hard copy reference books / other sources3 = Unverified data of known source   Low volatility
15  Henry's law constant at 25 °C (Pa m³ mol⁻¹) Henry's law constant at 25 °C (Pa m³ mol⁻¹) 6.48 X 10-08    CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source   Non-volatile
16  Maximum UV-vis absorption L mol⁻¹ cm⁻¹  Maximum UV-vis absorption L mol⁻¹ cm⁻¹  -   -   -
17  Surface tension (mN m⁻¹)    Surface tension (mN m⁻¹)    -   -   -
another table:
Property    Property.1  Value   Source; quality score; and other information    Interpretation
0   Property    Property    Value   Source; quality score; and other information    Interpretation
1   General biodegradability    General biodegradability    -   -   -
2   Soil degradation (days) (aerobic)   DT₅₀ (typical)  -   -   -
3   Soil degradation (days) (aerobic)   DT₅₀ (lab at 20 °C) -   -   -
4   Soil degradation (days) (aerobic)   DT₅₀ (field)    -   -   -
5   Soil degradation (days) (aerobic)   DT₉₀ (lab at 20 °C) -   -   -
6   Soil degradation (days) (aerobic)   DT₉₀ (field)    -   -   -
7   Soil degradation (days) (aerobic)   DT₅₀ modelling endpoint -   -   -
8   Soil degradation (days) (aerobic)   Note    -   -   -
9   Dissipation rate RL₅₀ on plant matrix   Value   -   -   -
10  Dissipation rate RL₅₀ on plant matrix   Note    -   -   -
11  Dissipation rate RL₅₀ on and in plant matrix    Value   -   -   -
12  Dissipation rate RL₅₀ on and in plant matrix    Note    -   -   -
13  Aqueous photolysis DT₅₀ (days) at pH 7  Value   -   -   -
14  Aqueous photolysis DT₅₀ (days) at pH 7  Note    -   -   -
15  Aqueous hydrolysis DT₅₀ (days) at 20 °C and pH 7    Value   -   -   -
16  Aqueous hydrolysis DT₅₀ (days) at 20 °C and pH 7    Note    -   -   -
17  Water-sediment DT₅₀ (days)  Water-sediment DT₅₀ (days)  -   -   -
18  Water phase only DT₅₀ (days)    Water phase only DT₅₀ (days)    -   -   -
another table:
Property    Property.1  Value   Source; quality score; and other information    Interpretation
0   Property    Property    Value   Source; quality score; and other information    Interpretation
1   Linear  Kd  -   CA2 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )2 = Unverified data of unknown source Mobile
2   Linear  Koc 18  CA2 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )2 = Unverified data of unknown source Mobile
3   Linear  Notes and range Estimated   Estimated   Estimated
4   Freundlich  Kf  -   -   -
5   Freundlich  Kfoc    -   -   -
6   Freundlich  1/n -   -   -
7   Freundlich  Notes and range -   -   -
8   pH sensitivity  pH sensitivity  -   -   -
another table:
Property    Value   Source; quality score; and other information    Interpretation
Metabolite name and reference   Aliases Formation medium / Rate Estimated maximum occurrence fraction   Metabolising enzymes
Property    Property    Value   Source; quality score; and other information    Interpretation
Property    Property    Value   Source; quality score; and other information    Interpretation
Property    Property    Value   Source; quality score; and other information    Interpretation
Property    Value and interpretation    Unnamed: 2_level_5  Unnamed: 3_level_5  Unnamed: 4_level_5
Language    Name    Unnamed: 2_level_6  Unnamed: 3_level_6  Unnamed: 4_level_6
0   Property    Property    Value   Source; quality score; and other information    Interpretation
1   GUS leaching potential index    GUS leaching potential index    -   -   -
2   SCI-GROW groundwater index (μg l⁻¹) for a 1 kg ha⁻¹ or 1 l ha⁻¹ application rate    Value   Cannot be calculated    -   -
3   SCI-GROW groundwater index (μg l⁻¹) for a 1 kg ha⁻¹ or 1 l ha⁻¹ application rate    Note    -   -   -
4   Potential for particle bound transport index    Potential for particle bound transport index    -   -   -
... ... ... ... ... ...
107 Dutch   -   NaN NaN NaN
108 Record last updated:    04/03/2022  NaN NaN NaN
109 Contact:    [email protected]    NaN NaN NaN
110 Please cite as: Lewis, K.A., Tzilivakis, J., Warner, D. and Green, A. (2016) An international database for pesticide risk assessments and management. Human and Ecological Risk Assessment: An International Journal, 22(4), 1050-1064. DOI: 10.1080/10807039.2015.1133242  NaN NaN NaN
111 © University of Hertfordshire   ALSO AVAILABLE: THE VSDB THE BPDB   www.herts.ac.uk/aeru    NaN NaN
112 rows × 5 columns

another table:
Metabolite name and reference   Aliases Formation medium / Rate Estimated maximum occurrence fraction   Metabolising enzymes
0   Metabolite name and reference   Aliases Formation medium / Rate Estimated maximum occurrence fraction   Metabolising enzymes
1   centrophenoxine -   Plant   -   -
another table:
0   1
0   ECOTOXICOLOGY   NaN
[...]

CodePudding user response：

Putting all the different tables data into to one CSV is hard, but based on your particular selections not impossible. If intention is to put each table in a separat table checkout the approache of @platipus_on_fire.

Example

Note: limited the pages to scrape by [:5] for demonstration, simply remove the list slicing to get all, but be gentle and add some time between your requests

import requests
import pandas as pd
from bs4 import BeautifulSoup
url='https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm'
headers = {'User-Agent': 'Mozilla/5.0'}
r=requests.get(url, headers = headers)
soup=BeautifulSoup(r.text)

data = []

for a in soup.select('a[href^="Reports"]')[:5]:
    r=requests.get(f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}', headers = headers)
    soup=BeautifulSoup(r.text)
    for row in soup.select('tr:has(td.row_header)'):
        if row.select_one('.data1'):
            data.append({
                'url':f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}',
                'page':soup.select_one('.title').text,
                'table':row.find_parent('table').find_previous_sibling('table').td.get_text(strip=True),
                'title':row.select_one('.row_header').text,
                'value':row.select_one('.data1').get_text('|', strip=True)
            })

pd.DataFrame(data)#.to_csv(...)

Output

url	page	table	title	value
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm	(4-chlorophenoxy)acetic acid	GENERAL INFORMATION	Description	A plant growth regulator used fruit setting and thining
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm	(4-chlorophenoxy)acetic acid	GENERAL INFORMATION	Example pests controlled	Growth - in terms of thinging and fruit set
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm	(4-chlorophenoxy)acetic acid	GENERAL INFORMATION	Example applications	Tomatoes; Grapes
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm	(4-chlorophenoxy)acetic acid	GENERAL INFORMATION	Efficacy & activity	-
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm	(4-chlorophenoxy)acetic acid	GENERAL INFORMATION	Availability status	Current

...

Just in addition if it is the goal to save all the data from each page in a separat CSV - Move data and CSV export into the loop:

for a in soup.select('a[href^="Reports"]')[:5]:
    r=requests.get(f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}', headers = headers)
    soup=BeautifulSoup(r.text)
    data = []
    for row in soup.select('tr:has(td.row_header)'):
        if row.select_one('.data1'):
            data.append({
                'url':f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}',
                'page':soup.select_one('.title').text,
                'table':row.find_parent('table').find_previous_sibling('table').td.get_text(strip=True),
                'title':row.select_one('.row_header').text,
                'value':row.select_one('.data1').get_text('|', strip=True)
            })
    pd.DataFrame(data)#.to_csv(soup.select_one('.title').text, index=False)