I have been having serious problems for the last 2 months because I can't manage to scrape the different tables of the following page in a correct way: (the goal is to have the data in a CSV file) [page]((4-chlorophenoxy)acetic acid 3) [Site]
I tried in no-code (webscraper, octoparse and others...) or with Python (pandas, Beautifulsoup...) but it doesn't give me anything usable in CSV. Does anyone have a solution to help me?
link = "https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm"
f = urllib.request.urlopen(link)
html_doc= f.read()
soup = BeautifulSoup(html_doc)
#print(soup)
pages = soup.find_all("a")
filtred_pages = []
for p in pages :
if(p.has_attr('href') and p['href'].startswith("Report")):
filtred_pages.append(p)
#pages = list(filter(lambda el: el['href'].startswith("Report"), pages))
print(filtred_pages)
result = []
for page in filtred_pages[:100]:
f = urllib.request.urlopen('http://sitem.herts.ac.uk/aeru/ppdb/en/' page['href'])
html_doc= f.read()
soup = BeautifulSoup(html_doc)
#print(soup)
titreNode = soup.find_all("td", attrs={"class" : "title"})[0].text
trs = soup.select('table.report_data tr')
rowDict = {}
for tr in trs:
if len(tr.select('td.row_header')) == 0:
continue
if len(tr.select('td.data1')) == 0:
continue
tdTitre = tr.select('td.row_header')[0].text
tdValue = tr.select('td.data1')[0].text.replace(' ','').rstrip('\n').strip()
rowDict[tdTitre] = tdValue
result.append(rowDict)
#print(result)
df = pd.DataFrame(result)
df.to_csv('file.csv')enter code here
CodePudding user response:
There is an awful lot of tables on each page. Here is a way to look at what tables are being located by pandas, and if any of them usable, that particular df can be saved as csv or excel file:
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
url = 'https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm'
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
links = [(x.get_text(strip=True), 'https://sitem.herts.ac.uk/aeru/ppdb/en/' x.get('href')) for x in soup.select('a[target="_top"]') if 'Reports' in x.get('href')]
df = pd.DataFrame(links, columns = ['Title', 'Url'])
print(df.shape)
for url in df.Url[:2]:
r = s.get(url)
dfs = pd.read_html(r.text)
for df in dfs:
print(df)
print('another table:')
print(f'done {url} _______________')
This will return:
(1904, 2)
0 1
0 PPDB: Pesticide Properties DataBase NaN
another table:
0 1 2 3 4
0 Top Environmental Fate Ecotoxicology Human Health Translations
another table:
0
0 Home
1 A to Z: All
2 A to Z: Insecticides A to Z: Herbicides A to Z: Fungicides A to Z: Other product constituents
3 Search
4 Support information
5 Edit history
6 Purchasing and licensing
another table:
0 1 2
0 (4-chlorophenoxy)acetic acid Last updated: 04/03/2022 NaN
1 Last updated: 04/03/2022 NaN NaN
another table:
0
0 4-CPA is a plant growth regulator. It is highly soluble in water and is quite volatile. Little is known about its persistence in soil and aquatic systems. It is moderately toxic to mammals and has a high potential for bioaccumulation. It is also a recognised irritant. It is moderately toxic to birds, honeybees and most aquatic organisms. However, it has a relatively low toxicity to fish.
another table:
Environmental fate Ecotoxicity Human health
0 Environmental fateHigh alert:Drainflow: Mobile Warning:Significant data are missing EcotoxicityModerate alert:Daphnia acute ecotoxicity: Moderate Warning:Significant data are missing Human healthModerate alert:Mammals acute toxicity: Moderate; Possible Carcinogen; Possible Reproduction/development effects
another table:
0 1
0 GENERAL INFORMATION NaN
another table:
0 1
0 Description A plant growth regulator used fruit setting and thining
1 Example pests controlled Growth - in terms of thinging and fruit set
2 Example applications Tomatoes; Grapes
3 Efficacy & activity -
4 Availability status Current
5 Introduction & key dates Current
another table:
0 1
0 UK COPR regulatory status Not approved
1 Date COPR inclusion expires Expired
2 UK LERAP status No UK approval for use
another table:
0 1 2 3 4 5 6 7 8
0 EC Regulation 1107/2009 status Not approved NaN NaN NaN NaN NaN NaN NaN
1 Dossier rapporteur/co-rapporteur Not applicable NaN NaN NaN NaN NaN NaN NaN
2 Date EC 1107/2009 inclusion expires Expired NaN NaN NaN NaN NaN NaN NaN
3 EU Candidate for substitution (CfS) Not applicable NaN NaN NaN NaN NaN NaN NaN
4 Listed in EU database Yes NaN NaN NaN NaN NaN NaN NaN
5 Approved for use (✓) under EC 1107/2009 in the following EU Member States ATAustria BEBelgium BGBulgaria CYCyprus CZCzech Republic DEGermany DKDenmark EEEstonia ELGreece                   ESSpain FIFinland FRFrance HRCroatia HUHungary IEIreland ITItaly LTLithuania LULuxembourg                   LVLatvia MTMalta NLNetherlands PLPoland PTPortugal RORomania SESweden SISlovenia SKSlovakia                   NaN NaN NaN NaN NaN NaN NaN
6 ATAustria BEBelgium BGBulgaria CYCyprus CZCzech Republic DEGermany DKDenmark EEEstonia ELGreece
7                  
8 ESSpain FIFinland FRFrance HRCroatia HUHungary IEIreland ITItaly LTLithuania LULuxembourg
9                  
10 LVLatvia MTMalta NLNetherlands PLPoland PTPortugal RORomania SESweden SISlovenia SKSlovakia
11                  
another table:
ATAustria BEBelgium BGBulgaria CYCyprus CZCzech Republic DEGermany DKDenmark EEEstonia ELGreece
0                  
1 ESSpain FIFinland FRFrance HRCroatia HUHungary IEIreland ITItaly LTLithuania LULuxembourg
2                  
3 LVLatvia MTMalta NLNetherlands PLPoland PTPortugal RORomania SESweden SISlovenia SKSlovakia
4                  
another table:
0 1
0 Also used in -
another table:
0 1
0 Isomerism None
1 Chemical formula C₈H₇ClO₃
2 Canonical SMILES C1=CC(=CC=C1OCC(=O)O)Cl
3 Isomeric SMILES No data
4 International Chemical Identifier key (InChIKey) SODPIMGUZLOIPE-UHFFFAOYSA-N
5 International Chemical Identifier (InChI) InChI=1S/C8H7ClO3/c9-6-1-3-7(4-2-6)12-5-8(10)11/h1-4H,5H2,(H,10,11)
6 2D structure diagram/image available? Yes
another table:
0 1
0 Cambridge Crystallographic Data Centre diagrams NaN
another table:
Common Name Relationship Link
0 Common Name Relationship Link
1 (4-chlorophenoxy)acetic acid - NaN
another table:
0 1
0 Pesticide type Plant growth regulator, Herbicide
1 Substance groups Auxin PGR
2 Minimum active substance purity -
3 Known relevant impurities -
4 Substance origin Synthetic
5 Mode of action Auxin-transport inhibitor. Absorbed by tissue and translocated to roots
6 CAS RN 122-88-3
7 EC number 204-581-3
8 CIPAC number 291
9 US EPA chemical code 019401
10 PubChem CID 26229
11 Molecular mass 186.59
12 PIN (Preferred Identification Name) (4-chlorophenoxy)acetic acid
13 IUPAC name 4-chlorophenoxyacetic acid
14 CAS name (4-chlorophenoxy)acetic acid
15 Other status information -
16 Relevant Environmental Water Quality Standards -
17 Herbicide Resistance Classification (HRAC) P
18 Herbicide Resistance Classification (WSSA) 19
19 Insecticide Resistance Classification (IRAC) Not applicable
20 Fungicide Resistance Classification (FRAC) Not applicable
21 Examples of recorded resistance -
22 Physical state Off-white crystalline solid
another table:
Property Value
0 Property Value
1 Example manufacturers & suppliers of products using this active now or historically A. H. MarksHockley International Ltd.
2 Example products using this active PoltomatMarks 4-CPA
3 Formulation and application details Available in a range of formulations including aerosols, liquids and tablets
another table:
0 1
0 ENVIRONMENTAL FATE NaN
another table:
Property Property.1 Value Source; quality score; and other information Interpretation
0 Property Property Value Source; quality score; and other information Interpretation
1 Solubility - In water at 20 °C (mg l⁻¹) Solubility - In water at 20 °C (mg l⁻¹) 957 CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source High
2 Solubility - In organic solvents at 20 °C (mg l⁻¹) Solubility - In organic solvents at 20 °C (mg l⁻¹) - - -
3 Melting point (°C) Melting point (°C) 157 CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source -
4 Boiling point (°C) Boiling point (°C) Decomposes before boiling V3 V = ChemID online databases / IPCS INCHEM (ChemID; IPCS INCHEM )3 = Unverified data of known source -
5 Degradation point (°C) Degradation point (°C) - - -
6 Flashpoint (°C) Flashpoint (°C) - - -
7 Octanol-water partition coefficient at pH 7, 20 °C P 1.78 X 1002 Calculated -
8 Octanol-water partition coefficient at pH 7, 20 °C Log P 2.25 CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source Low
9 Fat solubility of residues Solubility - - -
10 Fat solubility of residues Data type - - -
11 Density (g ml⁻¹) Density (g ml⁻¹) 1.52 L3 L = Pesticide manuals and hard copy reference books / other sources3 = Unverified data of known source -
12 Dissociation constant pKa) at 25 °C Dissociation constant pKa) at 25 °C 3.01 DW4 DW = Don Wauchope personal database for Pka data: Wauchope, R. D. and Edwards, J. Dissociation constants for pesticide active ingredients: a database and comparison with predicted values. MS in preparation4 = Verified data -
13 Dissociation constant pKa) at 25 °C Dissociation constant pKa) at 25 °C Weak acid Weak acid Weak acid
14 Vapour pressure at 20 °C (mPa) Vapour pressure at 20 °C (mPa) 2.40 X 10-02 L3 L = Pesticide manuals and hard copy reference books / other sources3 = Unverified data of known source Low volatility
15 Henry's law constant at 25 °C (Pa m³ mol⁻¹) Henry's law constant at 25 °C (Pa m³ mol⁻¹) 6.48 X 10-08 CA3 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )3 = Unverified data of known source Non-volatile
16 Maximum UV-vis absorption L mol⁻¹ cm⁻¹ Maximum UV-vis absorption L mol⁻¹ cm⁻¹ - - -
17 Surface tension (mN m⁻¹) Surface tension (mN m⁻¹) - - -
another table:
Property Property.1 Value Source; quality score; and other information Interpretation
0 Property Property Value Source; quality score; and other information Interpretation
1 General biodegradability General biodegradability - - -
2 Soil degradation (days) (aerobic) DT₅₀ (typical) - - -
3 Soil degradation (days) (aerobic) DT₅₀ (lab at 20 °C) - - -
4 Soil degradation (days) (aerobic) DT₅₀ (field) - - -
5 Soil degradation (days) (aerobic) DT₉₀ (lab at 20 °C) - - -
6 Soil degradation (days) (aerobic) DT₉₀ (field) - - -
7 Soil degradation (days) (aerobic) DT₅₀ modelling endpoint - - -
8 Soil degradation (days) (aerobic) Note - - -
9 Dissipation rate RL₅₀ on plant matrix Value - - -
10 Dissipation rate RL₅₀ on plant matrix Note - - -
11 Dissipation rate RL₅₀ on and in plant matrix Value - - -
12 Dissipation rate RL₅₀ on and in plant matrix Note - - -
13 Aqueous photolysis DT₅₀ (days) at pH 7 Value - - -
14 Aqueous photolysis DT₅₀ (days) at pH 7 Note - - -
15 Aqueous hydrolysis DT₅₀ (days) at 20 °C and pH 7 Value - - -
16 Aqueous hydrolysis DT₅₀ (days) at 20 °C and pH 7 Note - - -
17 Water-sediment DT₅₀ (days) Water-sediment DT₅₀ (days) - - -
18 Water phase only DT₅₀ (days) Water phase only DT₅₀ (days) - - -
another table:
Property Property.1 Value Source; quality score; and other information Interpretation
0 Property Property Value Source; quality score; and other information Interpretation
1 Linear Kd - CA2 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )2 = Unverified data of unknown source Mobile
2 Linear Koc 18 CA2 CA = Medical and toxicological databases and information systems e.g. TOXNET (click here )2 = Unverified data of unknown source Mobile
3 Linear Notes and range Estimated Estimated Estimated
4 Freundlich Kf - - -
5 Freundlich Kfoc - - -
6 Freundlich 1/n - - -
7 Freundlich Notes and range - - -
8 pH sensitivity pH sensitivity - - -
another table:
Property Value Source; quality score; and other information Interpretation
Metabolite name and reference Aliases Formation medium / Rate Estimated maximum occurrence fraction Metabolising enzymes
Property Property Value Source; quality score; and other information Interpretation
Property Property Value Source; quality score; and other information Interpretation
Property Property Value Source; quality score; and other information Interpretation
Property Value and interpretation Unnamed: 2_level_5 Unnamed: 3_level_5 Unnamed: 4_level_5
Language Name Unnamed: 2_level_6 Unnamed: 3_level_6 Unnamed: 4_level_6
0 Property Property Value Source; quality score; and other information Interpretation
1 GUS leaching potential index GUS leaching potential index - - -
2 SCI-GROW groundwater index (μg l⁻¹) for a 1 kg ha⁻¹ or 1 l ha⁻¹ application rate Value Cannot be calculated - -
3 SCI-GROW groundwater index (μg l⁻¹) for a 1 kg ha⁻¹ or 1 l ha⁻¹ application rate Note - - -
4 Potential for particle bound transport index Potential for particle bound transport index - - -
... ... ... ... ... ...
107 Dutch - NaN NaN NaN
108 Record last updated: 04/03/2022 NaN NaN NaN
109 Contact: [email protected] NaN NaN NaN
110 Please cite as: Lewis, K.A., Tzilivakis, J., Warner, D. and Green, A. (2016) An international database for pesticide risk assessments and management. Human and Ecological Risk Assessment: An International Journal, 22(4), 1050-1064. DOI: 10.1080/10807039.2015.1133242 NaN NaN NaN
111 © University of Hertfordshire ALSO AVAILABLE: THE VSDB THE BPDB www.herts.ac.uk/aeru NaN NaN
112 rows × 5 columns
another table:
Metabolite name and reference Aliases Formation medium / Rate Estimated maximum occurrence fraction Metabolising enzymes
0 Metabolite name and reference Aliases Formation medium / Rate Estimated maximum occurrence fraction Metabolising enzymes
1 centrophenoxine - Plant - -
another table:
0 1
0 ECOTOXICOLOGY NaN
[...]
CodePudding user response:
Putting all the different tables data into to one CSV is hard, but based on your particular selections not impossible. If intention is to put each table in a separat table checkout the approache of @platipus_on_fire.
Example
Note: limited the pages to scrape by [:5] for demonstration, simply remove the list slicing to get all, but be gentle and add some time between your requests
import requests
import pandas as pd
from bs4 import BeautifulSoup
url='https://sitem.herts.ac.uk/aeru/ppdb/en/atoz.htm'
headers = {'User-Agent': 'Mozilla/5.0'}
r=requests.get(url, headers = headers)
soup=BeautifulSoup(r.text)
data = []
for a in soup.select('a[href^="Reports"]')[:5]:
r=requests.get(f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}', headers = headers)
soup=BeautifulSoup(r.text)
for row in soup.select('tr:has(td.row_header)'):
if row.select_one('.data1'):
data.append({
'url':f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}',
'page':soup.select_one('.title').text,
'table':row.find_parent('table').find_previous_sibling('table').td.get_text(strip=True),
'title':row.select_one('.row_header').text,
'value':row.select_one('.data1').get_text('|', strip=True)
})
pd.DataFrame(data)#.to_csv(...)
Output
url | page | table | title | value |
---|---|---|---|---|
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm | (4-chlorophenoxy)acetic acid | GENERAL INFORMATION | Description | A plant growth regulator used fruit setting and thining |
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm | (4-chlorophenoxy)acetic acid | GENERAL INFORMATION | Example pests controlled | Growth - in terms of thinging and fruit set |
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm | (4-chlorophenoxy)acetic acid | GENERAL INFORMATION | Example applications | Tomatoes; Grapes |
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm | (4-chlorophenoxy)acetic acid | GENERAL INFORMATION | Efficacy & activity | - |
http://sitem.herts.ac.uk/aeru/ppdb/en/Reports/7.htm | (4-chlorophenoxy)acetic acid | GENERAL INFORMATION | Availability status | Current |
...
Just in addition if it is the goal to save all the data from each page in a separat CSV - Move data
and CSV export into the loop:
for a in soup.select('a[href^="Reports"]')[:5]:
r=requests.get(f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}', headers = headers)
soup=BeautifulSoup(r.text)
data = []
for row in soup.select('tr:has(td.row_header)'):
if row.select_one('.data1'):
data.append({
'url':f'http://sitem.herts.ac.uk/aeru/ppdb/en/{a.get("href")}',
'page':soup.select_one('.title').text,
'table':row.find_parent('table').find_previous_sibling('table').td.get_text(strip=True),
'title':row.select_one('.row_header').text,
'value':row.select_one('.data1').get_text('|', strip=True)
})
pd.DataFrame(data)#.to_csv(soup.select_one('.title').text, index=False)