I am trying to scrape data from the following website https://www.ecfr.gov/on/2022-04-08/title-21/chapter-I/subchapter-E/part-556/subpart-B/section-556.50
Note that there is a nested structure (Tolerances -> Cattle -> Liver and muscle). This is also one of many sections in this legislation.
There is a "Developer Tools" option, but I am having trouble keeping the nested structure: https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50
I would like to convert this HTML to a pandas DataFrame, keeping the nested structure. For example:
h4 | Indent-2 | Indent-3 |
---|---|---|
Amprolium | (1) Cattle | (i) Liver, kidney, and muscle: 0.5 ppm. |
The problem is that class "Indent-3" should be nested in "Indent-2", which should be nested in "h4". I can create the desired data by specifying each class name, but if I want to loop through the sections, I don't want to have to specify each class name.
Is there a more general way (without specifying the class name) to produce the dataframe? This is my code so far.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"
r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")

# Each level is looked up by a hard-coded id and class name
title = soup.find("h4").text
id2 = soup.find("div", attrs={"id": "p-556.50(b)(1)"}).find(attrs={"class": "indent-2"}).text
id3 = soup.find("div", attrs={"id": "p-556.50(b)(1)(i)"}).find(attrs={"class": "indent-3"}).text
df = pd.DataFrame(data={"h4": [title],
                        "indent-2": [id2],
                        "indent-3": [id3]})
CodePudding user response:
Are you looking for something like this?
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"
r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
title = soup.find("h4").text
# Match every element whose class starts with "indent-", whatever the level
indents = soup.find_all(attrs={'class': re.compile("^indent-")})
row = {'h4': [title]}
for indent in indents:
    # With the "xml" parser, 'class' is a plain string such as "indent-2"
    key = 'indent-' + indent['class'].split('-')[-1]
    if key not in row:
        row[key] = []
    row[key].append(indent.text.strip())
# One column per indent level; shorter columns are padded with NaN
df = pd.concat([pd.DataFrame({k: v}) for k, v in row.items()], axis=1)
Output:
print(df.to_string())
h4 indent-1 indent-2 indent-3 indent-4
0 § 556.50 Amprolium. (a) [Reserved] (1) Cattle. (i) Liver, kidney, and muscle: 0.5 ppm. (A) Egg yolks: 8 ppm.
1 NaN (b) Tolerances. The tolerances for amprolium are: (2) Chickens and turkeys. (ii) Fat: 2.0 ppm. (B) Whole eggs: 4 ppm.
2 NaN (c) Related conditions of use. See §§ 520.100, 558.55, and 558.58 of this chapter. (3) Pheasants. (i) Liver and kidney: 1 ppm. NaN
3 NaN NaN NaN (ii) Muscle: 0.5 ppm. NaN
4 NaN NaN NaN (iii) Eggs: NaN
5 NaN NaN NaN (i) Liver: 1 ppm. NaN
6 NaN NaN NaN (ii) Muscle: 0.5 ppm. NaN
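Note that each column in the output above is filled independently, so a row does not tie a leaf to its parents (row 1 pairs "(2) Chickens and turkeys." with "(ii) Fat: 2.0 ppm.", which belongs to cattle). If the nesting matters, one possible approach is to carry the most recent text seen at each level forward. Below is a minimal sketch in plain Python; `flatten_levels` and the sample `items` list are hypothetical names and data made up for illustration. In practice the (level, text) pairs could be built in the loop above from the number at the end of each class name.

```python
def flatten_levels(items):
    """Turn (level, text) pairs, in document order, into one row per leaf."""
    rows = []
    path = {}  # most recent text seen at each indent level
    for i, (level, text) in enumerate(items):
        # Entering a level invalidates that level and anything deeper
        for stale in [k for k in path if k >= level]:
            del path[stale]
        path[level] = text
        # If the next item is not deeper, this item has no children
        next_level = items[i + 1][0] if i + 1 < len(items) else 0
        if next_level <= level:
            rows.append({f"indent-{k}": v for k, v in sorted(path.items())})
    return rows


# Hypothetical sample mirroring part of the section (not fetched from the site)
items = [
    (1, "(b) Tolerances."),
    (2, "(1) Cattle."),
    (3, "(i) Liver, kidney, and muscle: 0.5 ppm."),
    (3, "(ii) Fat: 2.0 ppm."),
    (2, "(2) Chickens and turkeys."),
    (3, "(iii) Eggs:"),
    (4, "(A) Egg yolks: 8 ppm."),
]
rows = flatten_levels(items)
for r in rows:
    print(r)
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per leaf, with NaN wherever a level is absent.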
CodePudding user response:
To walk through the nested divs, the idea is to iterate over each element's direct children. @chitown88's answer might solve your issue and look cleaner, but here is an alternative using findChildren() and nested loops.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"
r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
title = soup.find("h4").text
id2 = []
id3 = []
# Section id (e.g. "556.50"), used to build the id of paragraph (b)
section_id = soup.find('div', {"class": "section"}).get('id')
divGlobal = soup.find('div', {'id': 'p-' + section_id + "(b)"})
# recursive=False restricts each search to direct children only
for lvl1 in divGlobal.findChildren("div", recursive=False):  # (1) level
    for lvl2 in lvl1.findChildren("div", recursive=False):  # (i) level
        if len(lvl2.findChildren("div", recursive=False)) > 0:
            for lvl3 in lvl2.findChildren("div", recursive=False):  # (A) level (eggs in this example)
                id2.append(lvl1.findChildren("p")[0].text)
                id3.append(lvl3.findChildren("p")[0].text)
        else:
            id2.append(lvl1.findChildren("p")[0].text)
            id3.append(lvl2.findChildren("p")[0].text)
df = pd.DataFrame(
    {"h4": [title] * len(id2),
     "indent-2": id2,
     "indent-3": id3
     }
)
I'm deeply sorry for the poor variable names; I have no idea what your data represents.
Output:
h4 indent-2 indent-3
0 § 556.50 Amprolium. (1) Cattle. (i) Liver, kidney, and muscle: 0.5 ppm.
1 § 556.50 Amprolium. (1) Cattle. (ii) Fat: 2.0 ppm.
2 § 556.50 Amprolium. (2) Chickens and turkeys. (i) Liver and kidney: 1 ppm.
3 § 556.50 Amprolium. (2) Chickens and turkeys. (ii) Muscle: 0.5 ppm.
4 § 556.50 Amprolium. (2) Chickens and turkeys. (A) Egg yolks: 8 ppm.
5 § 556.50 Amprolium. (2) Chickens and turkeys. (B) Whole eggs: 4 ppm.
6 § 556.50 Amprolium. (3) Pheasants. (i) Liver: 1 ppm.
7 § 556.50 Amprolium. (3) Pheasants. (ii) Muscle: 0.5 ppm.
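The nested loops above stop at three levels. If deeper or varying nesting is expected, the same walk can probably be written recursively. Here is a rough sketch: to keep it runnable offline it uses the standard library's xml.etree.ElementTree on a hand-written, simplified sample rather than the live response, and it assumes each div holds one p followed by its child divs. `walk` and `SAMPLE` are hypothetical names. With BeautifulSoup the corresponding calls would be find("p", recursive=False) and find_all("div", recursive=False).

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample (an assumption about the page layout)
SAMPLE = """
<div id="p-556.50(b)">
  <p class="indent-1">(b) Tolerances.</p>
  <div id="p-556.50(b)(1)">
    <p class="indent-2">(1) Cattle.</p>
    <div id="p-556.50(b)(1)(i)"><p class="indent-3">(i) Liver, kidney, and muscle: 0.5 ppm.</p></div>
    <div id="p-556.50(b)(1)(ii)"><p class="indent-3">(ii) Fat: 2.0 ppm.</p></div>
  </div>
  <div id="p-556.50(b)(3)">
    <p class="indent-2">(3) Pheasants.</p>
    <div id="p-556.50(b)(3)(i)"><p class="indent-3">(i) Liver: 1 ppm.</p></div>
  </div>
</div>
"""


def walk(div, ancestors=()):
    """Yield one tuple of texts (root to leaf) for every leaf div."""
    p = div.find("p")  # first direct child <p>, if any
    text = "".join(p.itertext()).strip() if p is not None else ""
    path = ancestors + (text,)
    children = div.findall("div")  # direct child divs only
    if not children:
        yield path  # no nested divs: this is a leaf
    for child in children:
        yield from walk(child, path)


root = ET.fromstring(SAMPLE)
paths = list(walk(root))
for path in paths:
    print(path)
```

Each tuple holds the full chain of parents for one leaf, so no class names or fixed depth need to be spelled out; the tuples can be padded to equal length before building a dataframe.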