Specifying an HTML nested structure in Python

I am trying to scrape data from the following website: https://www.ecfr.gov/on/2022-04-08/title-21/chapter-I/subchapter-E/part-556/subpart-B/section-556.50

Note that there is a nested structure (Tolerances -> Cattle -> Liver and muscle). This is also just one of many sections in this legislation.

There is a "Developer Tools" option, but I am having trouble keeping the nested structure when I parse its output: https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50

I would like to convert this HTML to a pandas dataframe, keeping the nested structure. For example:

h4	Indent-2	Indent-3
Amprolium	(1) Cattle	(i) Liver, kidney, and muscle: 0.5 ppm.

The problem is that class "Indent-3" should be nested in "Indent-2", which should be nested in "h4". I can create the desired data by specifying each class name, but if I want to loop through the sections, I don't want to have to specify each class name.

Is there a more general way (without specifying the class name) to produce the dataframe? This is my code so far.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"

r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")

title = soup.find("h4").text
id2 = soup.find("div", attrs = {"id":"p-556.50(b)(1)"}).find(attrs = {"class":"indent-2"}).text
id3 = soup.find("div", attrs = {"id":"p-556.50(b)(1)(i)"}).find(attrs = {"class":"indent-3"}).text

df = pd.DataFrame(data = {"h4":[title],
                          "indent-2":[id2],
                          "indent-3":[id3]})

CodePudding user response：

Are you looking for something like this?

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"

r = requests.get(url)
soup = BeautifulSoup(r.content,"xml")

title = soup.find("h4").text
indents = soup.find_all(attrs = {'class':re.compile("^indent-")})

row = {'h4':[title]}

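for indent in indents:
    print(indent)  # debug: show each matched element

    # With the "xml" parser, indent['class'] is a plain string such as "indent-2".
    key = 'indent-' + indent['class'].split('-')[-1]
    if key not in row:
        row[key] = []

    row[key].append(indent.text.strip())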

df = pd.concat([pd.DataFrame({k:v}) for k,v in row.items()], axis=1)

Output:

print(df.to_string())
                    h4                                                                             indent-1                   indent-2                                 indent-3                indent-4
0  § 556.50 Amprolium.                                                                       (a) [Reserved]                (1) Cattle.  (i) Liver, kidney, and muscle: 0.5 ppm.   (A) Egg yolks: 8 ppm.
1                  NaN                                   (b) Tolerances.  The tolerances for amprolium are:  (2) Chickens and turkeys.                       (ii) Fat: 2.0 ppm.  (B) Whole eggs: 4 ppm.
2                  NaN  (c) Related conditions of use.  See §§ 520.100, 558.55, and 558.58 of this chapter.             (3) Pheasants.             (i) Liver and kidney: 1 ppm.                     NaN
3                  NaN                                                                                  NaN                        NaN                    (ii) Muscle: 0.5 ppm.                     NaN
4                  NaN                                                                                  NaN                        NaN                              (iii) Eggs:                     NaN
5                  NaN                                                                                  NaN                        NaN                        (i) Liver: 1 ppm.                     NaN
6                  NaN                                                                                  NaN                        NaN                    (ii) Muscle: 0.5 ppm.                     NaN
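Note that the columns in this output are stacked independently, so a row no longer says which "(1) Cattle." a given "(i) ..." belongs to. One way to keep the parent/child relationship, without naming any class, is to walk the indents in document order and carry the most recent text at each level forward. The sketch below is only an illustration under assumptions: `rows_from_levels` is a hypothetical helper, and the hard-coded `sample` list stands in for pairs that could be built in the loop above as `(int(indent['class'].split('-')[-1]), indent.text.strip())`.

```python
def rows_from_levels(items):
    """Turn (level, text) pairs in document order into one row per leaf item."""
    rows = []
    current = {}  # last text seen at each level on the current path
    for pos, (level, text) in enumerate(items):
        # A new item at this level invalidates anything deeper on the old path.
        current = {lvl: txt for lvl, txt in current.items() if lvl < level}
        current[level] = text
        # The item is a leaf when the next item is not nested deeper.
        is_last = pos + 1 == len(items)
        if is_last or items[pos + 1][0] <= level:
            rows.append({f"indent-{lvl}": txt for lvl, txt in sorted(current.items())})
    return rows


# Hypothetical sample standing in for the scraped (level, text) pairs.
sample = [
    (1, "(b) Tolerances."),
    (2, "(1) Cattle."),
    (3, "(i) Liver, kidney, and muscle: 0.5 ppm."),
    (3, "(ii) Fat: 2.0 ppm."),
    (2, "(2) Chickens and turkeys."),
    (3, "(iii) Eggs:"),
    (4, "(A) Egg yolks: 8 ppm."),
]
rows = rows_from_levels(sample)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should give one row per leaf, with NaN in any level a given leaf does not have.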

CodePudding user response：

To walk through the nested divs, the idea is to iterate over the direct children of each element. @chitown88's answer might solve your issue and look cleaner, but here is an alternative using findChildren() and nested loops.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = r"https://www.ecfr.gov/api/renderer/v1/content/enhanced/2022-04-08/title-21?part=556&section=556.50"

r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
df = pd.DataFrame()

title = soup.find("h4").text
id2 = []
id3 = []

# Named section_id to avoid shadowing the built-in id().
section_id = soup.find('div', {"class": "section"}).get('id')
divGlobal = soup.find('div', {'id': 'p-' + section_id + "(b)"})

for lvl1 in divGlobal.findChildren("div", recursive=False):  # (1) level

    for lvl2 in lvl1.findChildren("div", recursive=False):  # (i) level

        if len(lvl2.findChildren("div", recursive=False)) > 0:
            for lvl3 in lvl2.findChildren("div", recursive=False):  # (A) level (eggs in this example)
                id2.append(lvl1.findChildren("p")[0].text)
                id3.append(lvl3.findChildren("p")[0].text)

        else:
            id2.append(lvl1.findChildren("p")[0].text)
            id3.append(lvl2.findChildren("p")[0].text)


df = pd.DataFrame(
    {"h4": [title for i in range(len(id2))],
     "indent-2": id2,
     "indent-3": id3
     }
)

I'm sorry for the poor variable names; I have no idea what your data represents.

Output:

                    h4                    indent-2                                  indent-3
0  § 556.50 Amprolium.                (1) Cattle.   (i) Liver, kidney, and muscle: 0.5 ppm. 
1  § 556.50 Amprolium.                (1) Cattle.                        (ii) Fat: 2.0 ppm. 
2  § 556.50 Amprolium.  (2) Chickens and turkeys.              (i) Liver and kidney: 1 ppm. 
3  § 556.50 Amprolium.  (2) Chickens and turkeys.                     (ii) Muscle: 0.5 ppm. 
4  § 556.50 Amprolium.  (2) Chickens and turkeys.                     (A) Egg yolks: 8 ppm. 
5  § 556.50 Amprolium.  (2) Chickens and turkeys.                    (B) Whole eggs: 4 ppm. 
6  § 556.50 Amprolium.             (3) Pheasants.                         (i) Liver: 1 ppm. 
7  § 556.50 Amprolium.             (3) Pheasants.                     (ii) Muscle: 0.5 ppm.
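
The nested loops above fix the depth at three levels. If other sections nest deeper, a recursive walk might generalize better. The following is a minimal sketch, not a drop-in replacement: the `tree` dictionary is a hypothetical stand-in for a div and its direct child divs (what `findChildren("div", recursive=False)` returns), and `leaf_paths` is an illustrative name.

```python
def leaf_paths(node, path=()):
    """Yield the tuple of texts from the root down to each leaf of the tree."""
    path = path + (node["text"],)
    if not node["children"]:
        # No nested children: this node is a leaf, so the path is complete.
        yield path
        return
    for child in node["children"]:
        yield from leaf_paths(child, path)


# Hypothetical tree mirroring part of the section's nesting.
tree = {
    "text": "(b) Tolerances.",
    "children": [
        {"text": "(1) Cattle.", "children": [
            {"text": "(i) Liver, kidney, and muscle: 0.5 ppm.", "children": []},
            {"text": "(ii) Fat: 2.0 ppm.", "children": []},
        ]},
        {"text": "(2) Chickens and turkeys.", "children": [
            {"text": "(iii) Eggs:", "children": [
                {"text": "(A) Egg yolks: 8 ppm.", "children": []},
            ]},
        ]},
    ],
}
paths = list(leaf_paths(tree))
for p in paths:
    print(" -> ".join(p))
```

Each path has as many entries as the leaf is deep, so the depth never has to be written into the loops.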