How to select individual elements one by one when web scraping with Python

For example, I want only the first h3 (h3[0]) and the second h6 (h6[1]) from this HTML:

<div >
    <h3>Shroot, Stephanie</h3>
    <h6>Chemistry</h6>
    <h6>December 2021</h6>
    <p>Thesis or dissertation
    <h3>Shroot</h3>

I use BeautifulSoup and a for loop to get the information:

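# Runs inside a loop over URLs; `line` and `headers` are defined earlier.
url = line.strip()
response = requests.get(url, headers=headers)  # one request provides both body and status
r_html = response.text
r_html_sc = response.status_code
soup = BeautifulSoup(r_html, "html.parser")
thesis_infos = soup.find('div', {"class": "span16"})
if thesis_infos is not None:
    thesis_infos_text = thesis_infos.text.strip()
else:
    thesis_infos_text = " "
print(thesis_infos_text)
# A str has no readlines(); splitlines() returns the text as a list of lines.
thesis_infos_lines = thesis_infos_text.splitlines()
author1_1 = thesis_infos_lines[0]
# Raises IndexError when the div is missing or has fewer than three lines.
year1_1 = thesis_infos_lines[2]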

CodePudding user response：

Edit: The easiest way is probably to use BeautifulSoup, like so:

soup.find_all("h3")[0]
soup.find_all("h6")[1]
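As a minimal sketch of those two lines applied to the markup from the question (the variable names `html`, `author`, and `date` are only illustrative), `get_text(strip=True)` returns the text without the surrounding tags:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (unclosed tags left as they are).
html = """
<div >
    <h3>Shroot, Stephanie</h3>
    <h6>Chemistry</h6>
    <h6>December 2021</h6>
    <p>Thesis or dissertation
    <h3>Shroot</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns matches in document order, so an index picks one element.
author = soup.find_all("h3")[0].get_text(strip=True)
date = soup.find_all("h6")[1].get_text(strip=True)

print(author)  # Shroot, Stephanie
print(date)    # December 2021
```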

Here is a short example, filtering for links on google.com:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.google.com").text
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
print(links[0])

Is this what you are looking for?

import re

code = """
<div >
    <h3>Shroot, Stephanie</h3>
    <h6>Chemistry</h6>
    <h6>December 2021</h6>
    <p>Thesis or dissertation
    <h3>Shroot</h3>
"""

h3_matches = re.findall(r".*<h3>(.+)</h3>", code)
h6_matches = re.findall(r".*<h6>(.+)</h6>", code)
print(h3_matches[0])
print(h6_matches[1])

output:

Shroot, Stephanie
December 2021
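Indexing a list of matches raises IndexError when a page has fewer tags than expected. One possible guard is sketched below; `nth_match` is a hypothetical helper name, not part of the `re` module, and the non-greedy `(.+?)` is an assumption chosen so that two tags on the same line would not be merged into one match.

```python
import re

code = """
<div >
    <h3>Shroot, Stephanie</h3>
    <h6>Chemistry</h6>
    <h6>December 2021</h6>
    <p>Thesis or dissertation
    <h3>Shroot</h3>
"""


def nth_match(pattern, text, index):
    """Return the captured group of the index-th match, or None if there are too few."""
    matches = re.findall(pattern, text)
    return matches[index] if index < len(matches) else None


print(nth_match(r"<h3>(.+?)</h3>", code, 0))  # Shroot, Stephanie
print(nth_match(r"<h6>(.+?)</h6>", code, 1))  # December 2021
print(nth_match(r"<h6>(.+?)</h6>", code, 5))  # None
```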

CodePudding user response：

    thesis_infos = soup.find('div',{"class":"span16"})
    code = str(thesis_infos)
    h3_matches = re.findall(r".*<h3>(.+)</h3>", code)
    h6_matches = re.findall(r".*<h6>(.+)</h6>", code)
    print(h3_matches[0])
    print(h6_matches[1])
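Since `soup.find` returns a Tag, `find_all` can also be called on that div directly, without converting it to a string first. The sketch below assumes the div carries `class="span16"`, as in the question's code; the function name `extract_author_and_date` is hypothetical, and the sample markup is a trimmed, closed version of the snippet from the question.

```python
from bs4 import BeautifulSoup

# Assumed markup: the question's snippet with the class restored and the div closed.
html = """
<div class="span16">
    <h3>Shroot, Stephanie</h3>
    <h6>Chemistry</h6>
    <h6>December 2021</h6>
</div>
"""


def extract_author_and_date(page_html):
    """Return (first h3 text, second h6 text) from the span16 div, or None for missing parts."""
    soup = BeautifulSoup(page_html, "html.parser")
    thesis_infos = soup.find("div", {"class": "span16"})
    if thesis_infos is None:
        return None, None
    # Search only inside the div, so tags elsewhere on the page are ignored.
    h3_tags = thesis_infos.find_all("h3")
    h6_tags = thesis_infos.find_all("h6")
    author = h3_tags[0].get_text(strip=True) if len(h3_tags) > 0 else None
    date = h6_tags[1].get_text(strip=True) if len(h6_tags) > 1 else None
    return author, date


print(extract_author_and_date(html))
print(extract_author_and_date("<p>no matching div</p>"))
```

The second call returns `(None, None)` instead of raising, which may be convenient inside a loop over many URLs.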