For example, I want only the first h3 and the second h6 (h3[0] and h6[1]) from this markup:
<div >
<h3>Shroot, Stephanie</h3>
<h6>Chemistry</h6>
<h6>December 2021</h6>
<p>Thesis or dissertation
<h3>Shroot</h3>
I use BeautifulSoup and a for loop to get the information:
url = line.strip()
# Fetch the page once and reuse the response for both body and status code
response = requests.get(url, headers=headers)
r_html = response.text
r_html_sc = response.status_code
soup = BeautifulSoup(r_html, "html.parser")
thesis_infos = soup.find('div',{"class":"span16"})
if thesis_infos is not None:
    thesis_infos_text = thesis_infos.text.strip()
else:
    thesis_infos_text = " "
print(thesis_infos_text)
# A str has no readlines(); splitlines() returns the list of lines
thesis_infos_lines = thesis_infos_text.splitlines()
author1_1 = thesis_infos_lines[0]
year1_1 = thesis_infos_lines[2]
CodePudding user response:
Edit: The easiest way is probably to use BeautifulSoup, like so:
soup.find_all("h3")[0]
soup.find_all("h6")[1]
Here is a short example, filtering for links on google.com:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.google.com").text
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")
print(links[0])
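Applied to the markup from the question, the same idea might look like the sketch below. It assumes the stripped class on the div is "span16" (taken from the question's soup.find call) and adds a closing </div> so the sample is self-contained; the length checks are an addition so a page with fewer tags does not raise IndexError.

```python
from bs4 import BeautifulSoup

# Sample markup from the question. The class "span16" and the closing
# </div> are assumptions added to make the snippet self-contained.
html = """
<div class="span16">
<h3>Shroot, Stephanie</h3>
<h6>Chemistry</h6>
<h6>December 2021</h6>
<p>Thesis or dissertation
<h3>Shroot</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"class": "span16"})

author = year = None
if container is not None:
    # Search only inside the div, not the whole page
    h3_tags = container.find_all("h3")
    h6_tags = container.find_all("h6")
    if h3_tags:
        author = h3_tags[0].get_text(strip=True)
    if len(h6_tags) > 1:
        year = h6_tags[1].get_text(strip=True)

print(author)
print(year)
```

Searching from the container rather than from soup keeps unrelated h3 or h6 tags elsewhere on the page from shifting the indexes.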
Is this what you are looking for?
import re
code = """
<div >
<h3>Shroot, Stephanie</h3>
<h6>Chemistry</h6>
<h6>December 2021</h6>
<p>Thesis or dissertation
<h3>Shroot</h3>
"""
h3_matches = re.findall(".*<h3>(.+)<\\/h3>", code)
h6_matches = re.findall(".*<h6>(.+)<\\/h6>", code)
print(h3_matches[0])
print(h6_matches[1])
output:
Shroot, Stephanie
December 2021
CodePudding user response:
thesis_infos = soup.find('div',{"class":"span16"})
code = str(thesis_infos)
h3_matches = re.findall(".*<h3>(.+)<\\/h3>", code)
h6_matches = re.findall(".*<h6>(.+)<\\/h6>", code)
print(h3_matches[0])
print(h6_matches[1])
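One caveat with indexing the match lists directly: if the div is missing, str(thesis_infos) is the string "None", nothing matches, and h3_matches[0] raises IndexError. A minimal sketch of a guard follows; nth_tag_text is a hypothetical helper name, not part of re or BeautifulSoup, and it only handles tags without attributes that open and close on the same line.

```python
import re

def nth_tag_text(html, tag, index):
    """Return the text of the index-th <tag>...</tag> pair, or None if absent.

    Hypothetical helper: assumes attribute-free tags that open and close
    on one line, as in the sample markup from the question.
    """
    name = re.escape(tag)
    matches = re.findall(rf"<{name}>(.+?)</{name}>", html)
    if 0 <= index < len(matches):
        return matches[index].strip()
    return None

code = """
<div >
<h3>Shroot, Stephanie</h3>
<h6>Chemistry</h6>
<h6>December 2021</h6>
<p>Thesis or dissertation
<h3>Shroot</h3>
"""

print(nth_tag_text(code, "h3", 0))
print(nth_tag_text(code, "h6", 1))
# A missing div stringifies to "None", which yields no matches
print(nth_tag_text(str(None), "h3", 0))
```

For anything beyond simple markup like this, parsing with BeautifulSoup is likely more robust than a regex.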