How to find the text between <h3> and </h3> in an HTML page with Python

I have an HTML page and need to collect, as a list, the text contained between the <h3> and </h3> tags:

<h3 id="basics">1. Creating a Web Page</h3>
<p>

Once you've made your "home page" (index.html) you can add more pages to
your site, and your home page can link to them.

<h3 id="syntax">>2. HTML Syntax</h3>

I don't know how to write a pattern for this. Please help me get the values "1. Creating a Web Page" and ">2. HTML Syntax".

CodePudding user response：

You can use a library such as BeautifulSoup to parse web pages and pull out the tags you need.

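import requests
from bs4 import BeautifulSoup

# Download the page; replace the placeholder with the real URL.
response = requests.get('url to your page')
response.encoding = 'utf-8'

# The "html5lib" parser is a separate package (pip install html5lib).
sp = BeautifulSoup(response.text, "html5lib")

# Collect every <h3> element on the page and print its text.
list_h3 = sp.find_all('h3')
for h3 in list_h3:
    print(h3.text)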
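
If installing third-party packages is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a minimal illustration, not part of the answer above: the class name H3Collector and the inline sample string are made up for the example, and the sample is a shortened copy of the HTML from the question.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Hypothetical helper: gathers the text found inside each <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False   # True while the parser is inside an <h3> element
        self.parts = []      # text chunks of the heading currently being read
        self.headings = []   # finished heading texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_h3:
            self.headings.append("".join(self.parts))
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.parts.append(data)


# Shortened copy of the HTML from the question (assumption: no network access needed).
sample = """<h3 id="basics">1. Creating a Web Page</h3>
<p>
Once you've made your "home page" (index.html) you can add more pages.
<h3 id="syntax">>2. HTML Syntax</h3>"""

collector = H3Collector()
collector.feed(sample)
print(collector.headings)
```

Because the parser tracks tags rather than raw characters, the stray ">" at the start of the second heading is kept as ordinary text.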

CodePudding user response：

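This should work for a single tag by stripping out the parts that belong to the tag itself. Note that it assumes the heading text contains no ">" character, so it would return an empty string for the second heading in the question (">2. HTML Syntax").

html = "<h3 id='basics'>1. Creating a Web Page</h3>"
text = html.replace("<h3", "").split(">")[1].split("</")[0]
print(text)  # 1. Creating a Web Page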
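
Since the question asks for a pattern, here is a minimal regular-expression sketch that handles several headings at once, including text that starts with ">". The names H3_PATTERN and sample are made up for the example, and the sample is a shortened copy of the HTML from the question. It assumes simple, non-nested <h3> tags; for arbitrary real-world pages an HTML parser is usually the more robust choice.

```python
import re

# Shortened copy of the HTML from the question.
sample = """<h3 id="basics">1. Creating a Web Page</h3>
<p>
Once you've made your "home page" (index.html) you can add more pages.
<h3 id="syntax">>2. HTML Syntax</h3>"""

# <h3, optional attributes up to the first ">", then the shortest possible
# run of text before the closing </h3>. DOTALL lets a heading span lines.
H3_PATTERN = re.compile(r"<h3\b[^>]*>(.*?)</h3>", re.IGNORECASE | re.DOTALL)

headings = [match.strip() for match in H3_PATTERN.findall(sample)]
print(headings)  # ['1. Creating a Web Page', '>2. HTML Syntax']
```

The attribute part uses [^>]* so it stops at the first ">", which is why the extra ">" in the second heading ends up in the captured text rather than being swallowed by the tag.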