Python: get strings from an HTML page - CodePudding

I need to build a list containing every value that appears inside a title="" attribute, for example:

  title="xxxxx", title="xxx2", title='xxx4', etc...

I need to get xxxxx, xxx2, xxx4

I have used this script to get the HTML page:

  import requests
  import bs4

  # URL
  URL = "https://en.wikipedia.org/wiki/Main_Page"

  # sending the request
  response = requests.get(URL)

  # parsing the response
  soup = bs4.BeautifulSoup(response.text, 'html.parser')

Printing soup shows the complete HTML document. Now I would like to get every value in soup that appears inside the attribute

  title="".

CodePudding user response：

Another common option besides bs4 is regular expressions, via Python's built-in "re" module.
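
As a rough illustration of the regex route, here is a minimal sketch. The sample markup and the names html, TITLE_RE and titles are made up for the example; in the question, the input would be response.text. The pattern is an assumption about how the attributes are written (quoted with " or '), and it will also match look-alikes such as data-title="...", so treat it as a quick hack rather than a real HTML parser.

```python
import re

# Hypothetical sample markup, standing in for response.text
html = """<a href="/wiki/A" title="xxxxx">A</a>
<img src="b.png" title='xxx4'>
<a title="xxx2" href="/wiki/C">C</a>"""

# Match title="..." or title='...'; the backreference \1 requires the
# closing quote to be the same character as the opening quote
TITLE_RE = re.compile(r'''\btitle\s*=\s*(["'])(.*?)\1''', re.IGNORECASE | re.DOTALL)

# Group 2 holds the text between the quotes
titles = [m.group(2) for m in TITLE_RE.finditer(html)]
print(titles)  # ['xxxxx', 'xxx4', 'xxx2']
```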

To answer your question directly, I pulled this quote from the documentation, located at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ :

Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
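
Applied to the question, the same find_all call can filter on the title attribute. The following is only a sketch: html_doc is a made-up snippet standing in for response.text, and the variable names are illustrative. Passing title=True asks Beautiful Soup for every tag that has a title attribute, regardless of its value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for response.text from the question
html_doc = """
<p><a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>,
the <a href="/wiki/Free_content" title="Free content">free</a>
<a href="/wiki/Encyclopedia">encyclopedia</a></p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# title=True keeps only tags that carry a title attribute, whatever its value
tags_with_title = soup.find_all(title=True)

# Read the attribute value off each matching tag
titles = [tag["title"] for tag in tags_with_title]
print(titles)  # ['Wikipedia', 'Free content']
```

The third link has no title attribute, so it is skipped.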

CodePudding user response：

To get all elements with a title attribute, you could use CSS selectors, for example:

soup.select('[title]')

[<link href="/w/api.php?action=featuredfeed&amp;feed=potd&amp;feedformat=atom" rel="alternate" title="Wikipedia picture of the day feed" type="application/atom+xml"/>, <link href="/w/api.php?action=featuredfeed&amp;feed=featured&amp;feedformat=atom" rel="alternate" title="Wikipedia featured articles feed" type="application/atom+xml"/>, <link href="/w/api.php?action=featuredfeed&amp;feed=onthisday&amp;feedformat=atom" rel="alternate" title='Wikipedia "On this day..." feed' type="application/atom+xml"/>, <link href="/w/opensearch_desc.php" rel="search" title="Wikipedia (en)" type="application/opensearchdescription+xml"/>, <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>, <a href="/wiki/Free_content" title="Free content">free</a>, <a href="/wiki/Encyclopedia" title="Encyclopedia">encyclopedia</a>, <a href="/wiki/Help:Introduction_to_Wikipedia" title="Help:Introduction to Wikipedia">anyone can edit</a>, <a href="/wiki/Special:Statistics" title="Special:Statistics">6,491,366</a>, <a href="/wiki/English_language" title="English language">English</a>, <a href="/wiki/Battle_of_Oroscopa" title="Battle of Oroscopa">Battle of Oroscopa</a>, <a href="/wiki/Ancient_Carthage" title="Ancient Carthage">Carthaginian</a>,...]

To create a list of the title attribute values of these elements:

[t.get('title') for t in soup.select('[title]')]

['Wikipedia picture of the day feed', 'Wikipedia featured articles feed', 'Wikipedia "On this day..." feed', 'Wikipedia (en)', 'Wikipedia', 'Free content', 'Encyclopedia', 'Help:Introduction to Wikipedia', 'Special:Statistics', 'English language', 'Battle of Oroscopa', 'Ancient Carthage', 'Hasdrubal the Boetharch', 'Numidia', 'Masinissa', 'Roman Republic', 'Carthage', 'Third Punic War', 'Battle of Oroscopa', 'Paige Bueckers', 'Uroš Drenović', '1921–22 Cardiff City F.C. season', "Wikipedia:Today's featured article/April 2022", 'mail:daily-article-l', 'Wikipedia:Featured articles', 'Martin Fehérváry', 'Martin Fehérváry', 'Swedish Hockey League', '1917 Odessa City Duma election', 'Odessa', 'List of people from Manchester', 'Geko (rapper)', 'K Koke', 'Elvis Costello', 'My Aim Is True', 'Bab el-Gasus', 'Jaega Wise', 'James Blunt', 'The Persistence of Chaos', 'Dave Frederick', 'Sussex County, Delaware', 'Wikipedia:Recent additions', 'Help:Your first article', 'Template talk:Did you know', 'Elon Musk in 2018', 'Twitter',...]

To avoid duplicates, use a set (note that a set does not preserve the original order):

set(t.get('title') for t in soup.select('[title]'))
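
If the page order of the titles matters, one possible alternative to a set is dict.fromkeys, which drops repeats while keeping first occurrences in order (dicts preserve insertion order in Python 3.7+). In this sketch, the titles list is a hand-written stand-in for the output of the list comprehension above, using a few values from the sample output.

```python
# Hypothetical stand-in for [t.get('title') for t in soup.select('[title]')]
titles = [
    'Battle of Oroscopa',
    'Ancient Carthage',
    'Battle of Oroscopa',
    'Martin Fehérváry',
    'Martin Fehérváry',
]

# dict keys are unique and keep insertion order, so this removes
# duplicates without shuffling the remaining titles
unique_titles = list(dict.fromkeys(titles))
print(unique_titles)  # ['Battle of Oroscopa', 'Ancient Carthage', 'Martin Fehérváry']
```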