How to extract title of hrefs using BS4?

Time:01-18

I'm parsing Wikipedia, and I need to get the title of each link (href) on the page. The code below gets only the links, but I have no idea how to get only the titles.

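import requests
from bs4 import BeautifulSoup

# url_start holds the URL of the Wikipedia page to parse
response = requests.get(url=url_start)
soup = BeautifulSoup(response.content, "html.parser")
status_code = response.status_code
if status_code == 200:
    for link in soup.find(id="bodyContent").findAll("a"):
        # .get() avoids a KeyError on anchors that have no href
        if "/wiki/" in link.get("href", ""):
            print(link["href"])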

Answer:

In new code, avoid the legacy findAll() syntax and use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
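
As a rough sketch of the two recommended styles, the snippet below runs find_all() and select() against a small made-up HTML fragment (the markup and the variable names are assumptions for illustration, not taken from the real page). Both approaches should collect the same /wiki/ links.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
html = """
<div id="bodyContent">
  <a href="/wiki/Monty_Python" title="Monty Python">Monty Python</a>
  <a href="https://example.org/">External</a>
  <a id="top"></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all(): keep only anchors that have an href, then filter in Python.
via_find_all = [
    a["href"]
    for a in soup.find(id="bodyContent").find_all("a", href=True)
    if "/wiki/" in a["href"]
]

# select(): the CSS selector expresses the same filter in one step.
via_select = [a["href"] for a in soup.select('#bodyContent a[href*="/wiki/"]')]

print(via_find_all)
print(via_select)
```

The select() form is shorter because the attribute selector a[href*="/wiki/"] already skips anchors without an href.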


Select your elements more specifically (for example with CSS selectors) and extract the value of the attribute with get('title'). If a link has no title, this gives you None:

[a.get('title') for a in soup.select('#bodyContent a[href*="/wiki/"]')]
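
To illustrate the None behaviour, here is a minimal sketch on a made-up HTML fragment (the markup is an assumption for illustration). It also shows one possible way to skip links without a title, by adding [title] to the selector.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment: the second link has no title attribute.
html = """
<div id="bodyContent">
  <a href="/wiki/Pythia" title="Pythia">Pythia</a>
  <a href="/wiki/Peithon">Peithon</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# .get('title') returns None when the attribute is missing.
all_titles = [a.get("title") for a in soup.select('#bodyContent a[href*="/wiki/"]')]

# Adding [title] to the selector keeps only links that carry a title.
titles = [a["title"] for a in soup.select('#bodyContent a[href*="/wiki/"][title]')]

print(all_titles)
print(titles)
```

Filtering in the selector keeps the list comprehension free of None checks. A plain `if a.get("title")` condition would work as well.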

Example

import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://de.wikipedia.org/wiki/Python').content, 'html.parser')

[a.get('title') for a in soup.select('#bodyContent a[href*="/wiki/"]')]

Output

['Altgriechische Sprache', 'Python (Mythologie)', 'Pythons', 'Eigentliche Pythons', 'Python (Programmiersprache)', 'Monty Python', 'Python (Schiff, 1935)', 'Peithon', 'Python Vehicles Australia', 'Python (Töpfer)', 'Python (Vasenmaler)', 'Paestanische Vasenmalerei', 'Georges Python', 'Valentine Python', 'Python (Efteling)', 'Colt Python', 'Knicklenker (Fahrrad)', 'Python-3', 'Python-4', 'Python-5', 'wikt:Python', 'Pythia', 'Wikipedia:Begriffsklärung', 'Wikipedia:Kategorien', 'Kategorie:Begriffsklärung']