Extracting title from HTML

Time:11-04

I'm trying to extract the title from an HTML file, but I cannot seem to grab it. When I strip the text, it only returns the values between > and <.

For example, here's my data output:

< td class = "zentriert" > 1 < /td> <
    td class = "zentriert" > 23 < /td> <
    td class = "zentriert" > < img alt = "England"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569"
title = "England" / > < /td> <
    td class = "zentriert" > < a class = "vereinprofil_tooltip"
href = "/fc-liverpool/startseite/verein/31"
id = "31" > < img alt = "Liverpool FC"
class = ""
src = "https://tmssl.akamaized.net/images/wappen/verysmall/31.png?lm=1456567819"
title = " " / > < /a></td >
    <
    td class = "zentriert" > 2 < /td> <
    td class = "zentriert" > 22 < /td> <
    td class = "zentriert" > < img alt = "Morocco"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/107.png?lm=1520611569"
title = "Morocco" / > < br / > < img alt = "Spain"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569"
title = "Spain" / > < /td>

This output is produced when I run the following code:

for i in data.data.values[0:10]:
    soup = BeautifulSoup(i)
    names = soup.find_all('td', class_='zentriert')
    for d in names:
        print(d)

Say I want to extract the value to the right of title=. It should return:

England
" "
Morocco
Spain

But when I try:

for i in data.data.values[0:10]:
    soup = BeautifulSoup(i)
    names = soup.select('.title') # or soup.find_all('title')
    for d in names:
        print(d)

I get nothing.

Is there a way to extract that information with the first code I posted, i.e. d['title']?

CodePudding user response:

.title means class="title", but to get elements that have a title= attribute you need

select('img[title]') or find_all('img', {'title': True})

and later you can use d['title']
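
To make the difference between the selectors concrete, here is a small sketch. The HTML snippet is made up for illustration (it is not the asker's data), and 'html.parser' is just one parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a <title> tag, a class="title" element,
# and two images, only one of which has a title attribute.
html = '''
<html><head><title>Page title</title></head><body>
<p class="title">Heading paragraph</p>
<img alt="England" title="England"/>
<img alt="No title here"/>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

by_class = soup.select('.title')      # elements with class="title"
by_tag = soup.find_all('title')       # <title> tags
by_attr = soup.select('img[title]')   # <img> tags that have a title attribute

print([el.name for el in by_class])
print([el.name for el in by_tag])
print([el['title'] for el in by_attr])
```

Only the last selector looks at the title attribute, which is why the first two return nothing useful for data that contains no class="title" element and no <title> tag.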


Minimal working code

html = '''< td class = "zentriert" > 1 < /td> <
    td class = "zentriert" > 23 < /td> <
    td class = "zentriert" > < img alt = "England"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569"
title = "England" / > < /td> <
    td class = "zentriert" > < a class = "vereinprofil_tooltip"
href = "/fc-liverpool/startseite/verein/31"
id = "31" > < img alt = "Liverpool FC"
class = ""
src = "https://tmssl.akamaized.net/images/wappen/verysmall/31.png?lm=1456567819"
title = " " / > < /a></td >
    <
    td class = "zentriert" > 2 < /td> <
    td class = "zentriert" > 22 < /td> <
    td class = "zentriert" > < img alt = "Morocco"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/107.png?lm=1520611569"
title = "Morocco" / > < br / > <img alt = "Spain"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569"
title = "Spain" / > < /td>'''.replace('< ', '<')

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)

#names = soup.find_all('img', {'title': True})
names = soup.select('img[title]')

for d in names:
    print(d['title'])

BTW:

Since all images in your HTML have a title= attribute, you could even reduce it to select('img') or find_all('img').
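
If you would rather keep the loop over the td cells from your first snippet, something like the sketch below might work. It assumes a cleaned-up version of the markup (the sample here is shortened and hypothetical), and it uses .get('title') instead of d['title'] so that an image without the attribute returns None rather than raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical, cleaned-up sample modeled on the data in the question.
html = '''
<td class="zentriert">1</td>
<td class="zentriert"><img alt="England" class="flaggenrahmen" title="England"/></td>
<td class="zentriert"><a class="vereinprofil_tooltip" id="31"><img alt="Liverpool FC" class="" title=" "/></a></td>
<td class="zentriert"><img alt="Morocco" class="flaggenrahmen" title="Morocco"/><br/><img alt="Spain" class="flaggenrahmen" title="Spain"/></td>
'''

soup = BeautifulSoup(html, 'html.parser')

titles = []
for td in soup.find_all('td', class_='zentriert'):
    # A cell may hold zero, one, or several images, possibly nested in <a>.
    for img in td.find_all('img'):
        title = img.get('title')  # None if the attribute is missing
        if title is not None:
            titles.append(title)

print(titles)
```

Calling d['title'] directly on a td would fail, because the attribute belongs to the img inside the cell and not to the td itself.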
