I'm trying to extract the title from an HTML file, but I cannot seem to grab it. When I strip the text, it only returns the values between > and <.
For example, here's my data output:
< td class = "zentriert" > 1 < /td> <
td class = "zentriert" > 23 < /td> <
td class = "zentriert" > < img alt = "England"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569"
title = "England" / > < /td> <
td class = "zentriert" > < a class = "vereinprofil_tooltip"
href = "/fc-liverpool/startseite/verein/31"
id = "31" > < img alt = "Liverpool FC"
class = ""
src = "https://tmssl.akamaized.net/images/wappen/verysmall/31.png?lm=1456567819"
title = " " / > < /a></td >
<
td class = "zentriert" > 2 < /td> <
td class = "zentriert" > 22 < /td> <
td class = "zentriert" > < img alt = "Morocco"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/107.png?lm=1520611569"
title = "Morocco" / > < br / > < img alt = "Spain"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569"
title = "Spain" / > < /td>
This output is produced when I run the following code:
for i in data.data.values[0:10]:
    soup = BeautifulSoup(i)
    names = soup.find_all('td', class_='zentriert')
    for d in names:
        print(d)
Say I want to extract the value to the right of title=; it should return:
England
" "
Morocco
Spain
But when I try:
for i in data.data.values[0:10]:
    soup = BeautifulSoup(i)
    names = soup.select('.title')  # or soup.find_all('title')
    for d in names:
        print(d)
I get nothing.
Is there a way to extract that information with the first code I posted, i.e. with d['title']?
CodePudding user response:
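.title is a CSS class selector, so select('.title') matches elements with class="title", and find_all('title') looks for <title> tags. To get elements that have a title= attribute you need select('img[title]') or find_all('img', {'title': True}); after that you can use d['title'].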
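To make the difference concrete, here is a small sketch comparing the three lookups. The HTML snippet is made up for illustration (it is not the data from the question), and it assumes bs4 is installed and uses the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only (not the asker's data).
sample = '''
<html><head><title>Page title</title></head>
<body>
<p class="title">Heading text</p>
<img src="flag.png" title="England">
<img src="logo.png">
</body></html>
'''

soup = BeautifulSoup(sample, 'html.parser')

# '.title' is a CSS class selector: it matches elements with class="title".
by_class = soup.select('.title')

# find_all('title') matches <title> tags by tag name.
by_tag = soup.find_all('title')

# 'img[title]' matches <img> tags that carry a title attribute.
by_attr = soup.select('img[title]')

print([t.get_text() for t in by_class])
print([t.get_text() for t in by_tag])
print([t['title'] for t in by_attr])
```

Only the last lookup reaches the attribute value, and the second <img> is skipped because it has no title attribute.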
Minimal working code:
html = '''< td class = "zentriert" > 1 < /td> <
td class = "zentriert" > 23 < /td> <
td class = "zentriert" > < img alt = "England"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569"
title = "England" / > < /td> <
td class = "zentriert" > < a class = "vereinprofil_tooltip"
href = "/fc-liverpool/startseite/verein/31"
id = "31" > < img alt = "Liverpool FC"
class = ""
src = "https://tmssl.akamaized.net/images/wappen/verysmall/31.png?lm=1456567819"
title = " " / > < /a></td >
<
td class = "zentriert" > 2 < /td> <
td class = "zentriert" > 22 < /td> <
td class = "zentriert" > < img alt = "Morocco"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/107.png?lm=1520611569"
title = "Morocco" / > < br / > <img alt = "Spain"
class = "flaggenrahmen"
src = "https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569"
title = "Spain" / > < /td>'''.replace('< ', '<')
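from bs4 import BeautifulSoup

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, 'html.parser')
#names = soup.find_all('img', {'title': True})
names = soup.select('img[title]')
for d in names:
    print(d['title'])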
BTW: Because all images in your HTML have a title=... attribute, you could even reduce it to select('img') or find_all('img').
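If you would rather keep the loop over the td cells from the question, one possible sketch is to search for images inside each cell. Note that d['title'] on the td itself would raise a KeyError, because the title attribute belongs to the <img>, not to the <td>. The HTML below is a shortened, cleaned-up version of the rows from the question (src and href attributes dropped for brevity), and 'html.parser' is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Shortened, cleaned-up rows based on the question's data (assumption: spacing fixed by hand).
html = '''
<td class="zentriert">1</td>
<td class="zentriert"><img alt="England" class="flaggenrahmen" title="England" /></td>
<td class="zentriert"><a class="vereinprofil_tooltip" id="31"><img alt="Liverpool FC" class="" title=" " /></a></td>
<td class="zentriert"><img alt="Morocco" class="flaggenrahmen" title="Morocco" /><br /><img alt="Spain" class="flaggenrahmen" title="Spain" /></td>
'''

soup = BeautifulSoup(html, 'html.parser')

titles = []
for d in soup.find_all('td', class_='zentriert'):
    # A cell may hold zero, one, or several images, so search inside it.
    for img in d.find_all('img'):
        # .get() returns None instead of raising if an image has no title.
        titles.append(img.get('title'))

print(titles)
```

Using img.get('title') instead of img['title'] is a defensive choice in case some image in the real data lacks the attribute.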