How do I scrape 3D elements from a html code with BeautifulSoup

Time:07-07

So I noticed that when web scraping, some elements look like this:

<div class='3D"priceInfo-placehold=' data-v-57f40e4a='3D""' er"="">

or like this:

><div data-v-57f40e4a=3D"" class=3D"entertain-box__headSmart">

You can see that in the first one, the 3D has a ' in front of it. I know how to scrape these elements. My question is: what does it mean if there is no ' in front of the 3D, and how do I scrape it? I am especially interested in this specific case, as there is a confusing part in front of the class. This is the code, and I want to get Magenta TV smart flex all in one go. How do I deal with the "data-v-57f40e4a=3D""" part, and how do I get the text?

><div data-v-57f40e4a=3D"" class=3D"entertain-box__headSmart"><span data-v-=
57f40e4a=3D"" class=3D"entertain-box__title">Magenta TV</span><span data-v-=
57f40e4a=3D"" class=3D"entertain-box__subline">smart flex</span></div>

This is my code:

title1 = soup.find_all('div', {'class': '3D"entertain-box__headSmart"'})
for element in title1:
        title = element.get_text()
        print("Found the title: ", title)

Why doesn't it work?

Thank you so much for your help!

CodePudding user response:

This HTML is indeed tricky. Would you mind sharing the URL? There may be an API that this data comes from.

What you could do in the meantime is use a regular expression that matches any class containing that substring:

html = '''<div data-v-57f40e4a=3D"" class=3D"entertain-box__headSmart"><span data-v-= 57f40e4a=3D"" class=3D"entertain-box__title">Magenta TV<span data-v-= 57f40e4a=3D"" class=3D"entertain-box__subline">smart flex'''

from bs4 import BeautifulSoup
import re


soup = BeautifulSoup(html, 'html.parser')
title1 = soup.find_all('div', {'class': re.compile('.*entertain-box__headSmart.*')})

for element in title1:
    title = element.get_text()
    print("Found the title: ", title)

Output:

Found the title:  Magenta TVsmart flex
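
The two spans run together in that output because get_text() joins the strings with no separator by default. A small sketch, assuming the markup has already been cleaned of the =3D sequences (the html string below is a simplified stand-in, not the original page), that passes a separator and strip=True:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the markup in the question (assumption: =3D already removed).
html = ('<div class="entertain-box__headSmart">'
        '<span class="entertain-box__title">Magenta TV</span>'
        '<span class="entertain-box__subline">smart flex</span></div>')

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="entertain-box__headSmart")

# Join the text of all child elements with a single space.
title = box.get_text(" ", strip=True)
print("Found the title:", title)
```

This should give the whole "Magenta TV smart flex" string in one go rather than the two parts glued together.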

CodePudding user response:

The simplest way is to undo the email (quoted-printable) encoding with replace(), since =3D is just an encoded = sign:

soup = BeautifulSoup(html.replace("=3D","="), 'html.parser')

Example

from bs4 import BeautifulSoup

html='''
<div data-v-57f40e4a=3D"" class=3D"entertain-box__headSmart"><span data-v-=
57f40e4a=3D"" class=3D"entertain-box__title">Magenta TV</span><span data-v-=
57f40e4a=3D"" class=3D"entertain-box__subline">smart flex</span></div>
'''


soup = BeautifulSoup(html.replace("=3D","="), 'html.parser')

for e in soup.select('.entertain-box__headSmart'):
    print(e.span.text)

Output

Magenta TV
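
The =3D sequences and the trailing = at the line ends are typical of quoted-printable encoding, so a plain replace() still leaves the split data-v-= attributes in place. A minimal sketch, assuming the page really was saved in that encoding, that uses Python's standard quopri module to undo both; the helper name decode_quoted_printable is made up for illustration:

```python
import quopri

# Sample markup copied from the question, still quoted-printable encoded.
raw = '''<div data-v-57f40e4a=3D"" class=3D"entertain-box__headSmart"><span data-v-=
57f40e4a=3D"" class=3D"entertain-box__title">Magenta TV</span><span data-v-=
57f40e4a=3D"" class=3D"entertain-box__subline">smart flex</span></div>
'''

def decode_quoted_printable(text, charset="utf-8"):
    """Hypothetical helper: turn =3D back into = and rejoin lines
    that were split with a trailing = (soft line break)."""
    return quopri.decodestring(text.encode(charset)).decode(charset)

clean = decode_quoted_printable(raw)
print(clean)
```

The decoded string can then be passed to BeautifulSoup and queried with the normal class name, as in the example above.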