I'm trying to get the genres from movie links. The problem is that I'm not good with either bs4 or regex. I tried both and checked similar questions on Stack Overflow, but still couldn't make any progress.
<div data-testid="genres"><a href="/search/title?genres=comedy&explore=title_type,genres&ref_=tt_ov_inf"><span role="presentation">Comedy</span></a><a href="/search/title?genres=drama&explore=title_type,genres&ref_=tt_ov_inf"><span role="presentation">Drama</span></a><a href="/search/title?genres=romance&explore=title_type,genres&ref_=tt_ov_inf"><span role="presentation">Romance</span></a></div>
In this example I'm trying to get Comedy, Drama and Romance. I tried to capture the string between role="presentation"> and </span> with a regex, but I failed. Can you show me how to get these?
CodePudding user response:
You could do it like this:
from bs4 import BeautifulSoup as BS
html = """<div data-testid="genres"><a href="/search/title?genres=comedy&explore=title_type,genres&ref_=tt_ov_inf"><span role="presentation">Comedy</span></a><a href="/search/title?genres=drama&explore=title_type,genres&ref_=tt_ov_inf"><span role="presentation">Drama</span></a><a href="/search/title?genres=romance&explore=title_type,genres&ref_=tt_ov_inf"><span role="presentation">Romance</span></a></div>"""
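# The spans in this snippet carry no class, so select them through the genres container.
for span in BS(html, 'lxml').select('div[data-testid="genres"] span'):
    print(span.get_text())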
Output:
Comedy
Drama
Romance
CodePudding user response:
html_doc='''
<div data-testid="genres">
<a href="/search/title?genres=comedy&explore=title_type,genres&ref_=tt_ov_inf">
<span role="presentation">
Comedy
</span>
</a>
<a href="/search/title?genres=drama&explore=title_type,genres&ref_=tt_ov_inf">
<span role="presentation">
Drama
</span>
</a>
<a href="/search/title?genres=romance&explore=title_type,genres&ref_=tt_ov_inf">
<span role="presentation">
Romance
</span>
</a>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
# Select each <span> inside the genre links and strip the surrounding whitespace.
text = [item.text.strip() for item in soup.select('div[data-testid="genres"] a span')]
print(text)
Output:
['Comedy', 'Drama', 'Romance']
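Since the question also asked about regex, here is a minimal sketch of that route using only the standard library. It assumes every genre sits in a span written exactly as role="presentation" with no other attributes, as in the snippet from the question; the names html, pattern and genres are just illustrative. Regex tends to break as soon as the markup changes, so the bs4 selectors above are probably the safer choice.

```python
import re

# Sample markup copied from the question (split across lines only for readability).
html = (
    '<div data-testid="genres">'
    '<a href="/search/title?genres=comedy&explore=title_type,genres&ref_=tt_ov_inf">'
    '<span role="presentation">Comedy</span></a>'
    '<a href="/search/title?genres=drama&explore=title_type,genres&ref_=tt_ov_inf">'
    '<span role="presentation">Drama</span></a>'
    '<a href="/search/title?genres=romance&explore=title_type,genres&ref_=tt_ov_inf">'
    '<span role="presentation">Romance</span></a>'
    '</div>'
)

# Lazily capture whatever sits between the opening span and the next </span>.
# The \s* on both sides drops surrounding whitespace, and re.DOTALL lets the
# match cross line breaks in case the markup is pretty-printed.
pattern = re.compile(r'<span role="presentation">\s*(.*?)\s*</span>', re.DOTALL)

genres = pattern.findall(html)
print(genres)
```

With one capture group, findall returns just the captured text of each match, so genres ends up as a plain list of strings.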