Getting specific string with BS4 or Re

Time:03-24

I was trying to get the genres from movie links. The problem is that I'm not good at either bs4 or regex. I tried both and checked similar questions on Stack Overflow, but still couldn't make any progress.

<div data-testid="genres"><a href="/search/title?genres=comedy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf"><span role="presentation">Comedy</span></a><a href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf"><span role="presentation">Drama</span></a><a href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf"><span role="presentation">Romance</span></a></div>

I'm trying to get Comedy, Drama and Romance in this example. I tried to use a regex to get the string between role="presentation"> and </span>, but I failed. Can you show me how to get these?

CodePudding user response:

You could do it like this:

from bs4 import BeautifulSoup as BS
html = """<div  data-testid="genres"><a  href="/search/title?genres=comedy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf"><span  role="presentation">Comedy</span></a><a  href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf"><span  role="presentation">Drama</span></a><a  href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf"><span  role="presentation">Romance</span></a></div>"""

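# Select the spans through the data-testid container, because the spans in the HTML above carry no class attribute to match on.
for span in BS(html, 'lxml').select('div[data-testid="genres"] span'):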
    print(span.get_text())

Output:

Comedy
Drama
Romance
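
Since the question also asks about re, here is a minimal regex sketch for the same snippet. It assumes the markup always looks like the example (a span with role="presentation" directly wrapping the genre name); the names pattern and genres are only illustrative. A regex like this breaks easily when the page markup changes, so the BeautifulSoup version is usually the safer choice.

```python
import re
from html import unescape

# HTML copied from the question, split across lines for readability.
html = (
    '<div data-testid="genres">'
    '<a href="/search/title?genres=comedy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">'
    '<span role="presentation">Comedy</span></a>'
    '<a href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">'
    '<span role="presentation">Drama</span></a>'
    '<a href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">'
    '<span role="presentation">Romance</span></a>'
    '</div>'
)

# Capture whatever sits between role="presentation"> and the next </span>.
# The lazy (.*?) stops at the first closing tag; \s* trims surrounding whitespace.
pattern = re.compile(r'role="presentation">\s*(.*?)\s*</span>', re.DOTALL)

# unescape() turns HTML entities such as &amp; back into plain characters.
genres = [unescape(match) for match in pattern.findall(html)]
print(genres)
```

The \s* parts also make the pattern tolerate pretty-printed HTML where the genre name sits on its own line inside the span.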

CodePudding user response:

html_doc='''
<div  data-testid="genres">
 <a  href="/search/title?genres=comedy&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">     
  <span  role="presentation">
   Comedy
  </span>
 </a>
 <a  href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">      
  <span  role="presentation">
   Drama
  </span>
 </a>
 <a  href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">    
  <span  role="presentation">
   Romance
  </span>
 </a>
</div>
'''

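from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# print(soup.prettify())

# Take every span inside the genres container and strip the surrounding whitespace from its text.
text = [item.text.strip() for item in soup.select('div[data-testid="genres"] a span')]
print(text)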

Output:

['Comedy', 'Drama', 'Romance']
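
If the same extraction has to run over several movie pages, it could be wrapped in a small helper. This is only a sketch: extract_genres is a hypothetical name, the sample string is a shortened version of the HTML in the question (simplified href values), and it assumes every page marks its genres with a div carrying data-testid="genres".

```python
from bs4 import BeautifulSoup


def extract_genres(html):
    """Return the genre names found in html, or an empty list if there is no genres block."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the container by its data-testid attribute.
    container = soup.find("div", attrs={"data-testid": "genres"})
    if container is None:
        # Page without a genres block: return an empty list instead of raising.
        return []
    # get_text(strip=True) removes whitespace around each genre name.
    return [span.get_text(strip=True) for span in container.find_all("span")]


# Shortened sample based on the question's HTML (href values simplified).
sample = (
    '<div data-testid="genres">'
    '<a href="/search/title?genres=comedy"><span role="presentation">Comedy</span></a>'
    '<a href="/search/title?genres=drama"><span role="presentation">Drama</span></a>'
    '</div>'
)

print(extract_genres(sample))
print(extract_genres("<p>no genres here</p>"))
```

Returning an empty list for pages without a genres block keeps a loop over many links from stopping on the first page that is laid out differently.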