I have an IMDb link from which I want to extract only the genre. I'm currently using this code:
```
# genre
genre = movie.find('span',class_="genre")
if genre != None:
    genre = str(genre).split(', <p >')[0].replace("\n", "").replace("</p>]", "")
else:
    genre = "Not Found"
IMDB_dict[title].append(genre)
```
This gives me output like
Drama,Fantasy,Horror
as seen in the picture:
![here][1]
But I want to output only Drama, Fantasy, Horror and not the other stuff above.
May I please know how to do this? I have put some regex code in there to find it, but it still returns some kind of URL, as seen above.
I'd appreciate any help.
[1]: https://i.stack.imgur.com/OEmjy.png
CodePudding user response:
If you're using BeautifulSoup, you can use this:
genre_text = BeautifulSoup(genre).text
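For a fuller picture, here is a minimal sketch of the same idea applied directly to the tag that `find` returns, so there is no need to convert it to a string first. The `html` sample is a made-up stand-in for one IMDb list item (the real page markup may differ), and `genre_text` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one IMDb list item; the real page may differ.
html = '<div><span class="genre">\nDrama, Fantasy, Horror            </span></div>'
movie = BeautifulSoup(html, "html.parser")

genre = movie.find("span", class_="genre")
# get_text(strip=True) returns only the text inside the tag, without the markup
# and without the surrounding whitespace.
genre_text = genre.get_text(strip=True) if genre is not None else "Not Found"
print(genre_text)  # Drama, Fantasy, Horror
```

This keeps the "Not Found" fallback from the question and avoids the chain of `split` and `replace` calls.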
CodePudding user response:
This regex captures the content between a single opening and closing HTML tag, using the capture group (.*):
^<.*>(.*)<\/.*>$
Python:
import re
genre = '<span >Drama,Fantasy,Horror </span>'
regx = r'^<.*>(.*)<\/.*>$'
text = re.findall(regx, genre)
print(text)
Output:
['Drama,Fantasy,Horror ']
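Note that the pattern above is greedy and anchored, so it assumes the string is exactly one tag pair, and the result keeps the trailing space. A slightly more defensive variant might look like the sketch below; the names `match`, `genre_text` and `genres` are only illustrative, and the sample string is the one used above rather than real IMDb markup.

```python
import re

# Sample string copied from the example above; real markup may carry attributes.
genre = '<span >Drama,Fantasy,Horror </span>'

# [^>]* keeps the opening-tag match from running past the first '>',
# and the lazy (.*?) stops at the first closing tag.
match = re.search(r'<[^>]*>(.*?)</[^>]*>', genre, flags=re.DOTALL)
genre_text = match.group(1).strip() if match else "Not Found"

# Optionally split into a clean list of genres.
genres = [g.strip() for g in genre_text.split(",")]

print(genre_text)  # Drama,Fantasy,Horror
print(genres)      # ['Drama', 'Fantasy', 'Horror']
```

For anything beyond a single simple tag, an HTML parser is usually the safer choice than a regex.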