I want to write a program that counts the likes of a YouTube channel. This is my code.
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://filmot.com/channel/UCX6OQ3DkcsbYNE6H8uQQuVA")
soup = BeautifulSoup(r.text , "html.parser")
val=soup.find_all("span",attrs={"class":"badge"})
res = re.findall(r"class=\"fa fa-thumbs-up\"></i>(.*)\<" , str(val))
print(res)
But it returns this result:
['404.1K</span>, <span >Entertainment</span>, <span >8m1s</span>, <span >18 Dec 2021</span>, <span ><i aria-hidden="true" ></i>10M</span>, <span ><i aria-hidden="true" ></i>957.2K</span>, <span >Entertainment</span>, <span >12m9s</span>, <span >16 Dec 2021</span>, <span ><i aria-hidden="true" ></i>14.6M</span>, <span ><i aria-hidden="true" ></i>1.4M</span>, <span >Entertainment</span>, <span >12m4s</span>, <span >10 Dec 2021</span>, <span ><i aria-hidden="true" ></i>11.3M</span>, <span ><i aria-hidden="true" ></i>1.1M</span>, <span ><i aria-hidden="true" ></i>5.1K</span>, <span >Entertainment</span>, <span >11m1s</span>, <span >24 Nov 2021</span>, <span ><i aria-hidden="true" ></i>17.5M</span>, <span ><i aria-hidden="true" ></i>2.8M</span>, <span ><i aria-hidden="true" ></i>3.5K</span>, <span >Entertainment</span>, <span >25m41s</span>, <span >29 Oct 2021</span>, <span ><i aria-hidden="true" ></i>17M</span>, <span ><i aria-hidden="true" ></i>2M</span>, <span ><i aria-hidden="true" ></i>6K</span>, <span >Entertainment</span>, <span >4m55s</span>, <span >23 Oct 2021</span>, <span ><i aria-hidden="true" ></i>19.4M</span>, <span ><i aria-hidden="true" ></i>1.4M</span>, <span ><i aria-hidden="true" ></i>12.5K</span>, <span >Entertainment</span>, <span >15m42s</span>, <span >12 Oct 2021</span>, <span ><i aria-hidden="true" ></i>127.7K</span>, <span ><i aria-hidden="true" ></i>15.3K</span>, <span >Entertainment</span>, <span >5m20s</span>, <span >26 Sep 2021</span>, <span ><i aria-hidden="true" ></i>7.7M</span>, <span ><i aria-hidden="true" ></i>777.1K</span>, <span ><i aria-hidden="true" ></i>6.1K</span>, <span >Entertainment</span>, <span >8m2s</span>, <span >04 Sep 2021</span>, <span ><i aria-hidden="true" ></i>48.4M</span>, <span ><i aria-hidden="true" ></i>2.5M</span>, <span ><i aria-hidden="true" ></i>24.1K</span>, <span >Entertainment</span>, <span >12m40s</span>, <span >31 Aug 2021</span>, <span ><i aria-hidden="true" ></i>69.8M</span>, <span ><i 
aria-hidden="true" ></i>3M</span>, <span ><i aria-hidden="true" ></i>38.6K</span>, <span >Entertainment</span>, <span >19m25s</span>, <span >07 Aug 2021</span>, <span ><i aria-hidden="true" ></i>53.3M</span>, <span ><i aria-hidden="true" ></i>2.2M</span>, <span ><i aria-hidden="true" ></i>29.1K</span>, <span >Entertainment</span>, <span >16m40s</span>, <span >24 Jul 2021</span>, <span ><i aria-hidden="true" ></i>44.6M</span>, <span ><i aria-hidden="true" ></i>1.7M</span>, <span ><i aria-hidden="true" ></i>21.4K</span>, <span >Entertainment</span>, <span >10m45s</span>, <span >10 Jul 2021</span>, <span ><i aria-hidden="true" ></i>42.2M</span>, <span ><i aria-hidden="true" ></i>1.7M</span>, <span ><i aria-hidden="true" ></i>24.1K</span>, <span >Entertainment</span>, <span >11m34s</span>, <span >26 Jun 2021</span>, <span ><i aria-hidden="true" ></i>53.6M</span>, <span ><i aria-hidden="true" ></i>1.8M</span>, <span ><i aria-hidden="true" ></i>30.6K</span>, <span >Entertainment</span>, <span >12m33s</span>, <span >12 Jun 2021</span>, <span ><i aria-hidden="true" ></i>49.5M</span>, <span ><i aria-hidden="true" ></i>1.9M</span>, <span ><i aria-hidden="true" ></i>29.2K</span>, <span ....
I tested the pattern on regex101.com and the result was correct there.
CodePudding user response:
If you want to use a regex, a positive lookbehind would be best in this case, e.g.
(?<=class=\"fa fa-thumbs-up\"></i>)[\d\w.]+
as in res = re.findall(r"(?<=class=\"fa fa-thumbs-up\"></i>)[\d\w.]+", str(val)).
The .* can be tricky, since . matches any character (except a newline, by default) and * repeats it between zero and unlimited times. It is a greedy operator, so it consumes as much as it can and only stops at the last < it can reach, which is why your single match runs across all the following spans.
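To make the difference concrete, here is a minimal sketch that runs three patterns over a short, made-up string shaped like str(val). The sample markup (including the fa-eye class) is an assumption for illustration, not the real page source, and the lazy (.*?) variant is an extra alternative that is not part of the suggestion above.

```python
import re

# Hypothetical sample resembling str(val); not copied from the real page.
sample = (
    '<span class="badge"><i aria-hidden="true" class="fa fa-eye"></i>10M</span>, '
    '<span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>404.1K</span>, '
    '<span class="badge">Entertainment</span>, '
    '<span class="badge"><i aria-hidden="true" class="fa fa-thumbs-up"></i>957.2K</span>'
)

# Greedy: .* runs to the last "<" in the string, so everything is one match.
greedy = re.findall(r'class="fa fa-thumbs-up"></i>(.*)<', sample)

# Lazy: .*? stops at the first "<" after each thumbs-up icon.
lazy = re.findall(r'class="fa fa-thumbs-up"></i>(.*?)<', sample)

# Lookbehind: match only word characters and dots that follow the icon.
lookbehind = re.findall(r'(?<=class="fa fa-thumbs-up"></i>)[\w.]+', sample)

print(len(greedy))   # 1
print(lazy)          # ['404.1K', '957.2K']
print(lookbehind)    # ['404.1K', '957.2K']
```

Note that Python's re module only accepts fixed-width lookbehinds; the one here qualifies because it is a literal string.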
CodePudding user response:
No need to use a regex if you are already using BeautifulSoup.
Extract the text from all val items that contain an i node with the fa fa-thumbs-up class:
for v in val:
    if v.find("i", attrs={'class': 'fa fa-thumbs-up'}):
        print(v.text)
Or, get them into a list:
values = [v.text for v in val if v.find("i", attrs={'class': 'fa fa-thumbs-up'})]
CodePudding user response:
You can probably skip most of the regex work: iterate over the ResultSet and only apply a regex to the entries that pass a simpler substring check:
res = []
for entry in val:
    # Cheap substring test first; only matching entries reach the regex.
    if "fa-thumbs-up" in str(entry):
        tmp = re.search(r"</i>(.*)</span>", str(entry))
        if tmp:
            res.append(tmp.group(1))
Then:
print(res[:10])
Output:
['404.1K', '957.2K', '1.4M', '1.1M', '2.8M', '2M', '1.4M', '15.3K', '777.1K', '2.5M']
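Since the goal in the question is to count the likes, the abbreviated strings still have to be turned into numbers. A rough sketch of one way to do that follows; parse_count and SUFFIXES are hypothetical names introduced here, and the "B" suffix is an assumption, since only K and M appear in the output above. Because the site shows rounded values, the total is only approximate.

```python
from decimal import Decimal

# Multipliers for the suffixes; "B" is an assumption, only K and M were observed.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_count(text):
    """Convert an abbreviated count such as '404.1K' to an integer."""
    text = text.strip()
    suffix = text[-1].upper() if text else ""
    if suffix in SUFFIXES:
        # Decimal avoids binary floating-point rounding on values like 404.1.
        return int(Decimal(text[:-1]) * SUFFIXES[suffix])
    return int(Decimal(text))


# The first ten values from the output above.
res = ['404.1K', '957.2K', '1.4M', '1.1M', '2.8M', '2M', '1.4M', '15.3K', '777.1K', '2.5M']

total = sum(parse_count(r) for r in res)
print(total)  # 13353700
```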