Why does Python's regex .* take all the remaining characters instead of stopping at a certain character?

I want to write a program that counts the likes of a YouTube channel. This is my code.

import re
import requests 
from bs4 import BeautifulSoup

r = requests.get("https://filmot.com/channel/UCX6OQ3DkcsbYNE6H8uQQuVA")

soup = BeautifulSoup(r.text , "html.parser")
val=soup.find_all("span",attrs={"class":"badge"})
res = re.findall(r"class=\"fa fa-thumbs-up\"></i>(.*)\<" , str(val))

print(res)

But it returns this result instead of the individual like counts:

['404.1K</span>, <span >Entertainment</span>, <span >8m1s</span>, <span >18 Dec 2021</span>, <span ><i aria-hidden="true" ></i>10M</span>, <span ><i aria-hidden="true" ></i>957.2K</span>, <span >Entertainment</span>, <span >12m9s</span>, <span >16 Dec 2021</span>, <span ><i aria-hidden="true" ></i>14.6M</span>, <span ><i aria-hidden="true" ></i>1.4M</span>, <span >Entertainment</span>, <span >12m4s</span>, <span >10 Dec 2021</span>, <span ><i aria-hidden="true" ></i>11.3M</span>, <span ><i aria-hidden="true" ></i>1.1M</span>, <span ><i aria-hidden="true" ></i>5.1K</span>, <span >Entertainment</span>, <span >11m1s</span>, <span >24 Nov 2021</span>, <span ><i aria-hidden="true" ></i>17.5M</span>, <span ><i aria-hidden="true" ></i>2.8M</span>, <span ><i aria-hidden="true" ></i>3.5K</span>, <span >Entertainment</span>, <span >25m41s</span>, <span >29 Oct 2021</span>, <span ><i aria-hidden="true" ></i>17M</span>, <span ><i aria-hidden="true" ></i>2M</span>, <span ><i aria-hidden="true" ></i>6K</span>, <span >Entertainment</span>, <span >4m55s</span>, <span >23 Oct 2021</span>, <span ><i aria-hidden="true" ></i>19.4M</span>, <span ><i aria-hidden="true" ></i>1.4M</span>, <span ><i aria-hidden="true" ></i>12.5K</span>, <span >Entertainment</span>, <span >15m42s</span>, <span >12 Oct 2021</span>, <span ><i aria-hidden="true" ></i>127.7K</span>, <span ><i aria-hidden="true" ></i>15.3K</span>, <span >Entertainment</span>, <span >5m20s</span>, <span >26 Sep 2021</span>, <span ><i aria-hidden="true" ></i>7.7M</span>, <span ><i aria-hidden="true" ></i>777.1K</span>, <span ><i aria-hidden="true" ></i>6.1K</span>, <span >Entertainment</span>, <span >8m2s</span>, <span >04 Sep 2021</span>, <span ><i aria-hidden="true" ></i>48.4M</span>, <span ><i aria-hidden="true" ></i>2.5M</span>, <span ><i aria-hidden="true" ></i>24.1K</span>, <span >Entertainment</span>, <span >12m40s</span>, <span >31 Aug 2021</span>, <span ><i aria-hidden="true" ></i>69.8M</span>, <span ><i 
aria-hidden="true" ></i>3M</span>, <span ><i aria-hidden="true" ></i>38.6K</span>, <span >Entertainment</span>, <span >19m25s</span>, <span >07 Aug 2021</span>, <span ><i aria-hidden="true" ></i>53.3M</span>, <span ><i aria-hidden="true" ></i>2.2M</span>, <span ><i aria-hidden="true" ></i>29.1K</span>, <span >Entertainment</span>, <span >16m40s</span>, <span >24 Jul 2021</span>, <span ><i aria-hidden="true" ></i>44.6M</span>, <span ><i aria-hidden="true" ></i>1.7M</span>, <span ><i aria-hidden="true" ></i>21.4K</span>, <span >Entertainment</span>, <span >10m45s</span>, <span >10 Jul 2021</span>, <span ><i aria-hidden="true" ></i>42.2M</span>, <span ><i aria-hidden="true" ></i>1.7M</span>, <span ><i aria-hidden="true" ></i>24.1K</span>, <span >Entertainment</span>, <span >11m34s</span>, <span >26 Jun 2021</span>, <span ><i aria-hidden="true" ></i>53.6M</span>, <span ><i aria-hidden="true" ></i>1.8M</span>, <span ><i aria-hidden="true" ></i>30.6K</span>, <span >Entertainment</span>, <span >12m33s</span>, <span >12 Jun 2021</span>, <span ><i aria-hidden="true" ></i>49.5M</span>, <span ><i aria-hidden="true" ></i>1.9M</span>, <span ><i aria-hidden="true" ></i>29.2K</span>, <span ....

I tested it on the regex101.com site and the result was correct. You can see that in this image: [screenshot of the regex101.com test]

CodePudding user response：

If you want to use regex, a positive lookbehind would be best in such a case, e.g. (?<=class=\"fa fa-thumbs-up\"></i>)[\w.]+ as in res = re.findall(r"(?<=class=\"fa fa-thumbs-up\"></i>)[\w.]+", str(val)). The .* can be tricky, since . matches any character except a newline and * repeats it between zero and unlimited times. Because * is a greedy quantifier, .* consumes as much as it can and then backs off only as far as the last < on the line, which is why the first match swallows everything up to the final tag.
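To make the greedy behaviour concrete, here is a minimal sketch that compares the greedy pattern with a lazy quantifier, a negated character class, and the lookbehind. The html string is a made-up sample shaped like str(val) from the question, not the real page markup, and the variable names are only illustrative.

```python
import re

# Hypothetical sample shaped like str(val) in the question; not the real page markup.
html = (
    '<span class="badge"><i class="fa fa-thumbs-up"></i>404.1K</span>, '
    '<span class="badge">Entertainment</span>, '
    '<span class="badge"><i class="fa fa-thumbs-up"></i>957.2K</span>'
)

prefix = r'class="fa fa-thumbs-up"></i>'

# Greedy: .* runs to the end of the line, then backtracks only to the LAST "<".
greedy = re.findall(prefix + r"(.*)<", html)

# Lazy: .*? stops at the FIRST "<" after each prefix.
lazy = re.findall(prefix + r"(.*?)<", html)

# Negated character class: capture everything that is not "<".
negated = re.findall(prefix + r"([^<]+)", html)

# Fixed-width positive lookbehind, then one or more word characters or dots.
lookbehind = re.findall(r"(?<=" + prefix + r")[\w.]+", html)

print(len(greedy), greedy[0][:20])
print(lazy, negated, lookbehind)
```

On this sample the greedy pattern returns a single long match that starts at the first like count, while the other three patterns each return the two counts separately.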

CodePudding user response：

No need to use a regex if you are already using BeautifulSoup.

Extract the text from all val items that contain an i node with the fa fa-thumbs-up class:

for v in val:
    if v.find("i", attrs={'class': 'fa fa-thumbs-up'}):
        print(v.text)

Or, get them into a list:

values = [v.text for v in val if v.find("i", attrs={'class': 'fa fa-thumbs-up'})]

CodePudding user response：

You can probably skip most of the regex work: iterate over the ResultSet and apply a regex only to the entries that pass a simple substring check:

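res = []
for entry in val:
    html = str(entry)  # serialize the tag once
    # Only badges containing the thumbs-up icon hold a like count
    if "fa-thumbs-up" in html:
        # Greedy .* is safe here because each entry is a single span
        tmp = re.search(r"</i>(.*)</span>", html)
        if tmp:
            res.append(tmp.group(1))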

Then:

print(res[:10])

Output:

['404.1K', '957.2K', '1.4M', '1.1M', '2.8M', '2M', '1.4M', '15.3K', '777.1K', '2.5M']
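Since the stated goal is to count likes, the abbreviated strings still need to be converted to numbers before they can be summed. The following is only a sketch: parse_count and SUFFIXES are hypothetical names, the "B" suffix is an assumption that does not appear in the output above, and the total is approximate because the page already rounds each value.

```python
from decimal import Decimal

# Multipliers for the suffixes seen in the output above; "B" is an assumption.
SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_count(text):
    """Convert an abbreviated count such as '404.1K' to an int (hypothetical helper)."""
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        # Decimal avoids binary floating-point rounding on values like 404.1
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1].upper()])
    return int(Decimal(text))


# The first ten values from the output above
res = ['404.1K', '957.2K', '1.4M', '1.1M', '2.8M', '2M', '1.4M', '15.3K', '777.1K', '2.5M']

total = sum(parse_count(item) for item in res)
print(total)
```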