I scraped a bunch of <a href HTML tags with URLs from a website using bs4 and wrote them into a .txt file. Now I need to extract only certain URLs from the text file. The URLs in question have a similar format:
25: <ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/
26: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
27: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
28: <ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
29: <ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
30: <ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/
31: <ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/
so as you can see, I would need a regex to match the "-recenzija-filma" part of the URL, as it is the only constant part.
import re

textfile = open("urlovi.txt", "r", encoding="utf-8")
text = textfile.read()
textfile.close()

b = re.findall(r'(https?://\S+)', text)
print(b)

c = open("urlovit.txt", "w", encoding="utf-8")
for rijec in b:
    print(rijec)
    c.write(rijec + "\n")
c.close()
So far I've found the
re.findall(r'(https?://\S+)', text)
regex, but it doesn't fit my needs. I've managed to clean up the data a bit, but not enough. So I would need someone to adapt the regex to fit the format I mentioned above, or suggest a better solution using bs4 or something similar.
CodePudding user response:
I hope this helps you!
This regex comes from the question: Regular expression to find URLs within a string
It also handles URLs like the following:
www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd
The code below catches all URLs in the text and returns them as a list.
urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)
And this will be the output:
[
'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string',
'www.google.com',
'facebook.com',
'http://test.com/method?param=wasd',
'http://test.com/method?param=wasd&params2=kjhdkjshd'
]
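For the specific URLs in the question, the same findall idea can be combined with a simple substring filter so that only the review links are kept. A minimal sketch, where the sample text mimics the lines shown in the question:

```python
import re

# Sample input resembling the lines from the question's .txt file
text = """25: <ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/
26: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
99: <ahttps://www.recenzijefilmova.com/some-other-page/"""

# First match whole URLs, then keep only those containing the constant slug
urls = re.findall(r'https?://\S+', text)
filtered = [u for u in urls if '-recenzija-filma' in u]
print(filtered)
```

This two-step approach (broad match, then filter) is often easier to maintain than one regex that tries to do both at once.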
CodePudding user response:
import re
s = [
"<ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/",
"<ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/",
"<ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/",
]
for i in s:
    _, c = re.search(r'(https?://\S+/)(\S+/)', i).groups()
    print(c)
Output:
posljednja-noc-u-sohou-recenzija-filma-2021/
pod-hipnozom-recenzija-filma-2021/
pod-hipnozom-recenzija-filma-2021/
ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/
ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/
Maybe you can use a capture group ( ) to catch just the part you need.
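Applied to the question's goal, a capture group anchored on the constant "-recenzija-filma" slug could look like this (a sketch; the input line is copied from the question):

```python
import re

url = "<ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/"

# The group captures the full URL only when it contains "-recenzija-filma";
# the leading "<a" debris is excluded because the match starts at "https"
m = re.search(r'(https?://\S*-recenzija-filma\S*)', url)
if m:
    print(m.group(1))
```

Lines without the slug simply produce no match, so non-review URLs are skipped for free.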
CodePudding user response:
Thank you all for trying to help! In the meantime I found a bs4 method that worked perfectly.
The code in the end is the following:
import re
from bs4 import BeautifulSoup

a = open("urlovi.txt", "r", encoding="utf-8")
html = a.read()
a.close()

soup = BeautifulSoup(html, "html.parser")
atag = soup.find_all("a")
links = [i["href"] for i in atag]

c = open("urlovit.txt", "w", encoding="utf-8")
for i in links:
    if re.match(r".*(-recenzija-filma|-recenzija-flma|-recenzija|-recenzije-filmova|-recenzije-filma|recenzija).*", i):
        print(i)
        c.write(str(i) + "\n")
c.close()
I got a clean output with everything I needed.
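For anyone who can't install bs4, the same href extraction can be done with the standard library's html.parser. A minimal sketch with made-up sample markup (the real input would come from the scraped file):

```python
import re
from html.parser import HTMLParser

# Collect href attributes from <a> tags as the parser walks the markup
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hypothetical sample markup standing in for the scraped page
page = (
    '<a href="https://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/">x</a>'
    '<a href="https://www.recenzijefilmova.com/kontakt/">y</a>'
)

parser = LinkCollector()
parser.feed(page)

# Keep only review links, mirroring the regex filter above
reviews = [u for u in parser.links if re.search(r"-recenzija", u)]
print(reviews)
```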