Need a regex to extract certain URLs from a text file

Time:05-09

I scraped a bunch of <a href> HTML tags with URLs from a website using bs4 and wrote them to a .txt file. Now I need to extract only certain URLs from that text file. The URLs in question have a similar format:

25: <ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/
26: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
27: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
28: <ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
29: <ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
30: <ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/
31: <ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/

As you can see, I need a regex that matches the "-recenzija-filma" part of the URL, since it is the only constant.

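import re

with open("urlovi.txt", "r", encoding="utf-8") as f:
    textfile = f.read()

# Grab every run of non-whitespace characters that starts with http:// or https://
b = re.findall(r'(https?://\S+)', textfile)
print(b)

# Write one URL per line; the with block closes the file automatically
with open("urlovit.txt", "w", encoding="utf-8") as c:
    for rijec in b:
        print(rijec)
        c.write(rijec + "\n")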

So far I've found this regex:

re.findall(r'(https?://\S+)', textfile)

It doesn't fit my needs: it cleaned up the data a bit, but not enough. Could someone adapt the regex to the format shown above? A better solution using bs4 or something else would also be welcome.

CodePudding user response:

I hope this helps you!

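See the question "Regular expression to find URLs within a string". Suppose text contains the following:

The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some URLs: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd

The code below finds every URL in text and returns them as a list:

import re
# text is the sample string shown above
urls = re.findall(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)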

And this will be the output:

[
'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string', 
'www.google.com', 
'facebook.com',
'http://test.com/method?param=wasd',
'http://test.com/method?param=wasd&params2=kjhdkjshd'
]

CodePudding user response:

import re
s = [
"<ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/",
"<ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/",
"<ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/",
"<ahttps://www.recenzijefilmova.com/ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/",
]

for i in s:
    # Group 1 is the scheme and host, group 2 is the path segment after it
    _, c = re.search(r'(https?://\S+/)(\S+/)', i).groups()
    print(c)

output
posljednja-noc-u-sohou-recenzija-filma-2021/
pod-hipnozom-recenzija-filma-2021/
pod-hipnozom-recenzija-filma-2021/
ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
ulica-straha-3-dio-1666-fear-street-part-three-1666-2021-recenzija-filma/
ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/
ulica-straha-2-dio-1978-fear-street-part-two-2021-recenzija-filma/

You can use parentheses () to capture the part of the URL you need as a group.
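If the goal is the full URL rather than just the last path segment, a pattern that requires "-recenzija-filma" somewhere inside the URL may be closer to what the question asks for. This is only a minimal sketch: the sample text copies the format from the question, and the "kontakt" line is a made-up non-review URL added to show that it gets skipped.

```python
import re

# Sample lines in the format shown in the question.
# The last line is a hypothetical URL that should NOT be matched.
text = """25: <ahttps://www.recenzijefilmova.com/posljednja-noc-u-sohou-recenzija-filma-2021/
26: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
27: <ahttps://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/
28: <ahttps://www.recenzijefilmova.com/kontakt/
"""

# Match a URL only if "-recenzija-filma" appears somewhere in it.
# \S* cannot cross whitespace, so each match stays on its own line.
pattern = re.compile(r"https?://\S*-recenzija-filma\S*")

# dict.fromkeys drops duplicates while keeping first-seen order.
review_urls = list(dict.fromkeys(pattern.findall(text)))

for url in review_urls:
    print(url)
```

Since the scraped file lists most links twice, the dict.fromkeys step is there to keep a single copy of each; drop it if duplicates matter.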

CodePudding user response:

Thank you all for trying to help! In the meantime I found a bs4 method that worked perfectly.

The final code is:

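import re

from bs4 import BeautifulSoup

with open("urlovi.txt", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Collect the href of every <a> tag that has one
links = [i["href"] for i in soup.find_all("a", href=True)]

pattern = r".*(-recenzija-filma|-recenzija-flma|-recenzija|-recenzije-filmova|-recenzije-filma|recenzija).*"
# Output file name taken from the question
with open("urlovit.txt", "w", encoding="utf-8") as c:
    for i in links:
        if re.match(pattern, i):
            print(i)
            c.write(i + "\n")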

I got a clean output with everything I needed.
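A side note on the pattern: the bare "recenzija" alternative already covers every variant that contains that word, so the alternation can probably be shortened. The sketch below compares the two on a few sample links; the second and third links are hypothetical and only there for illustration.

```python
import re

# Sample hrefs standing in for the links list; only the first one
# comes from the question, the other two are made up.
links = [
    "https://www.recenzijefilmova.com/pod-hipnozom-recenzija-filma-2021/",
    "https://www.recenzijefilmova.com/najbolji-horori-recenzije-filmova/",
    "https://www.recenzijefilmova.com/kontakt/",
]

# The pattern from the final code, used with re.match.
original = re.compile(
    r".*(-recenzija-filma|-recenzija-flma|-recenzija|"
    r"-recenzije-filmova|-recenzije-filma|recenzija).*"
)

# Shorter form: "recenzija" anywhere, or one of the two "-recenzije-" variants.
# re.search scans the whole string, so no leading or trailing .* is needed.
shorter = re.compile(r"recenzija|-recenzije-film(?:ova|a)")

matches_original = [u for u in links if original.match(u)]
matches_shorter = [u for u in links if shorter.search(u)]

print(matches_shorter)
print(matches_original == matches_shorter)
```

On these samples both patterns should pick the same links. The domain itself contains "recenzije" rather than "recenzija", which is why the kontakt link is not matched.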
