Python regex matching URLs

Time:11-22

I have a text file containing a list of URLs, each followed by some unwanted text. I wrote a regex that extracts the URLs, and it works fine, but every result is printed wrapped in list brackets and quotes (['']). Examples are below.

The file contains a list of URLs:

http://www.example.com/52                   (Status: 403) [Size: 919]
http://www.example.com/details              (Status: 403) [Size: 919]
http://www.example.com/h                    (Status: 403) [Size: 919]
http://www.example.com/affiliate            (Status: 403) [Size: 919]
http://www.example.com/56                   (Status: 403) [Size: 919]

The regex I used is: "^[://.a-zA-Z0-9-_]*"

The output is as follows:

['http://www.example.com/52']
['http://www.example.com/details']
['http://www.example.com/h']
['http://www.example.com/affiliate']
['http://www.example.com/56']

I need the output to be like the following:

http://www.example.com/52
http://www.example.com/details
http://www.example.com/h
http://www.example.com/affiliate
http://www.example.com/56

The code I used is below:

import re

with open("test.txt","r") as test:
    for i in test:
        x = re.findall("^[://.a-zA-Z0-9-_]*",i)
        print(x)

CodePudding user response:

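findall returns a list of strings, which is why every line is printed with brackets and quotes. You can either print the first element of the result with print(x[0]), or use match instead, since there is only one URL per line. match only matches at the beginning of the string, so the leading ^ is no longer needed.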

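import re

with open("test.txt", "r") as test:
    for i in test:
        # match() anchors at the start of the line and returns a match object
        x = re.match(r"[://.a-zA-Z0-9-_]*", i)
        # group(0) is the matched text as a plain string, not a list
        print(x.group(0))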
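
To see the difference between the two calls without needing test.txt, here is a minimal sketch that runs on two of the sample lines from the question. The helper name extract_urls and the sample list are made up for illustration, and the hyphen in the character class has been moved to the end so it is clearly a literal; otherwise the pattern is the one from the question.

```python
import re

# Pattern from the question, with "-" moved to the end of the class
# so it cannot be read as part of a range.
URL_PATTERN = re.compile(r"[:/.a-zA-Z0-9_-]*")


def extract_urls(lines):
    """Hypothetical helper: return the leading URL-like token of each line."""
    urls = []
    for line in lines:
        m = URL_PATTERN.match(line)  # match() only looks at the start of the line
        if m and m.group(0):         # skip empty matches, e.g. blank lines
            urls.append(m.group(0))
    return urls


# Two sample lines copied from the question (assumed to end with a newline,
# as they would when read from a file).
sample = [
    "http://www.example.com/52                   (Status: 403) [Size: 919]\n",
    "http://www.example.com/details              (Status: 403) [Size: 919]\n",
]

# findall() always returns a list, even when there is a single match.
print(re.findall(r"^[:/.a-zA-Z0-9_-]*", sample[0]))  # ['http://www.example.com/52']

# match().group(0) gives the plain string.
for url in extract_urls(sample):
    print(url)
```

Collecting the strings in a list and printing them one by one gives the bare URLs the question asks for, and the empty-match check keeps blank lines in the file from printing as empty output.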