How to extract all image urls from local text file?

Time:12-13

I'm new to Python and BeautifulSoup (BS). I have a text file where each line is in the following format, and I want to extract the image URLs from these lines using BS. This is a plain text file, not an HTML document.

something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >

The following code doesn't do anything and just hangs; how do I fix this?

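from bs4 import BeautifulSoup

def readFile(fileName):
    with open(fileName, 'r') as fp:
        # Parse the whole file as if it were HTML
        soup = BeautifulSoup(fp.read(), 'html.parser')
        images = soup.find_all('img')
        print("images: ", images)

        # Print the src attribute of every <img> tag found
        for image in images:
            print(image['src'])

readFile("./imagefile.txt")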

Answer:

Since your input data is not HTML, I don't think BeautifulSoup is the way to go, though I would be happy to be wrong about that. I would start with Python's built-in re (regular expression) module as a first step.

import re

text = '''
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
'''

for url in re.findall(r"<img[^>]* src=\"([^\"]*)\"[^>]*>", text):
    print(url)

Should give you:

https://example.com/img1.jpg
https://example.com/img2.jpg
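
Since the question is about a local file rather than an inline string, here is a minimal sketch of the same idea applied to a file. The function name extract_image_urls and the UTF-8 encoding are assumptions for illustration, not something from the question. It reads line by line, so it assumes each <img ...> tag sits entirely on one line, as in the sample data.

```python
import re

# Same pattern as in the answer above, compiled once so it can be reused per line.
IMG_SRC = re.compile(r"<img[^>]* src=\"([^\"]*)\"[^>]*>")


def extract_image_urls(path):
    """Return every <img> src value found in the text file at `path`.

    `extract_image_urls` is a hypothetical helper name; UTF-8 is an assumed encoding.
    """
    urls = []
    with open(path, "r", encoding="utf-8") as fp:
        # Iterate line by line so a large file is never loaded all at once.
        for line in fp:
            urls.extend(IMG_SRC.findall(line))
    return urls
```

Calling extract_image_urls("./imagefile.txt") on the file from the question should return the URLs as a list, which you can then print or process further.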

About the Pattern: <img[^>]* src=\"([^\"]*)\"[^>]*>

<img    | matches the characters "<img" literally
[^>]*   | matches any character other than ">", zero or more times (allows other attributes before src)
 src=\" | matches a space followed by src=" literally (the backslash only escapes the quote)
(       | starts the capture group
[^\"]*  | matches any character other than a double quote, zero or more times (this is the URL)
)       | ends the capture group
\"      | matches the closing quote
[^>]*   | matches any character other than ">", zero or more times (allows other attributes after src)
>       | matches the ">" that ends the tag
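
If you want that breakdown to live next to the code, one option is to write the pattern with re.VERBOSE, which lets the regex carry its own comments. This is only a sketch: the name IMG_SRC_VERBOSE is made up for illustration, and the leading space before src is written as [ ] because verbose mode ignores bare whitespace in the pattern.

```python
import re

# Hypothetical name; intended to match the same text as the one-line pattern above.
IMG_SRC_VERBOSE = re.compile(
    r"""
    <img        # the characters "<img" literally
    [^>]*       # anything except ">" (other attributes before src)
    [ ]src="    # a space, then src=" literally
    (           # start of the capture group
      [^"]*     # anything except a double quote: the URL
    )           # end of the capture group
    "           # the closing quote
    [^>]*       # anything except ">" (other attributes after src)
    >           # the ">" that ends the tag
    """,
    re.VERBOSE,
)

text = '''
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
'''

for url in IMG_SRC_VERBOSE.findall(text):
    print(url)
```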