I'm new to Python and BeautifulSoup (BS). I have a text file in which every line has the following format, and I want to extract the image URLs from these lines using BS. The file is plain text, not HTML.
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
The following code doesn't do anything and just hangs; how do I fix this?
from bs4 import BeautifulSoup

def readFile(fileName):
    # Parse the whole file as HTML and print the src of every <img> tag.
    with open(fileName, 'r') as fp:
        soup = BeautifulSoup(fp.read(), 'html.parser')
    images = soup.find_all('img')
    print("images: ", images)
    for image in images:
        print(image['src'])

readFile("./imagefile.txt")
User response:
Since your input data is not in HTML format, I don't think BeautifulSoup is the way to go, though I will be happy to be wrong about that. I would start with the re module as a first step.
import re

text = '''
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
'''

for url in re.findall(r"<img[^>]* src=\"([^\"]*)\"[^>]*>", text):
    print(url)
Should give you:
https://example.com/img1.jpg
https://example.com/img2.jpg
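To run the same pattern against the file from the question instead of an inline string, something along these lines might work. This is only a minimal sketch: extract_image_urls is a hypothetical helper name, and it assumes the file is UTF-8 text.

```python
import re

# Same pattern as above, compiled once so it can be reused for every line.
IMG_SRC = re.compile(r"<img[^>]* src=\"([^\"]*)\"[^>]*>")


def extract_image_urls(file_name):
    """Return every <img> src value found in the text file, in file order.

    Hypothetical helper for illustration; assumes UTF-8 encoded text.
    """
    urls = []
    with open(file_name, "r", encoding="utf-8") as fp:
        # Read line by line so the whole file is never held in memory at once.
        for line in fp:
            urls.extend(IMG_SRC.findall(line))
    return urls
```

Calling extract_image_urls("./imagefile.txt") on a file containing the two sample lines should return a list with the two URLs shown above.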
About the pattern: <img[^>]* src=\"([^\"]*)\"[^>]*>
<img | matches the characters "<img" literally
[^>]* | matches any character other than ">", zero or more times (allows other attributes before src)
 src=\" | matches a space followed by the characters src=" literally (\" is just an escaped double quote)
( | starts the capture group
[^\"]* | matches any character other than a double quote, zero or more times (this is the URL)
) | ends the capture group
\" | matches the closing double quote
[^>]* | matches any character other than ">", zero or more times (allows other attributes after src)
> | matches the ">" that closes the tag
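The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which lets a regex carry inline comments. The following is a sketch rather than a drop-in replacement: IMG_SRC is just an illustrative name, and the space before src is written as [ ] because verbose mode ignores bare whitespace in the pattern.

```python
import re

IMG_SRC = re.compile(
    r"""
    <img        # the characters "<img" literally
    [^>]*       # anything except ">", zero or more times (attributes before src)
    [ ]src="    # a space, then src= and the opening double quote
    ([^"]*)     # capture group: everything up to the next double quote
    "           # the closing double quote
    [^>]*       # anything except ">", zero or more times (attributes after src)
    >           # the ">" that closes the tag
    """,
    re.VERBOSE,
)

sample = 'something <img alt="x" src="https://example.com/img1.jpg" width="3">'
print(IMG_SRC.findall(sample))
```

Note that this pattern, like the original, is case-sensitive and only handles double-quoted src values, so markup such as <IMG SRC='...'> would not match.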