I'm trying to scrape images from https://en.wikipedia.org/
using the mechanize gem. When I try to calculate each image's size, I get this error:
Mechanize::ResponseCodeError (404 => Net::HTTPNotFound for https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/FP2A3620_%2823497688248%29.jpg/119px-FP2A3620_%2823497688248%29.jpg -- unhandled response)
Here is my code:
require "mechanize"

def images
  agent = Mechanize.new
  page = agent.get("https://en.wikipedia.org/")
  page.images.each do |image|
    puts image.url
    # HEAD request; Content-Length is in bytes, so this gives whole kilobytes
    size = agent.head( image )["content-length"].to_i/1000
  end
end
Any help is appreciated.
CodePudding user response:
I looked up that image on Wikipedia and it renders just fine. I opened it in a new tab and compared the URL in the browser with the one Mechanize has.
Unescaping the URL did the trick.
# Inside the page.images.each block (needs require "cgi" at the top of the file):
image_url = CGI.unescape(image.url.to_s)
size = agent.head(image_url)["content-length"].to_i/1000
Here is a working Replit.
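
To see what the unescaping step changes without hitting the network, here is a minimal sketch that runs CGI.unescape on the URL from the error message. The kilobytes helper is a hypothetical name added only for illustration; it mirrors the to_i/1000 arithmetic from the question. One caveat, offered as an assumption about your data: CGI.unescape also turns "+" into a space, so this is only safe when the image URLs contain no literal "+".

```ruby
require "cgi"

# URL copied from the error message in the question.
escaped = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/" \
          "FP2A3620_%2823497688248%29.jpg/119px-FP2A3620_%2823497688248%29.jpg"

# CGI.unescape decodes percent-escapes: %28 becomes "(" and %29 becomes ")".
unescaped = CGI.unescape(escaped)
puts unescaped

# Hypothetical helper (not part of Mechanize): converts a Content-Length
# header value in bytes to whole kilobytes. A missing header (nil) gives 0.
def kilobytes(content_length)
  content_length.to_i / 1000
end

puts kilobytes("12345")
```

In the real loop you would pass the unescaped string to agent.head and feed the "content-length" header into the same arithmetic. Note that nil.to_i is 0, so a response without Content-Length reports 0 rather than raising.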