I have used the wikipedia package to get a list of image URLs from any Wikipedia page:
import wikipedia
et_page = wikipedia.page("Summer")
images = et_page.images
Now I want to save all the images from the images variable to MongoDB, in a collection named images.
import pymongo
from PIL import Image
import io
client = pymongo.MongoClient("mongodb+srv://<user>:<password>@cluster0.lfrg6.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
database_name = 'test'
database = client[database_name]
collection = 'images'
image_collection = database[collection]
Is there any way to do it? Since there are multiple images, can they be saved in a list format?
CodePudding user response:
It is not best to use MongoDB as an arbitrary blob data store, especially for large images; thumbnails and small infographics are fine. But the OP seeks to understand how it could be done, and the best way is to use GridFS. GridFS is part of the pymongo environment, so if you can import pymongo you can import gridfs. Here is a working example:
import wikipedia
import pymongo
import gridfs
from urllib.request import urlopen
connstr = "mongodb://yourInfoHere"
client = pymongo.MongoClient(connstr)
database = client.testX
# This will create two collections that are under control
# of the gridfs object, images.chunks and images.files. Do
# not go to these collections directly; use the gridfs
# methods instead. The choice of "images" is arbitrary; you
# can use any name you wish. gridfs will add .chunks and .files
# to the real collection names.
# Docs are here
# https://pymongo.readthedocs.io/en/stable/api/gridfs/index.html#module-gridfs
gfs = gridfs.GridFS(database, collection="images")
page_name = "Summer"
print("capturing URLs to images on page",page_name)
et_page = wikipedia.page(page_name)
images = et_page.images
n = 0
for ii in images:
    print("processing", ii)
    f = urlopen(ii)
    # put() "inserts" the file-like object into the gfs subsystem
    # and returns an ID.
    file_id = gfs.put(f)
    # Make up a name and capture it AND the gridfs ID in a
    # regular collection, called imageMeta here but it is
    # any name you like. It is not strictly necessary to do this
    # and it is completely separate from gridFS but you will almost
    # always have a need to capture some metadata around the pix.
    name = "IMAGE_" + str(n)
    database.imageMeta.insert_one({"name": name, "fileId": file_id})
    n += 1
# Here is an alternate solution where only 1 imageMeta doc is written
# but with arrays of image info. You STILL need to push each image
# individually into gridfs:
n = 0
info = []
for ii in images:
    print("processing", ii)
    f = urlopen(ii)
    # put() "inserts" the file-like object into the gfs subsystem
    # and returns an ID.
    file_id = gfs.put(f)
    # Make up a name and capture it AND the gridfs ID in a
    # regular collection, called imageMeta here but it is
    # any name you like.
    name = "IMAGE_" + str(n)
    info.append({"name": name, "fileId": file_id})
    n += 1
database.imageMeta.insert_one({"page": page_name, "imageInfo": info})
# Here is how you can get your images out. Let's pick
# IMAGE_0 for example but obviously any query criteria on the
# imageMeta docs is valid:
doc = database.imageMeta.find_one({"name": "IMAGE_0"})
gg = gfs.get(doc['fileId'])
with open('foo.jpg', 'wb') as wf:
    wf.write(gg.read())  # Nice read/write slurp
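One practical note: et_page.images often returns URLs for SVG charts, logos, and other non-photo media alongside the photos, so you may want to filter the list before the gridfs upload loop. Here is a minimal sketch of such a filter; the extension list is an assumption and you should adjust it to taste, and the sample URLs are hypothetical:

```python
def filter_image_urls(urls, extensions=(".jpg", ".jpeg", ".png", ".gif")):
    """Keep only URLs whose path ends in one of the given extensions.

    str.endswith() accepts a tuple, so one call checks all extensions.
    """
    return [u for u in urls if u.lower().endswith(extensions)]

# Hypothetical sample: one raster image, one SVG that would be skipped.
sample = [
    "https://upload.wikimedia.org/wikipedia/commons/a/ab/Sunflower.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/4/4a/Chart.svg",
]
print(filter_image_urls(sample))  # only the .jpg URL survives
```

You would then loop over filter_image_urls(images) instead of images; everything else in the gridfs code stays the same.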