Parallel-process image scraping with bing_image_downloader in google colab


I have a pandas data frame (named df) that looks like below:

search_term  fname
banana       fldr1
kiwi         fldr2
coffee.      fldr3
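For reference, a DataFrame with this shape can be reconstructed from the table above (values, including the trailing period on `coffee.`, copied verbatim):

```python
import pandas as pd

# Rebuild the example DataFrame shown above
df = pd.DataFrame({
    "search_term": ["banana", "kiwi", "coffee."],
    "fname": ["fldr1", "fldr2", "fldr3"],
})
```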

and I'm using the following Python code to scrape images from Bing using each search_term, saving the images into the folder named in fname.

import os
from bing_image_downloader import downloader

for index, row in df.iterrows():
    print(row['search_term'])
    downloader.download(row['search_term'], limit=200,  output_dir="FOLDERX", adult_filter_off=True, force_replace=False, timeout=60)
    os.rename(os.path.join("FOLDERX",row['search_term']), os.path.join("FOLDERX",row['fname']))

But I'd like to run this in parallel, since I have a lot of search terms to go through. For example, if there are 10 search terms, I'd like a parallel run with 2 jobs to split the terms into two groups and scrape images simultaneously. I'm running this in Google Colab, and so far I have tried:

import multiprocessing
from joblib import Parallel, delayed

def scrape_bing(df):
  for index, row in df.iterrows():
    print(row['search_term'])
    downloader.download(row['search_term'], limit=200,  output_dir="FOLDERX", adult_filter_off=True, force_replace=False, timeout=60)
    os.rename(os.path.join("FOLDERX",row['search_term']), os.path.join("FOLDERX",row['fname']))

Parallel(n_jobs=2)(delayed(scrape_bing)(i, j) for i in range(5) for j in range(2))

But I don't know how to modify the arguments to `delayed` to make it work. Help, please?

CodePudding user response:

There is no need for extra indices when iterating: simplify scrape_bing so it takes one row's values as arguments, and iterate over the DataFrame rows directly inside the Parallel call.

def scrape_bing(search_term, fname):
    downloader.download(search_term, limit=200,  output_dir="FOLDERX", adult_filter_off=True, force_replace=False, timeout=60)
    os.rename(os.path.join("FOLDERX", search_term), os.path.join("FOLDERX", fname))


Parallel(n_jobs=2)(delayed(scrape_bing)(row['search_term'], row['fname']) for index, row in df.iterrows())
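As a self-contained sketch of this pattern (the download-and-rename body is stubbed out here so it runs without network access; in your notebook it would call `downloader.download` and `os.rename` as above), note that `prefer="threads"` is a reasonable choice since the real work is I/O-bound:

```python
import pandas as pd
from joblib import Parallel, delayed

df = pd.DataFrame({
    "search_term": ["banana", "kiwi", "coffee."],
    "fname": ["fldr1", "fldr2", "fldr3"],
})

def scrape_bing(search_term, fname):
    # Stub: replace this body with downloader.download(...) and os.rename(...)
    return (search_term, fname)

# Parallel returns results in the same order as the input tasks
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(scrape_bing)(row["search_term"], row["fname"])
    for _, row in df.iterrows()
)
```

Each `delayed(scrape_bing)(...)` call packages one row's arguments as a task; joblib then dispatches the tasks across the two workers for you, so there is no need to split the DataFrame manually.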