Below is a program that makes multiple GET requests and writes the response images to my directory. The requests are meant to run in separate threads, and therefore finish faster than without threads, but I'm not seeing any performance difference.
Printing active_count() shows that 9 threads are created. However, the run still takes around 40 seconds whether or not I use threading.
Here is the version with threading:
from threading import active_count
import requests
import time
import concurrent.futures
img_urls = [
    'https://images.unsplash.com/photo-1516117172878-fd2c41f4a759',
    'https://images.unsplash.com/photo-1532009324734-20a7a5813719',
    'https://images.unsplash.com/photo-1524429656589-6633a470097c',
    'https://images.unsplash.com/photo-1530224264768-7ff8c1789d79',
    'https://images.unsplash.com/photo-1564135624576-c5c88640f235',
    'https://images.unsplash.com/photo-1541698444083-023c97d3f4b6',
    'https://images.unsplash.com/photo-1522364723953-452d3431c267',
    'https://images.unsplash.com/photo-1513938709626-033611b8cc03',
    'https://images.unsplash.com/photo-1507143550189-fed454f93097',
    'https://images.unsplash.com/photo-1493976040374-85c8e12f0c0e',
    'https://images.unsplash.com/photo-1504198453319-5ce911bafcde',
    'https://images.unsplash.com/photo-1530122037265-a5f1f91d3b99',
    'https://images.unsplash.com/photo-1516972810927-80185027ca84',
    'https://images.unsplash.com/photo-1550439062-609e1531270e',
    'https://images.unsplash.com/photo-1549692520-acc6669e2f0c'
]
t1 = time.perf_counter()
def download_image(img_url):
    img_bytes = requests.get(img_url).content
    img_name = img_url.split('/')[3]
    img_name = f'{img_name}.jpg'
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes)
    print(f'{img_name} was downloaded...')

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(download_image, img_urls)

print(active_count())
t2 = time.perf_counter()
print(f'Finished in {t2-t1} seconds')
And here is the version without threading:
def download_image(img_url):
    img_bytes = requests.get(img_url).content
    img_name = img_url.split('/')[3]
    img_name = f'{img_name}.jpg'
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes)
    print(f'{img_name} was downloaded...')

for img_url in img_urls:
    download_image(img_url)
Could someone explain why this is happening? Thanks
CodePudding user response:
This is the result I got with your piece of code, with start and end times printed next to each download. The overall time is about the same (on my "normal" network, not the slow one I mentioned in my comment).
The reason is that multiple threads don't increase I/O throughput or bandwidth, and the limitation could also be the website itself. The issue does not look like it comes from your code.
EDIT (misleading statement): as MisterMiyagi points out in the comment below (read his comment; he explains why), threading should improve I/O utilization. That is why I see roughly a 10-second improvement on a slow network (a limited connection in my work lab). It does not increase I/O or bandwidth in this specific case (with full bandwidth on my "normal" connection), which could have many causes, but in my opinion the code itself is not one of them.
I also tried with max_workers=5; the overall time was the same.
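The I/O-bound point can be demonstrated without any network at all. The sketch below (all names illustrative, not from the original code) uses time.sleep as a stand-in for the download wait; like a blocking socket read, sleep releases the GIL, so threads can wait concurrently:

```python
import time
import concurrent.futures

def fake_download(delay: float) -> float:
    # time.sleep releases the GIL, just like waiting on a socket does,
    # so several simulated "downloads" can wait at the same time.
    time.sleep(delay)
    return delay

delays = [0.2] * 5  # five simulated downloads of 0.2 s each

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(fake_download, delays))
elapsed = time.perf_counter() - start

# Sequentially this would take roughly 1.0 s; threaded, the waits
# overlap and the total is close to a single 0.2 s delay.
print(f'Elapsed: {elapsed:.2f} s')
```

If the elapsed time here shrinks but your real downloads do not, the bottleneck is bandwidth or the server, not the threading code.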
photo-1516117172878-fd2c41f4a759.jpg was downloaded... 1.0464828 - 1.7136098
photo-1532009324734-20a7a5813719.jpg was downloaded... 1.7140197 - 5.6327612
photo-1524429656589-6633a470097c.jpg was downloaded... 5.6339666 - 8.3146478
photo-1530224264768-7ff8c1789d79.jpg was downloaded... 8.3160157 - 10.474087
photo-1564135624576-c5c88640f235.jpg was downloaded... 10.4749598 - 11.2431941
photo-1541698444083-023c97d3f4b6.jpg was downloaded... 11.2436369 - 15.6939695
photo-1522364723953-452d3431c267.jpg was downloaded... 15.6954112 - 18.3257819
photo-1513938709626-033611b8cc03.jpg was downloaded... 18.3269668 - 21.0607191
photo-1507143550189-fed454f93097.jpg was downloaded... 21.0621265 - 22.2371699
photo-1493976040374-85c8e12f0c0e.jpg was downloaded... 22.2375931 - 26.4375676
photo-1504198453319-5ce911bafcde.jpg was downloaded... 26.4393404 - 28.3477933
photo-1530122037265-a5f1f91d3b99.jpg was downloaded... 28.348679 - 30.4626719
photo-1516972810927-80185027ca84.jpg was downloaded... 30.4636931 - 32.2621345
photo-1550439062-609e1531270e.jpg was downloaded... 32.2628976 - 34.7331719
photo-1549692520-acc6669e2f0c.jpg was downloaded... 34.7341393 - 35.5910094
Finished in 34.545366900000005 seconds
21
photo-1516117172878-fd2c41f4a759.jpg was downloaded... 35.5960486 - 46.1692758
photo-1564135624576-c5c88640f235.jpg was downloaded... 35.6110777 - 47.3780254
photo-1507143550189-fed454f93097.jpg was downloaded... 35.6265503 - 47.4433963
photo-1549692520-acc6669e2f0c.jpg was downloaded... 35.6692061 - 49.7097683
photo-1516972810927-80185027ca84.jpg was downloaded... 35.6420564 - 57.2326763
photo-1504198453319-5ce911bafcde.jpg was downloaded... 35.6340008 - 61.4597509
photo-1550439062-609e1531270e.jpg was downloaded... 35.6637577 - 62.0488296
photo-1530224264768-7ff8c1789d79.jpg was downloaded... 35.6072146 - 63.4139648
photo-1513938709626-033611b8cc03.jpg was downloaded... 35.6223106 - 63.8149815
photo-1524429656589-6633a470097c.jpg was downloaded... 35.6032493 - 63.8284464
photo-1530122037265-a5f1f91d3b99.jpg was downloaded... 35.6352735 - 65.0513042
photo-1522364723953-452d3431c267.jpg was downloaded... 35.6182243 - 65.5005548
photo-1532009324734-20a7a5813719.jpg was downloaded... 35.5994888 - 66.2930857
photo-1541698444083-023c97d3f4b6.jpg was downloaded... 35.6144996 - 67.8115219
photo-1493976040374-85c8e12f0c0e.jpg was downloaded... 35.6301133 - 68.5357319
Finished in 32.946069800000004 seconds
EDIT 2 (more testing): I tried with one of my web servers (same code, just a different image list) and got an overall decrease of 60-70% in download time. It worked best with a limited number of workers in that case. The problem comes from the website, not your code.
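Timestamps like those in the log above can be collected with a small wrapper around each call; this is only a sketch (timed and fake_download are illustrative stand-ins, not the original functions):

```python
import time
import concurrent.futures

def timed(func, *args):
    # Record perf_counter before and after the call, matching the
    # "start - end" columns in the log above.
    start = time.perf_counter()
    result = func(*args)
    end = time.perf_counter()
    return result, start, end

def fake_download(name: str) -> str:
    time.sleep(0.05)  # stand-in for the real network request
    return name

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(timed, fake_download, f'img{i}') for i in range(3)]
    rows = [fut.result() for fut in futures]

for name, start, end in rows:
    print(f'{name} was downloaded... {start:.7f} - {end:.7f}')
```

Overlapping start/end intervals in the output indicate the downloads really did run concurrently.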
CodePudding user response:
I can see some performance improvement when using the multiprocessing package.
import multiprocessing
import time
from multiprocessing import Pool

import requests

# img_urls is the same list as in the question

def download_image(img_url: str) -> None:
    img_bytes = requests.get(img_url).content
    img_name = img_url.split('/')[3]
    img_name = f'{img_name}.jpg'
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes)
    print(f'{img_name} was downloaded...')

if __name__ == '__main__':
    t1 = time.perf_counter()
    with Pool(processes=multiprocessing.cpu_count() - 1 or 1) as pool:
        pool.map(download_image, img_urls)
    t2 = time.perf_counter()
    print(f'Finished in {t2 - t1} seconds')