How to read file of URLs and web scrape them with multithreading


I am implementing a web scraping script in Python that reads a JSON file to get a list of URLs to scrape. The file contains over 60K rows, of which around 50K are unique (so I first remove the duplicates).

To do this, I currently have the following:

import contextlib
from bs4 import BeautifulSoup
import feedparser
import pandas
import requests
import time

BASE_URL = 'https://www.iso.org'

def create_iso_details_json(p_merged_iso_df):
    merged_iso_details_df = p_merged_iso_df.drop_duplicates(subset=['Link']).drop(columns=['TC', 'ICS'], axis=1)
    iso_details_dfs = [parse_iso_details(iso, stage, link) 
                       for iso, stage, link in zip(merged_iso_details_df['Standard and/or project'], merged_iso_details_df['Stage'], merged_iso_details_df['Link']) 
                       if link != '']
    merged_iso_details_df = pandas.concat(iso_details_dfs)
    print('Total rows retrieved: ', len(merged_iso_details_df.index))
    merged_iso_details_df.to_json('iso_details.json', orient="records")
    
def parse_iso_details(p_iso, p_stage, p_url):
    print('URL: ', p_url)
    soup = BeautifulSoup(requests.get(p_url).text, 'html.parser')
    try:
        feed_details_url = BASE_URL + soup.find('section', {'id': 'product-details'}).find('a', {'class': 'ss-icon ss-social-circle text-warning text-sm'})['href']
    except AttributeError:
        print('Could not find feed data for URL: ', p_url)
        # Fall back to None so the checks below do not raise a NameError
        feed_details_url = None
    print(feed_details_url)
    iso_details_dfs = []
    if feed_details_url is not None:
        iso_details_dfs.append(read_iso_details(feed_details_url, p_iso, p_stage))
    with contextlib.suppress(ValueError):
        return pandas.concat(iso_details_dfs)
    
def read_iso_details(p_feed_details_url, p_iso, p_stage):
    data = {'Standard and/or project': p_iso, 'Stage': p_stage}
    df = pandas.DataFrame(data, index=[0])
    feed = feedparser.parse(p_feed_details_url)
    df['Publication date'] = [entry.published for entry in feed.entries]
    return df

def main():
    start_time = time.time()
    merged_iso_df = pandas.read_json('input_file.json', dtype={"Stage": str})
    create_iso_details_json(merged_iso_df)
    print(f"--- {time.time() - start_time} seconds ---")

if __name__ == "__main__":
    main()

I merge the results into a pandas DataFrame so I can write them to another JSON file later.

Now, this takes a long time, since the process makes one request per input URL and each request takes between 0.5 and 1 second.

I would like to implement this process with multithreading (not multiprocessing) so that the processing time decreases significantly.

What is the best approach to achieve this? Should I split the input JSON file into as many parts as the number of threads I create for processing? And how do I merge the results of each thread into one DataFrame to write the output JSON file?

Thank you in advance.

CodePudding user response:

This website explains multithreading pretty well. What you could do is split the URLs into equal parts and process them simultaneously. The drawback is that you basically just divide the total time by the number of threads you use, but to my knowledge this is the best you can do without overcomplicating things.
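As a minimal sketch of that idea with concurrent.futures (the scrape_one worker below is a hypothetical stand-in for your parse_iso_details logic; the thread pool does the splitting across threads and the merging of results for you):

from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas
import requests

def scrape_one(url):
    # Hypothetical per-URL worker: fetch the page and return a one-row DataFrame.
    # Replace the body with your own parsing logic.
    response = requests.get(url, timeout=10)
    return pandas.DataFrame({'Link': [url], 'Status': [response.status_code]})

def scrape_all(urls, max_workers=16):
    results = []
    # One task per URL; the pool schedules them across max_workers threads,
    # so there is no need to split the input file manually.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(scrape_one, url) for url in urls]
        for future in as_completed(futures):
            results.append(future.result())
    # Merging the per-thread results is just a single concat at the end.
    return pandas.concat(results, ignore_index=True)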

CodePudding user response:

I would go with asyncio and aiohttp. Here is a complete example of how to make multiple requests concurrently and collect the results at the end:

import aiohttp
import asyncio

async def geturl(url, session):
    async with session.get(url) as resp:
        if resp.status == 200:
            return (await resp.json())['name']
        else:
            return "ERROR"

async def main():
    urls = [f'https://pokeapi.co/api/v2/pokemon/{i}' for i in range(1,10)]
    async with aiohttp.ClientSession() as session:
        tasks = [geturl(url, session) for url in urls]
        # asyncio.gather will run all the tasks concurrently
        # and return their results once all tasks have returned
        all_results = await asyncio.gather(*tasks)
        print(all_results)

asyncio.run(main())

This will print the first nine Pokémon names, by the way; you can tweak it for your needs.
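Applied to your case, a rough sketch could look like the following (fetch_details, scrape_all and the semaphore limit are illustrative names and values, not part of your code, and the CSS class filter from your original find('a', ...) call is omitted for brevity; the feedparser step is synchronous, so you would either run it after the downloads finish or push it into a thread with asyncio.to_thread):

import asyncio

import aiohttp
import pandas
from bs4 import BeautifulSoup

BASE_URL = 'https://www.iso.org'

async def fetch_details(url, session, semaphore):
    # The semaphore caps how many requests are in flight at once,
    # so ~50K URLs are not all fired at the same time.
    async with semaphore:
        async with session.get(url) as resp:
            html = await resp.text()
    soup = BeautifulSoup(html, 'html.parser')
    section = soup.find('section', {'id': 'product-details'})
    anchor = section.find('a') if section is not None else None
    feed_url = BASE_URL + anchor['href'] if anchor is not None else None
    return {'Link': url, 'Feed': feed_url}

async def scrape_all(urls, limit=50):
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_details(url, session, semaphore) for url in urls]
        rows = await asyncio.gather(*tasks)
    # Merging is a single DataFrame built from all the per-URL results.
    return pandas.DataFrame(rows)

# merged_df = asyncio.run(scrape_all(list_of_urls))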
