Web scraping of hyperlinks going so slow


I am using the following function to scrape the Twitter URLs from a list of websites.

import httplib2
import bs4 as bs
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urlparse
import pandas as pd
import swifter


def twitter_url(website):  # the website address is passed to the function as a string

    try:
        http = httplib2.Http()
        # fetch the page over HTTPS ('+' concatenates the scheme and the address)
        status, response = http.request('https://' + website)

        url = 'https://twitter.com'
        search_domain = urlparse(url).hostname

        links = []

        # restrict parsing to <a> tags so BeautifulSoup does less work
        for link in bs.BeautifulSoup(response, 'html.parser',
                                     parse_only=SoupStrainer('a')):
            if link.has_attr('href') and search_domain in link['href']:
                links.append(link['href'])

        # deduplicate before returning
        return list(set(links))

    except Exception:
        # the request failed (e.g. connection refused); return nothing for this site
        return None

I then apply the function to the dataframe, which contains the website addresses:

df['twitter_id'] = df.swifter.apply(lambda x: twitter_url(x['Website address']), axis=1)

The dataframe has about 100,000 website addresses. Even when I run the code on a sample of 10,000, it is very slow. Is there any way to make this run faster?

CodePudding user response:

The bottleneck is almost certainly the time it takes to retrieve the HTML for each website.

Since the URLs are processed one after the other, even if each request took only 100 ms, 10,000 of them would still take about 1,000 s (roughly 17 minutes) to finish, and the full 100,000 would take ten times that.

If, however, you fetch each URL in a separate thread, the waits overlap and the total time drops significantly, because most of each request is spent idle on network I/O.

You can use the threading library (or the higher-level concurrent.futures module) to accomplish that, as sketched below.
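
For example, here is a minimal sketch using concurrent.futures, which wraps threading for you. It assumes the twitter_url function and the df with a 'Website address' column from the question; the max_workers value of 32 is just a guess you would tune for your machine and network.

from concurrent.futures import ThreadPoolExecutor

# Run the per-site requests concurrently; executor.map returns the results
# in the same order as the input addresses, so they line up with the dataframe.
with ThreadPoolExecutor(max_workers=32) as executor:
    results = list(executor.map(twitter_url, df['Website address']))

df['twitter_id'] = results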

CodePudding user response:

Try using a threading library; it will be much faster. Also, if what you ultimately need is Twitter data, consider a library like Tweepy, which talks to the Twitter API directly.
