What is the most efficient way to verify multiple email hostnames in a pandas dataframe

I have a pandas DataFrame with over 1000 rows and a column holding the hostname of each email address:

email               hostname
[email protected]   example.com
[email protected]  example.com
[email protected]  example2.com
[email protected]  example3.com

I want to go through each hostname and check whether it truly exists or not.

email               hostname      valid_hostname
[email protected]   example.com   True
[email protected]  example.com   False
[email protected]  example2.com  False
[email protected]  example3.com  False

First, I extracted the hostname of each email address:

df['hostname'] = df['email'].str.split('@').str[1]

Then, I tried to check the DNS using pyIsEmail, but that was too slow:

from pyisemail import is_email    
df['valid_hostname'] = df['hostname'].apply(lambda x: is_email(x, check_dns=True))

Then, I tried a multi-threaded function:

import requests
from requests.exceptions import ConnectionError

def validate_hostname_existence(hostname:str):
    try:
        response = requests.get(f'http://{hostname}', timeout=0.5)
    except ConnectionError:
        return False
    else:
        return True

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
        df['valid_hostname'] = pd.Series(executor.map(validate_hostname_existence, df['hostname']),index=df['hostname'].index)

But that did not work well either, as I'm fairly new to parallel code. It raises multiple errors, and I believe it would be much more efficient if I could first check whether a hostname has already been checked and skip the repeated request. I would like to go as far as I can without actually sending an email.

Is there a library or another way to accomplish this? I have not been able to find a proper solution so far.

CodePudding user response：

You answered it yourself: you can use a cache to keep the hostnames you have already checked in memory.

For example:

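from functools import lru_cache
from pyisemail import is_email

@lru_cache(maxsize=None)  # the keyword is maxsize; None leaves the cache unbounded
def my_is_email(x, check_dns=True):
    # Repeated calls with the same arguments return the cached result
    return is_email(x, check_dns=check_dns)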

It's also recommended to limit the size so the cache cannot grow without bound. For example:

@lru_cache(maxsize=256)

For more information, see the Python documentation for functools.lru_cache.
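As a minimal sketch of the same caching idea applied to hostnames rather than full addresses, the lookup below uses the standard library's socket.getaddrinfo in place of pyIsEmail. The function name hostname_resolves is hypothetical, and treating "resolves to an address" as "exists" is an assumption: a domain that publishes only MX records would be reported as False, and checking those would need a DNS library.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=None)
def hostname_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address.

    Hypothetical helper: it only checks address resolution, not MX records.
    """
    try:
        # port=None asks only for address resolution, no service lookup
        socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name did not resolve
        # UnicodeError: the name is not a valid hostname (e.g. an empty label)
        return False
    return True
```

It could then be applied with df['valid_hostname'] = df['hostname'].map(hostname_resolves), so that each distinct hostname triggers at most one lookup. Note that failed lookups are cached too, so a temporary DNS failure stays False until hostname_resolves.cache_clear() is called.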

CodePudding user response：

I don't know anything about pandas, but here is how you can process a list of emails in parallel and get back a set of valid emails. I'm sure you can adapt this to your pandas case.

from queue import Empty, Queue
from threading import Thread, Lock
from pyisemail import is_email

q = Queue()
lock = Lock()
valid = set()

emails = ["[email protected]", "[email protected]"]
for e in emails:
    q.put(e)


def process_queue(queue: Queue):
    while True:
        try:
            email = queue.get(block=False)
        except Empty:
            break
        if is_email(email, check_dns=True):
            # Hold the lock only while updating the shared set
            with lock:
                valid.add(email)


NUM_THREADS = 30
threads = []

for i in range(NUM_THREADS):
    thread = Thread(target=process_queue, args=(q,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

print("done")
print(valid)

Explanation

Create a queue object filled with emails.
Create NUM_THREADS threads.
Each thread pulls from the queue. If it gets an email, it validates the email and, if valid, adds it to the results set while holding the lock that protects the set. If there are no emails left, the thread terminates.
Wait for all threads to terminate.
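
To adapt this to the original question, one option is to run the check only once per distinct hostname in a thread pool and then map the results back to the rows. This is a sketch, not a tuned implementation: check_unique_hostnames and its check parameter are hypothetical names, and check stands for whatever validation function you choose (for example a DNS lookup).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_unique_hostnames(
    hostnames: Iterable[str],
    check: Callable[[str], bool],
    max_workers: int = 30,
) -> Dict[str, bool]:
    """Run `check` once per distinct hostname and return {hostname: result}."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique = list(dict.fromkeys(hostnames))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as `unique`
        results = list(executor.map(check, unique))
    return dict(zip(unique, results))
```

With a DataFrame, the results could be mapped back with results = check_unique_hostnames(df['hostname'].dropna(), check) followed by df['valid_hostname'] = df['hostname'].map(results). Because executor.map returns results in input order, no explicit lock or shared set is needed here.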