I have a pandas DataFrame with a column containing the hostname of each email address (over 1000 rows):
email hostname
[email protected] example.com
[email protected] example.com
[email protected] example2.com
[email protected] example3.com
I want to go through each hostname and check whether it truly exists or not.
email hostname valid_hostname
[email protected] example.com True
[email protected] example.com False
[email protected] example2.com False
[email protected] example3.com False
First, I extracted the hostname of each email address:
df['hostname'] = df['email'].str.split('@').str[1]
Then, I tried to check the DNS using pyIsEmail, but that was too slow:
from pyisemail import is_email
df['valid_hostname'] = df['hostname'].apply(lambda x: is_email(x, check_dns=True))
Then, I tried a multi-threaded function:
import pandas as pd
import requests
from requests.exceptions import ConnectionError

def validate_hostname_existence(hostname: str):
    try:
        response = requests.get(f'http://{hostname}', timeout=0.5)
    except ConnectionError:
        return False
    else:
        return True

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    df['valid_hostname'] = pd.Series(
        executor.map(validate_hostname_existence, df['hostname']),
        index=df['hostname'].index,
    )
But that did not go well either, as I'm pretty new to parallel functions. It has multiple errors, and I believe it would be much more efficient if I could first check whether a hostname has already been checked and skip the repeated request entirely. I would like to go as far as I can without actually sending an email.
Is there a library or a way to accomplish this? I could not find a proper solution to this problem so far.
CodePudding user response:
You answered it yourself: you can use a cache to keep in memory the hostnames you have already checked.
For example:
from functools import lru_cache
from pyisemail import is_email

@lru_cache(maxsize=None)
def my_is_email(x, check_dns=True):
    return is_email(x, check_dns=check_dns)
It's also recommended to limit the cache size to prevent unbounded memory growth. For example:
@lru_cache(maxsize=256)
For more information, read the `functools.lru_cache` documentation.
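To see how the cache deduplicates work, here is a self-contained sketch. The `cached_check` function and the sample hostnames are made up stand-ins for the wrapped `is_email(..., check_dns=True)` call, so the example runs without any network access:

```python
from functools import lru_cache

calls = 0  # counts how often the "expensive" check really runs

@lru_cache(maxsize=None)
def cached_check(hostname: str) -> bool:
    """Stand-in for a slow DNS check; substitute your real check here."""
    global calls
    calls += 1
    return hostname == "example.com"

hostnames = ["example.com", "example.com", "example2.com", "example.com"]
results = [cached_check(h) for h in hostnames]

print(results)                          # [True, True, False, True]
print(cached_check.cache_info().hits)   # 2 -- repeated hostnames hit the cache
print(calls)                            # 2 -- only two real checks for four rows
```

With over 1000 rows but far fewer distinct hostnames, this is where the time savings come from: each distinct hostname is checked exactly once.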
CodePudding user response:
I don't know anything about pandas, but here's how you can process a list of emails in parallel and get back a set of valid emails. I'm sure you can adapt this to your pandas case.
from queue import Empty, Queue
from threading import Thread, Lock

from pyisemail import is_email

q = Queue()
lock = Lock()
valid = set()

emails = ["[email protected]", "[email protected]"]
for e in emails:
    q.put(e)

def process_queue(queue: Queue):
    while True:
        try:
            email = queue.get(block=False)
        except Empty:
            break
        if is_email(email, check_dns=True):
            lock.acquire()
            valid.add(email)
            lock.release()

NUM_THREADS = 30

threads = []
for i in range(NUM_THREADS):
    thread = Thread(target=process_queue, args=(q,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

print("done")
print(valid)
Explanation
- Create a queue object filled with emails
- Create NUM_THREADS threads.
- Each thread pulls from the queue. If it gets an email, it processes it: it acquires the lock protecting the results set, adds the email to the set, and releases the lock. If no emails are left, the thread terminates.
- Wait for all threads to terminate.
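The same fan-out can also be written with `concurrent.futures.ThreadPoolExecutor`, which the question already started with, combined with the deduplication idea from the other answer. This is a minimal sketch with a made-up `check` function standing in for the real DNS lookup (replace it with your actual check):

```python
from concurrent.futures import ThreadPoolExecutor

def check(hostname: str) -> bool:
    """Stand-in for a real DNS check; substitute your actual lookup."""
    return hostname == "example.com"

hostnames = ["example.com", "example.com", "example2.com", "example3.com"]

# Deduplicate first so each distinct hostname is checked only once.
unique = list(set(hostnames))

with ThreadPoolExecutor(max_workers=30) as executor:
    # executor.map preserves input order, so zip pairs each hostname
    # with its own result.
    valid_map = dict(zip(unique, executor.map(check, unique)))

# Map the per-hostname results back onto the original (repeated) rows.
results = [valid_map[h] for h in hostnames]
print(results)  # [True, True, False, False]
```

The executor handles thread creation, the queue, and joining for you, and the `dict` lookup replaces the lock-protected set, since each unique hostname is written exactly once.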