How do I generate random strings in a column based on some parameters from other columns of a datafr-CodePudding

I am trying to create a simulated dataset of emails. For that, I want to generate recipients based on 2 parameters:

how many recipients
how many domains those recipients will be from

For that, I have created a dataframe whose first few rows are as follows:

import pandas as pd

data = {'Date':['19/06/2022', '19/06/2022', '20/06/2022', '20/06/2022', '21/06/2022', '21/06/2022'],
        'Time':['8:25:12', '9:21:33', '18:26:28', '11:39:23', '7:30:47', '20:27:48'],
        'Sender': ['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]'],
        'Number of recipient domains': [2, 3, 3, 4, 3, 5],
        'Number of recipients': [7, 4, 7, 4, 6, 7]
       }
df = pd.DataFrame(data)

Now I need to generate the Recipients column, that will hold some random recipients according to the columns Number of recipient domains and Number of recipients. For example, take the first row - it should generate 7 email addresses from 2 domains (like, @abc.com and @xyz.com for instance).

How do I do that?

I can write a function to generate email addresses with only the number of email addresses to be generated as argument (i.e., with the number of domains parameter removed):

import string, random

def random_email_gen(num_recipients):
    emails = []
    for _ in range(int(num_recipients)):
        name = ''.join(random.choice(string.ascii_lowercase) for _ in range(4))
        domain = ''.join(random.choice(string.ascii_lowercase) for _ in range(3))
        
        emails.append(name   '@'   domain   '.com')
    return emails

random_email_gen(num_recipients=7)
>>> ['[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]']

But how do I extend it to generate the randomized email addresses with the number of domains parameter as well?

CodePudding user response：

You can pre-generate a list of names and domains and just combine them to create the number of combinations you need. The main trick is to use random.sample to get samples without replacement, so you are assured to have unique names.

Note that, in production code, the asserts should be changed for something like a while loop that tries to create unique names until it succeeds (for any number of names or domains larger than 3, the chance of getting it right the first time is very high, so this is not a significant performance penalty).

import random as rd
import string

# Create a list of names and domains, for efficiency:
max_num_names = max(df['Number of recipients'])
max_num_domains = max(df['Number of recipient domains'])

domain_ends = ['.com', '.org', '.net']
domains = list({''.join(rd.choice(string.ascii_lowercase) for _ in range(max_num_domains))   rd.choice(domain_ends) for _ in range(max_num_domains)})
names = list({''.join(rd.choice(string.ascii_lowercase) for _ in range(max_num_names)) for _ in range(max_num_names)})

# These asserts will fail only if we were very unlucky and hit two equal names, which has a probability of 26**(-n), where n is either max_num_names or max_num_domains
assert len(domains) == max_num_domains
assert len(names) == max_num_names

# This function generates a list of n_names, that is assured to be all different and have exactly n_domains
def get_random_names(n_domains, n_names):
    local_domains = rd.sample(domains, n_domains)
    local_names = rd.sample(names, n_names)
    return ['@'.join((name, local_domains[i % n_domains])) for i, name in enumerate(local_names)]

CodePudding user response：

def random_email_gen(num_recipients, num_domains):
    names = [''.join(random.choice(string.ascii_lowercase) for _ in range(4)) for _ in range(num_recipients)]
    domains = [''.join(random.choice(string.ascii_lowercase) for _ in range(3)) for _ in range(num_domains)]
    return [f'{name}@{domains[i % num_domains]}.com' for i, name in enumerate(names)]

Result for random_email_gen(7, 2) (may vary):

['[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]',
 '[email protected]']

There's, however, a small probability that there might be multiple identical names or domains in the lists generated in the function.