Parallelize dummy data generation in pandas

Time:07-05

I would like to generate a dummy dataset of 40 million records, each with a fake first name and last name, using multiple processor cores.

Below is a single task loop that generates a first name and a last name and appends them to a list:

import pandas as pd
from faker import Faker

def fake_data_generation(records):
    fake = Faker(['en_US','en_GB'])
    
    person = []
    
    for i in range(records):
        first_name = fake.first_name()
        last_name = fake.last_name()
        person.append({"First_Name": first_name,
                       "Last_Name": last_name}
                     )
    return person

Output:

# The original loop overwrote df on every pass, so only the final call (4 records) was kept
df = pd.DataFrame(fake_data_generation(4))
df
First_Name Last_Name
Faith Williams
Colin Mitchell
Samantha Rodgers
Anna Blackwell

CodePudding user response:

Maybe you can use the providers directly and sample from their name lists with NumPy, instead of calling Faker once per record:

import pandas as pd
import numpy as np
from faker.providers.person.en_US import Provider as us
from faker.providers.person.en_GB import Provider as gb

first_names = list(set(us.first_names).union(gb.first_names))
last_names = list(set(us.last_names).union(gb.last_names))

N = 40_000_000
df = pd.DataFrame({'First_Name': np.random.choice(first_names, N),
                   'Last_Name': np.random.choice(last_names, N)})

Output:

>>> df
         First_Name Last_Name
0             Kayla      Tran
1              Gary     Bates
2             Daisy   Leblanc
3           Tiffany     Ahmed
4            Kellie       May
...             ...       ...
39999995   Kristine   Collier
39999996      Joyce     Mccoy
39999997       Paul   Padilla
39999998      Tonya     Bevan
39999999      Julie    Bright

[40000000 rows x 2 columns]
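
If memory or reproducibility becomes a concern at this size, one possible variation is sketched below. It draws integer codes from a seeded NumPy generator and wraps them in categorical columns, so each distinct name is stored only once. The function name `sample_names` and its `seed` parameter are hypothetical, and, like the snippet above, it picks every name with equal probability.

```python
import numpy as np
import pandas as pd


def sample_names(first_names, last_names, n, seed=None):
    """Sample n name pairs uniformly from the given pools (hypothetical helper).

    Both pools must contain unique values, which holds when they are
    built from a set as in the answer above.
    """
    rng = np.random.default_rng(seed)  # a fixed seed makes the draw repeatable

    # Draw positions into the pools rather than the strings themselves.
    first_codes = rng.integers(0, len(first_names), size=n)
    last_codes = rng.integers(0, len(last_names), size=n)

    # Categorical columns keep the integer codes plus one copy of each name.
    return pd.DataFrame({
        "First_Name": pd.Categorical.from_codes(first_codes, categories=first_names),
        "Last_Name": pd.Categorical.from_codes(last_codes, categories=last_names),
    })
```

With the `first_names` and `last_names` lists built above, `sample_names(first_names, last_names, N, seed=0)` should return a frame of the same shape. The columns can be converted back to plain strings with `astype(str)` if a later step needs that.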

CodePudding user response:

I tried the approach below, which worked for me. I'd appreciate any reviews or modifications that improve performance or remove unnecessary steps.

from joblib import Parallel, delayed
import pandas as pd
from faker import Faker
from itertools import chain

fake = Faker(['en_US','en_GB'])

def generate_names_df():
    names = []
    first_name = fake.first_name()
    last_name = fake.last_name()
    names.append({"First_Name": first_name,
                  "Last_Name": last_name}
                )
    return names

results = Parallel(n_jobs=15)(delayed(generate_names_df)() for i in range(40000000))
results_unlisted = list(chain(*results))
df = pd.DataFrame(results_unlisted)
df.shape