I would like to generate a dummy dataset of 40 million records, each made up of a fake first name and last name, using multiple processor cores.
Below is a single-process loop that generates a first name and a last name and appends them to a list:
import pandas as pd
from faker import Faker

def fake_data_generation(records):
    fake = Faker(['en_US', 'en_GB'])
    person = []
    for i in range(records):
        first_name = fake.first_name()
        last_name = fake.last_name()
        person.append({"First_Name": first_name,
                       "Last_Name": last_name})
    return person
Example call and output:
df = pd.DataFrame(fake_data_generation(4))
df
| First_Name | Last_Name |
|---|---|
| Faith | Williams |
| Colin | Mitchell |
| Samantha | Rodgers |
| Anna | Blackwell |
CodePudding user response:
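Maybe you can skip the Faker instance and sample from the providers' name lists directly with NumPy. Note that np.random.choice samples uniformly here, so every name in the combined list is equally likely: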
import pandas as pd
import numpy as np
from faker.providers.person.en_US import Provider as us
from faker.providers.person.en_GB import Provider as gb
first_names = list(set(us.first_names).union(gb.first_names))
last_names = list(set(us.last_names).union(gb.last_names))
N = 40_000_000
df = pd.DataFrame({'First_Name': np.random.choice(first_names, N),
                   'Last_Name': np.random.choice(last_names, N)})
Output:
>>> df
First_Name Last_Name
0 Kayla Tran
1 Gary Bates
2 Daisy Leblanc
3 Tiffany Ahmed
4 Kellie May
... ... ...
39999995 Kristine Collier
39999996 Joyce Mccoy
39999997 Paul Padilla
39999998 Tonya Bevan
39999999 Julie Bright
[40000000 rows x 2 columns]
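At 40 million rows, two columns of Python strings can take a lot of memory. A possible variation, offered as a sketch rather than as part of the approach above: draw integer codes and wrap them in pandas categoricals, so each column stores integer codes plus one copy of each distinct name. The function name `sample_names` and its `seed` parameter are assumptions made for illustration; the name lists passed in would be the `first_names` and `last_names` built from the providers.

```python
import numpy as np
import pandas as pd


def sample_names(first_names, last_names, n, seed=None):
    """Return n uniformly sampled name pairs as categorical columns.

    Hypothetical helper: `first_names` and `last_names` must be lists of
    unique strings (e.g. the lists built from the Faker providers).
    """
    rng = np.random.default_rng(seed)
    # Draw positions into each list instead of drawing the strings themselves.
    first_codes = rng.integers(0, len(first_names), size=n)
    last_codes = rng.integers(0, len(last_names), size=n)
    return pd.DataFrame({
        "First_Name": pd.Categorical.from_codes(first_codes, categories=first_names),
        "Last_Name": pd.Categorical.from_codes(last_codes, categories=last_names),
    })
```

Calling `sample_names(first_names, last_names, 40_000_000)` would give the same two columns as before, with `category` dtype. Whether the saving matters depends on what is done with the frame afterwards, since some operations convert categoricals back to strings.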
CodePudding user response:
I tried the approach below, which worked for me. I'd appreciate any review or suggestions for improving performance or removing unnecessary steps.
from joblib import Parallel, delayed
import pandas as pd
from faker import Faker
from itertools import chain

fake = Faker(['en_US', 'en_GB'])

def generate_names_df():
    # Each call returns a one-element list holding a single record.
    names = []
    first_name = fake.first_name()
    last_name = fake.last_name()
    names.append({"First_Name": first_name,
                  "Last_Name": last_name})
    return names

# One task per record, spread across 15 worker processes.
results = Parallel(n_jobs=15)(delayed(generate_names_df)() for i in range(40000000))
# Flatten the list of one-element lists into a single list of records.
results_unlisted = list(chain(*results))
df = pd.DataFrame(results_unlisted)
df.shape