I'm pretty new to Python and not sure what to even google for this. I am trying to create a Pandas DataFrame filled with fake data using Faker. The problem is that each column generates its fake data in a silo. I want the fake data in one column to depend on the value in a prior column.
In my example below, I have pc_type = ["PC", "Apple"].
From there I have the operating system, and the options are Windows 10, Windows 11, and MacOS. I want only the rows where pc_type = "Apple" to be filled with MacOS. Everything that is type PC should be 50% Windows 10 and 50% Windows 11.
How would I write this code so that the function body makes that distinction clear and the results reflect it?
from faker import Faker
from faker.providers import BaseProvider, DynamicProvider
import numpy as np
import pandas as pd
from datetime import datetime
import random
pc_type = ['PC', 'Apple']
fake = Faker()
def create_data(x):
    project_data = {}
    for i in range(0, x):
        project_data[i] = {}
        project_data[i]['Name'] = fake.name()
        project_data[i]['PC Type'] = fake.random_element(pc_type)
        # Each flag below is drawn independently of 'PC Type' (this is the problem)
        project_data[i]['With Windows 10'] = fake.boolean(chance_of_getting_true=25)
        project_data[i]['With Windows 11'] = fake.boolean(chance_of_getting_true=25)
        project_data[i]['With MacOS'] = fake.boolean(chance_of_getting_true=50)
    return project_data
df = pd.DataFrame(create_data(10)).transpose()
df
CodePudding user response:
To have coherent values, you can use something like:
from faker import Faker
import pandas as pd
import numpy as np
def create_data(x):
    pc_type = ['PC', 'Apple']
    fake = Faker()
    data = {'Name': [fake.name() for _ in range(x)],
            'PC Type': np.random.choice(pc_type, x)}
    df = pd.DataFrame(data)
    # Apple rows always get MacOS
    df['With MacOS'] = df['PC Type'] == 'Apple'
    # PC rows get exactly one of Windows 10 / Windows 11, chosen 50/50
    pc = df['PC Type'] == 'PC'
    w10 = np.random.choice([True, False], len(df), p=(0.5, 0.5))
    df['With Windows 10'] = pc & w10
    df['With Windows 11'] = pc & ~w10
    return df
df = create_data(10)
Output:
>>> df
                Name PC Type  With MacOS  With Windows 10  With Windows 11
0     Charles Dawson       PC       False             True            False
1  Patricia Bautista       PC       False            False             True
2         Ruth Clark       PC       False             True            False
3       Justin Lopez       PC       False             True            False
4      Grace Russell       PC       False             True            False
5         Grant Moss       PC       False             True            False
6           Tracy Ho    Apple        True            False            False
7    Connie Mitchell    Apple        True            False            False
8  Catherine Nichols    Apple        True            False            False
9   Nathaniel Bryant       PC       False            False             True
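As a quick sanity check of the masking logic above, here is a minimal sketch that rebuilds only the PC Type and flag columns (the Name column is left out so it runs without Faker) and verifies that every row has exactly one OS flag set. The `make_flags` helper and the fixed seed are assumptions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd


def make_flags(n, seed=0):
    """Hypothetical helper: same masking idea as above, without the Name column."""
    rng = np.random.default_rng(seed)  # seeded only so the check is repeatable
    df = pd.DataFrame({'PC Type': rng.choice(['PC', 'Apple'], n)})
    pc = df['PC Type'] == 'PC'
    w10 = rng.random(n) < 0.5          # fair coin flip per row
    df['With MacOS'] = ~pc             # every non-PC row is Apple
    df['With Windows 10'] = pc & w10
    df['With Windows 11'] = pc & ~w10
    return df


flags = make_flags(1000)
os_cols = ['With MacOS', 'With Windows 10', 'With Windows 11']

# Each row should have exactly one OS flag set to True
exactly_one = bool(flags[os_cols].sum(axis=1).eq(1).all())
# Every Apple row should be flagged as MacOS
apple_is_mac = bool(flags.loc[flags['PC Type'] == 'Apple', 'With MacOS'].all())
print(exactly_one, apple_is_mac)
```

Because the Windows 11 flag is defined as the complement of the Windows 10 draw within the PC mask, the two can never both be True on the same row.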
CodePudding user response:
I'd slightly change the approach and generate a single OS column. You can then transform this column into With MacOS etc. if needed.
With this approach it's easier to get the 0.5 / 0.5 split within Windows right:
from faker import Faker
from faker.providers import BaseProvider, DynamicProvider
import numpy as np
import pandas as pd
from datetime import datetime
import random
from collections import OrderedDict
pc_type = ['PC', 'Apple']
wos_type = OrderedDict([('With Windows 10', 0.5), ('With Windows 11', 0.5)])
fake = Faker()
def create_data(x):
    project_data = {}
    for i in range(x):
        project_data[i] = {}
        project_data[i]['Name'] = fake.name()
        project_data[i]['PC Type'] = fake.random_element(pc_type)
        # Pick the OS based on the PC type chosen for this row
        if project_data[i]['PC Type'] == 'PC':
            # Weighted choice: the OrderedDict values are the probabilities
            project_data[i]['OS'] = fake.random_element(elements=wos_type)
        else:
            project_data[i]['OS'] = 'MacOS'
    return project_data
df = pd.DataFrame(create_data(10)).transpose()
df
Output:
                     Name PC Type               OS
0         Nicholas Walker   Apple            MacOS
1               Eric Hull      PC  With Windows 10
2       Veronica Gonzales      PC  With Windows 11
3  Mrs. Krista Richardson   Apple            MacOS
4              Anne Craig      PC  With Windows 10
5            Joseph Hayes      PC  With Windows 10
6             Mary Nelson   Apple            MacOS
7               Jill Hunt   Apple            MacOS
8             Mark Taylor      PC  With Windows 11
9           Kyle Thompson      PC  With Windows 10
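If you do need the separate boolean columns from the question, one possible way to derive them from the OS column is sketched below. The small hand-written frame stands in for the generated data (so the snippet runs without Faker), and the loop and column naming are my assumptions rather than anything the answer prescribes.

```python
import pandas as pd

# Small hand-written frame in the same shape as the output above
df = pd.DataFrame({
    'Name': ['Nicholas Walker', 'Eric Hull', 'Veronica Gonzales'],
    'PC Type': ['Apple', 'PC', 'PC'],
    'OS': ['MacOS', 'With Windows 10', 'With Windows 11'],
})

# One boolean column per OS value; 'MacOS' gets a 'With ' prefix so the
# column names match the ones used in the question
for os_name in ['With Windows 10', 'With Windows 11', 'MacOS']:
    col = os_name if os_name.startswith('With ') else f'With {os_name}'
    df[col] = df['OS'] == os_name

print(df)
```

Since each row has exactly one OS value, exactly one of the three derived columns is True per row. `pd.get_dummies(df['OS'])` would give a similar one-column-per-value result, with the column names taken directly from the OS values.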