How do I add data to a column only if a certain value exists in previous column using Python and Fak-CodePudding

I'm pretty new to Python and not sure what to even google for this. What I am trying to do is create a Pandas DataFrame that is filled with fake data by using Faker. The problem I am having is each column is generating fake data in a silo. I want to be able to have fake data created based on something that exists in a prior column.

So in my example below, I have pc_type ["PC", "Apple] From there I have the operating system and the options are Windows 10, Windows 11, and MacOS. Now I want only where pc_type = "Apple" to have the columns fill with the value of MacOS. Then for everything that is type PC, it's 50% Windows 10 and 50% Windows 11.

How would I write this code so that in the function body I can make that distinction clear and the results will reflect that?

from faker import Faker
from faker.providers import BaseProvider, DynamicProvider
import numpy as np
import pandas as pd
from datetime import datetime
import random

pc_type = ['PC', 'Apple']
fake = Faker()


def create_data(x):
    project_data = {}
    for i in range(0, x):
        project_data[i] = {}
        project_data[i]['Name'] = fake.name()
        project_data[i]['PC Type'] = fake.random_element(pc_type)
        project_data[i]['With Windows 10'] = fake.boolean(chance_of_getting_true=25)
        project_data[i]['With Windows 11 '] = fake.boolean(chance_of_getting_true=25)
        project_data[i]['With MacOS'] = fake.boolean(chance_of_getting_true=50)

    return project_data


df = pd.DataFrame(create_data(10)).transpose()
df

CodePudding user response：

To have coherent values, you can use something like:

from faker import Faker
import pandas as pd
import numpy as np


def create_data(x):
    pc_type = ['PC', 'Apple']
    fake = Faker()
    data = {'Name': [fake.name() for _ in range(x)],
            'PC Type': np.random.choice(pc_type, x)}
    df = pd.DataFrame(data)
    df['With MacOS'] = df['PC Type'] == 'Apple'

    pc = df['PC Type'] == 'PC'
    w10 = np.random.choice([True, False], len(df), p=(0.5, 0.5))
    df['With Windows 10'] = pc & w10
    df['With Windows 11'] = pc & ~w10

    return df

df = create_data(10)

Output:

>>> df
                Name PC Type  With MacOS  With Windows 10  With Windows 11
0     Charles Dawson      PC       False             True            False
1  Patricia Bautista      PC       False            False             True
2         Ruth Clark      PC       False             True            False
3       Justin Lopez      PC       False             True            False
4      Grace Russell      PC       False             True            False
5         Grant Moss      PC       False             True            False
6           Tracy Ho   Apple        True            False            False
7    Connie Mitchell   Apple        True            False            False
8  Catherine Nichols   Apple        True            False            False
9   Nathaniel Bryant      PC       False            False             True

CodePudding user response：

I'd slightly change the approach and generate a column OS. This column you can than transform into With MacOS etc. if needed.

With this approach its easier to get the 0.5 / 0.5 split within Windows right:

from faker import Faker
from faker.providers import BaseProvider, DynamicProvider
import numpy as np
import pandas as pd
from datetime import datetime
import random
from collections import OrderedDict

pc_type = ['PC', 'Apple']
wos_type = OrderedDict([('With Windows 10', 0.5), ('With Windows 11', 0.5)])
fake = Faker()

def create_data(x):
    project_data = {}
    for i in range(x):
        project_data[i] = {}
        project_data[i]['Name'] = fake.name()
        project_data[i]['PC Type'] = fake.random_element(pc_type)
        if project_data[i]['PC Type'] == 'PC':
            project_data[i]['OS'] = fake.random_element(elements = wos_type)
        else:
            project_data[i]['OS'] = 'MacOS'

    return project_data


df = pd.DataFrame(create_data(10)).transpose()
df

Output

                     Name PC Type               OS
0         Nicholas Walker   Apple            MacOS
1               Eric Hull      PC  With Windows 10
2       Veronica Gonzales      PC  With Windows 11
3  Mrs. Krista Richardson   Apple            MacOS
4              Anne Craig      PC  With Windows 10
5            Joseph Hayes      PC  With Windows 10
6             Mary Nelson   Apple            MacOS
7               Jill Hunt   Apple            MacOS
8             Mark Taylor      PC  With Windows 11
9           Kyle Thompson      PC  With Windows 10