Generate values in separate dataframe

Time:02-12

I am trying to generate random data with pandas.

The data needs to be stored in two columns. The first column should contain categorical values (Stratum_1 through Stratum_19); each stratum can contain a random number of rows.

The second column should contain values in the range 1 to 180000000, with a mean of 170000 and a standard deviation of 453210, across 100000 rows.

I tried:

import numpy as np
import pandas as pd

categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
    'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}


desired_mean = 170000
desired_std_dev = 453210

df = pd.DataFrame(np.random.randint(0,180000000,size=(100000, 1)),columns=list('1'))

I tried the code above, but I don't know how to combine the categorical and numerical values with the desired mean and standard deviation. Can anybody help me solve this and generate the data?

CodePudding user response:

Try:

import numpy as np
import pandas as pd

categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
    'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}

n_rows = 100000
desired_mean = 170000
desired_std_dev = 453210

# 'num' is drawn from a normal distribution; 'cat' picks a stratum at random for each row
df = pd.DataFrame({'num': np.random.normal(desired_mean, desired_std_dev, size=n_rows),
                   'cat': np.random.choice(categorical['name'], n_rows)})

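To check that the generated frame roughly matches the requested statistics, something like the sketch below could be used. It rebuilds the same two columns with a seeded generator; the seed, the `names` list and the `rng` variable are my own assumptions for reproducibility, not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical helper names; the seed is arbitrary and only makes the run repeatable.
names = [f"Stratum_{i}" for i in range(1, 20)]
rng = np.random.default_rng(0)
n_rows = 100000

df = pd.DataFrame({
    "num": rng.normal(170000, 453210, size=n_rows),
    "cat": rng.choice(names, size=n_rows),
})

# Sample mean and standard deviation should land close to the targets.
print(df["num"].mean(), df["num"].std())

# Number of rows per stratum; the counts vary randomly from stratum to stratum.
print(df["cat"].value_counts())
```

With 100000 rows the sample mean and standard deviation should differ from the targets only by sampling noise, and every stratum should appear with a different, random row count.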

CodePudding user response:

Try the following code:

import numpy as np
import pandas as pd

categorical = {'name': ['Stratum_1','Stratum_2','Stratum_3','Stratum_4','Stratum_5','Stratum_6','Stratum_7','Stratum_8','Stratum_9',
    'Stratum_10','Stratum_11','Stratum_12','Stratum_13','Stratum_14','Stratum_15','Stratum_16','Stratum_17','Stratum_18','Stratum_19']}

n_rows = 100000
desired_mean = 170000
desired_std_dev = 453210

generator = np.random.default_rng()

df = pd.DataFrame(categorical).sample(n=n_rows, replace=True).reset_index(drop=True)
df['value'] = generator.normal(loc=desired_mean, scale=desired_std_dev, size=n_rows)
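One caveat: a normal distribution with mean 170000 and standard deviation 453210 produces many values below 1, which is outside the range the question asks for. If the values have to stay within 1 to 180000000, one possible alternative (my own suggestion, not something either answer proposes) is a log-normal distribution whose parameters are chosen by moment matching: sigma^2 = ln(1 + std^2 / mean^2) and mu = ln(mean) - sigma^2 / 2. A minimal sketch, where the seed, `lower`, `upper` and `names` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

n_rows = 100000
desired_mean = 170000
desired_std_dev = 453210
lower, upper = 1, 180000000  # range taken from the question

# Moment matching: pick the log-normal parameters so that the
# distribution's mean and standard deviation equal the targets.
sigma2 = np.log(1 + (desired_std_dev / desired_mean) ** 2)
mu = np.log(desired_mean) - sigma2 / 2

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
values = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=n_rows)

# Clip the rare draws that fall outside the requested range.
values = np.clip(values, lower, upper)

names = [f"Stratum_{i}" for i in range(1, 20)]
df = pd.DataFrame({"name": rng.choice(names, size=n_rows), "value": values})

print(df["value"].min(), df["value"].max())
print(df["value"].mean(), df["value"].std())
```

Because this distribution is heavy-tailed, the sample standard deviation fluctuates noticeably from run to run, so it only approximates the target. Whether a log-normal shape suits the data is a modelling choice the question does not settle.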