Generating random values for a test DataFrame given constraints

Time:08-29

I'm trying to generate a DataFrame that has random values; something like this:

In [75]: df
Out[75]: 
        Name       mag1       mag2       mag3  redshift
0   Galaxy 1  11.657170  12.881492  14.230583    0.1125
1   Galaxy 2  19.720113  14.297871        NaN    1.2252
2   Galaxy 3  11.026038  11.116287  17.689447    2.5548
3   Galaxy 4        NaN  16.218209  11.928297    1.8845
4   Galaxy 5  15.287412  19.199692  19.392112    4.5512
5   Galaxy 6  12.283413  12.425423  19.141460    0.9583
6   Galaxy 7  18.738156        NaN  16.179031    1.8271
7   Galaxy 8  16.277030  13.728240  11.800716    2.8819
8   Galaxy 9  16.672178  14.608468  10.145000    3.9710
9  Galaxy 10  17.836160  17.828570  13.813578    0.2790

The columns have been generated with

col0 = ['Galaxy 1','Galaxy 2','Galaxy 3','Galaxy 4','Galaxy 5','Galaxy 6','Galaxy 7','Galaxy 8','Galaxy 9','Galaxy 10']
col1 = np.random.uniform(10, 20, 10)
col2 = np.random.uniform(10, 20, 10)
col3 = np.random.uniform(10, 20, 10)
col4 = np.random.uniform(0.01, 5, 10)

and stitched together with

df = pd.DataFrame(list(zip(col0, col1, col2, col3, col4)), columns=['Name', 'mag1', 'mag2', 'mag3', 'redshift'])

The NaNs were inserted manually (there are no NaNs in redshift). This works fine, but how could I automate it to produce a random DataFrame with a variable number of mag columns and a similar structure? Perhaps with a call like df = random_df(size=(20, 5)) for 20 galaxies and 5 mag columns?

Answer:

import pandas as pd
import numpy as np

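def make_test_df(n_galaxies=10, n_mags=3, seed=0):
    # Seed the global NumPy RNG so the output is reproducible.
    np.random.seed(seed)
    # Magnitudes: uniform in [10, 20), one column per mag.
    data = np.random.uniform(10, 20, (n_galaxies, n_mags))
    # Put exactly one NaN in each mag column, each in a different row
    # (requires n_mags <= n_galaxies).
    data[(np.random.choice(n_galaxies, n_mags, replace=False), range(n_mags))] = np.nan

    df = pd.DataFrame(data, columns=[f'mag{i}' for i in range(1, n_mags + 1)])
    df.insert(0, 'Name', [f'Galaxy {i}' for i in range(1, n_galaxies + 1)])
    # Redshift is drawn after the NaN step, so it never contains NaN.
    df['redshift'] = np.random.uniform(0.01, 5, n_galaxies)

    return df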

Result of make_test_df(20, 5):

         Name       mag1       mag2       mag3       mag4       mag5  redshift
0    Galaxy 1  15.488135  17.151894  16.027634  15.448832  14.236548  1.494210
1    Galaxy 2  16.458941  14.375872        NaN  19.636628  13.834415  4.070851
2    Galaxy 3  17.917250  15.288949  15.680446  19.255966  10.710361  1.988564
3    Galaxy 4  10.871293  10.202184  18.326198  17.781568  18.700121  4.406705
4    Galaxy 5  19.786183  17.991586  14.614794  17.805292  11.182744  2.910552
5    Galaxy 6  16.399210  11.433533  19.446689        NaN  14.146619  4.409859
6    Galaxy 7        NaN  17.742337  14.561503  15.684339  10.187898  3.465733
7    Galaxy 8  16.176355  16.120957  16.169340  19.437481  16.818203  3.629019
8    Galaxy 9  13.595079  14.370320  16.976312  10.602255  16.667667  2.511609
9   Galaxy 10  16.706379        NaN  11.289263  13.154284  13.637108  4.780857
10  Galaxy 11  15.701968  14.386015  19.883738  11.020448  12.088768  3.223511
11  Galaxy 12  11.613095  16.531083  12.532916  14.663108  12.444256  2.125037
12  Galaxy 13  11.589696  11.103751  16.563296  11.381830  11.965824  3.035902
13  Galaxy 14  13.687252  18.209932  10.971013  18.379449  10.960984  0.105774
14  Galaxy 15  19.764595  14.686512  19.767611  16.048455        NaN  1.514858
15  Galaxy 16  10.391878  12.828070  11.201966  12.961402  11.187277  3.304266
16  Galaxy 17  13.179832  14.142630  10.641475  16.924721  15.666015  1.457487
17  Galaxy 18  12.653895  15.232481  10.939405  15.759465  19.292962  3.093897
18  Galaxy 19  13.185690  16.674104  11.317979  17.163272  12.894061  2.149556
19  Galaxy 20  11.831914  15.865129  10.201075  18.289400  10.046955  0.686016
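
The function above places exactly one NaN in each mag column. If you would rather scatter NaNs at random and keep the call shape from the question, random_df(size=(20, 5)), something like the following might work. This is only a sketch: the nan_frac parameter is a name made up here for illustration, and it uses np.random.default_rng instead of the global seed, so the numbers will differ from the table above.

```python
import numpy as np
import pandas as pd


def random_df(size=(10, 3), nan_frac=0.1, seed=None):
    """Build a random galaxy table; `size` is (n_galaxies, n_mags).

    `nan_frac` is a hypothetical parameter: the approximate fraction of
    magnitude cells to blank out.
    """
    n_galaxies, n_mags = size
    rng = np.random.default_rng(seed)

    # Magnitudes drawn uniformly from [10, 20), one column per mag.
    mags = rng.uniform(10, 20, size=(n_galaxies, n_mags))

    # Blank out roughly `nan_frac` of the magnitude cells, each chosen independently.
    mags[rng.random(mags.shape) < nan_frac] = np.nan

    df = pd.DataFrame(mags, columns=[f"mag{i}" for i in range(1, n_mags + 1)])
    df.insert(0, "Name", [f"Galaxy {i}" for i in range(1, n_galaxies + 1)])

    # Redshift is added after the masking step, so it never contains NaN.
    df["redshift"] = rng.uniform(0.01, 5, size=n_galaxies)
    return df


df = random_df(size=(20, 5), seed=42)
```

Because each cell is masked independently, the number of NaNs varies from run to run, and a column may end up with none or several. Pass a fixed seed if the test data has to be reproducible.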