I have a sample dataset with 1 million records. I want to select a random value from each column to generate a new row, building up a sample dataset with 3 million rows. I found a way to do this; however, it takes about 1 s per row. Is there a way to make this faster?
import time

import pandas as pd

newRows = 3000000
newData = pd.DataFrame()
start = time.time()
for i in range(newRows):
    row = {}  # renamed from `dict` to avoid shadowing the built-in
    for column in source.columns:
        # draw a single random value from this column
        row[column] = [source.sample()[column].values[0]]
    # DataFrame.append grows the frame one row at a time (deprecated since
    # pandas 1.4 and removed in 2.0)
    newData = newData.append(row, ignore_index=True)
end = time.time()
elapsed = end - start
print(elapsed)
print(newRows / elapsed)
CodePudding user response:
Try with numpy.random.choice:
import numpy as np
# pick one random source-row position for every output cell
# (this assumes source has a default RangeIndex, so labels equal positions)
indices = np.random.choice(source.index, size=(newRows, source.shape[1]), replace=True)
newData = pd.DataFrame(data=source.to_numpy()[indices, np.arange(len(source.columns))],
                       columns=source.columns)
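For reference, a minimal end-to-end run of this approach on a toy frame (the column names and values here are made up for illustration):

import numpy as np
import pandas as pd

source = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})  # stand-in for the 1M-row data
newRows = 5
indices = np.random.choice(source.index, size=(newRows, source.shape[1]), replace=True)
newData = pd.DataFrame(data=source.to_numpy()[indices, np.arange(len(source.columns))],
                       columns=source.columns)
print(newData)  # 5 rows, each cell drawn independently from its own column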
Or:
newData = source.apply(np.random.choice, size=newRows, replace=True)
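If reproducibility matters, the same column-wise idea can be written with NumPy's Generator API; this is just a sketch, and the seed value is arbitrary:

rng = np.random.default_rng(42)  # arbitrary fixed seed for repeatable output
newData = source.apply(lambda col: rng.choice(col.to_numpy(), size=newRows, replace=True))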
CodePudding user response:
Loop over the columns and sample 3M values (with replacement) from each, then build a new data frame:
newRows = 3000000
newData = pd.DataFrame({
    c: source[c].sample(n=newRows, replace=True).values
    for c in source.columns  # `for c in source` works as well
})
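One caveat if you add random_state for reproducibility: passing the same seed to every column's sample draws the same row positions for each column, which just duplicates existing rows instead of mixing values. Distinct seeds keep the draws independent; the enumerate-based seeding below is only one simple scheme:

newData = pd.DataFrame({
    c: source[c].sample(n=newRows, replace=True, random_state=i).values
    for i, c in enumerate(source.columns)  # a different seed per column
})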