I have a sample dataset with 1 million records. I want to select a random value from each column to generate a new row, building up a sample dataset with 3 million rows. I found a way to do this; however, it takes about 1 s per row. Is there a way to make this faster?
import time

import pandas as pd

newRows = 3000000
newData = pd.DataFrame()
start = time.time()
for i in range(newRows):
    row = {}  # renamed from `dict` to avoid shadowing the built-in
    for column in source.columns:
        # draw a single random value from this column
        row[column] = [source.sample()[column].values[0]]
    # DataFrame.append grows the frame one row at a time (deprecated since
    # pandas 1.4 and removed in 2.0)
    newData = newData.append(row, ignore_index=True)
end = time.time()
elapsed = end - start
print(elapsed)
print(newRows / elapsed)
CodePudding user response:
Try with numpy.random.choice:
import numpy as np
# pick one random source-row position for every output cell
# (this assumes source has a default RangeIndex, so labels equal positions)
indices = np.random.choice(source.index, size=(newRows, source.shape[1]), replace=True)
newData = pd.DataFrame(data=source.to_numpy()[indices, np.arange(len(source.columns))],
                       columns=source.columns)
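For reference, a minimal end-to-end run of this approach on a toy frame (the column names and values here are made up for illustration):

import numpy as np
import pandas as pd

source = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"]})  # stand-in for the 1M-row data
newRows = 5
indices = np.random.choice(source.index, size=(newRows, source.shape[1]), replace=True)
newData = pd.DataFrame(data=source.to_numpy()[indices, np.arange(len(source.columns))],
                       columns=source.columns)
print(newData)  # 5 rows, each cell drawn independently from its own column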
Or:
newData = source.apply(np.random.choice, size=newRows, replace=True)
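If reproducibility matters, the same column-wise idea can be written with NumPy's Generator API; this is just a sketch, and the seed value is arbitrary:

rng = np.random.default_rng(42)  # arbitrary fixed seed for repeatable output
newData = source.apply(lambda col: rng.choice(col.to_numpy(), size=newRows, replace=True))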
CodePudding user response:
Loop over the columns and sample 3M values (with replacement) from each, then build a new data frame:
newRows = 3000000
newData = pd.DataFrame({
    c: source[c].sample(n=newRows, replace=True).values
    for c in source.columns  # `for c in source` works as well
})
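One caveat if you add random_state for reproducibility: passing the same seed to every column's sample draws the same row positions for each column, which just duplicates existing rows instead of mixing values. Distinct seeds keep the draws independent; the enumerate-based seeding below is only one simple scheme:

newData = pd.DataFrame({
    c: source[c].sample(n=newRows, replace=True, random_state=i).values
    for i, c in enumerate(source.columns)  # a different seed per column
})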