Can you sample a Pandas dataframe and modify the original dataframe to remove the sampled rows?

What I'm asking may not be computationally efficient/inexpensive.

What I want to do is select a row from my pandas DataFrame at random, and then modify the original DataFrame so that the row is "popped" from it.

So far what I've tried is taking the transpose of the DataFrame, and then applying pop() over the "column" I want to remove. The index of the column is chosen by a random number.

import numpy as np
import pandas as pd
from random import randrange

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

df_t = df.T

# df_t has one column per original row, so pick among its columns
random_index = randrange(len(df_t.columns))

popped = df_t.pop(random_index)

df = df_t.T

While this works for this small dataframe, I'm unsure if it will scale well to the dataframe I'm intending on doing this with. For context, I'm working with a pandas dataframe of 30-50k rows. I'll need to perform this process repeatedly until the rows are more or less exhausted.

Is there a more computationally efficient way to perform what I'm trying to do?

CodePudding user response：

IIUC, you want to randomly select a row from your dataframe and remove it.

Essentially, we use np.random.choice to pick index labels at random from the available ones (np.random.seed only makes the choice reproducible), and then drop those labels from the dataframe.

You can use the following example to do that:

import numpy as np, pandas as pd

np.random.seed(8)

remove_n = 2
df = pd.DataFrame({"a":[1,2,3,4,5], "b":[6,7,8,9,10]})
idx_to_drop = np.random.choice(df.index, remove_n, replace=False)

# In place
df.drop(idx_to_drop, inplace=True, axis=0)
print(df)

# Not in place (returns a new dataframe instead):
# res = df.drop(idx_to_drop)
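
If you also need the removed rows themselves (the "pop" from the question), the same idea can be wrapped in a small helper. This is only a sketch: pop_random_rows is a made-up name, and it assumes the index labels are unique, since drop removes every row that carries a given label.

```python
import pandas as pd

def pop_random_rows(df, n=1, seed=None):
    """Hypothetical helper: return (sampled_rows, remaining_rows).

    Assumes unique index labels; does not modify df itself.
    """
    sampled = df.sample(n=n, random_state=seed)  # rows chosen without replacement
    remaining = df.drop(sampled.index)           # everything that was not chosen
    return sampled, remaining

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]})

# Rebind df to the remaining rows to get the "popped" behaviour
popped, df = pop_random_rows(df, n=2, seed=8)
print(popped)
print(df)
```

Rebinding df instead of using inplace=True keeps the helper free of side effects, which tends to be easier to reason about when it is called in a loop.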

CodePudding user response：

You can use drop:

# random_index is a position, so translate it to an index label before dropping
popped = df.iloc[random_index]
df.drop(df.index[random_index], inplace=True)

Output:

>>> random_index
0

>>> popped
a    1
b    2
c    3
Name: 0, dtype: int64

>>> df
   a  b  c
1  4  5  6
2  7  8  9
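
Since the question mentions repeating this until the rows run out, note that positions and index labels stop matching after the first drop. A rough sketch of the loop, assuming unique index labels and using NumPy's default_rng (my choice here, not something from the question) to pick a position in whatever is left:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the run reproducible
df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

popped_rows = []
while len(df) > 0:
    pos = int(rng.integers(len(df)))  # position within the *current* dataframe
    label = df.index[pos]             # translate the position to an index label
    popped_rows.append(df.loc[label]) # keep the row as a Series
    df = df.drop(label)               # drop by label, not by position

print(len(popped_rows), "rows popped;", len(df), "rows left")
```

Each drop builds a new dataframe, so doing this one row at a time over 30-50k rows is likely to be slow; it is shown mainly to make the position/label distinction concrete.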

CodePudding user response：

You can use sklearn's shuffle:

from sklearn.utils import shuffle

df2 = shuffle(df)

This will randomly shuffle all the rows in the dataframe, so you could then loop through the rows for what you intend to do (and they will be randomly ordered), or keep the shuffled dataframe and continue with your code.

Using this method means you don't need to drop rows from your initial dataframe.
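
If you would rather not pull in sklearn just for this, pandas can do the same shuffle with sample(frac=1). A minimal sketch, with random_state fixed only so the run is reproducible:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# frac=1 returns every row exactly once, in random order; df itself is untouched
shuffled = df.sample(frac=1, random_state=0)

consumed = []
for row in shuffled.itertuples(index=True):
    # row.Index is the original index label; row.a, row.b, row.c are the values
    consumed.append(row.Index)

print(consumed)
```

Shuffling once and then walking through the result visits every row exactly once in random order, which has the same effect as popping random rows until the dataframe is empty, without paying for a drop on each step.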