I am using Kaggle sample data. As shown below, 40% of the rows have location CA and 47% have category FOODS. What I am trying to achieve is to randomly select data from this data frame while more or less preserving the distribution of values in these two columns. Does Python/pandas have such a capability?
>>> df = pd.read_parquet("~/dimension.parquet")
>>> df.groupby('location')['location'].count().transform(lambda x: x/x.sum())
location
CA 0.4
TX 0.3
WI 0.3
>>> df.groupby('category')['category'].count().transform(lambda x: x/x.sum())
category
FOODS 0.471302
HOBBIES 0.185307
HOUSEHOLD 0.343391
CodePudding user response:
You can select a fraction of each group with groupby.sample:
# selecting 10% of each group
df.groupby(['location', 'category']).sample(frac=0.1)
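If you need the sample to hit an exact total size rather than a per-group fraction, one option (a minimal sketch, not from the original answer; the target of 1,000 rows and the random_state are arbitrary) is to allocate a proportional row count to each group and sample that many rows from it:
# sketch: proportional allocation to an exact total of roughly 1,000 rows
target = 1_000
out = (
    df.groupby(['location', 'category'], group_keys=False)
      .apply(lambda g: g.sample(n=round(len(g) / len(df) * target), random_state=0))
)
# per-group rounding means the final length can be off by a few rows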
But if your data is large and you select a decent number of rows, a plain random sample should naturally preserve the proportions:
df.sample(n=1000)
For example, let's pick 500 (0.05%) or 5,000 (0.5%) rows out of 1M rows with defined frequencies:
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1_000_000
df = pd.DataFrame({'location': np.random.choice(['CA', 'TX', 'WI'], p=[0.4, 0.3, 0.3], size=n),
                   'category': np.random.choice(['A', 'B', 'C'], p=[0.85, 0.1, 0.05], size=n)})
out = df.sample(n=500)
out['location'].value_counts(normalize=True)
CA 0.388
TX 0.312
WI 0.300
Name: location, dtype: float64
out['category'].value_counts(normalize=True)
A 0.822
B 0.126
C 0.052
Name: category, dtype: float64
With df.sample(n=5000):
CA 0.3984
TX 0.3064
WI 0.2952
Name: location, dtype: float64
A 0.8468
B 0.1042
C 0.0490
Name: category, dtype: float64
Frequencies of the original population:
CA 0.399295
WI 0.300520
TX 0.300185
Name: location, dtype: float64
A 0.850125
B 0.099679
C 0.050196
Name: category, dtype: float64
Both samples are fairly representative of the original population, though there is some loss of precision with the smaller sample.
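To put a rough number on that loss of precision (a back-of-the-envelope sketch, not part of the original answer), the standard error of a sampled proportion is sqrt(p*(1-p)/n); for p = 0.4 it is about 0.022 at 500 rows and 0.007 at 5,000 rows:
import math

for n_s in (500, 5_000):
    se = math.sqrt(0.4 * (1 - 0.4) / n_s)   # standard error of the CA proportion
    print(n_s, round(se, 4))                # 500 -> 0.0219, 5000 -> 0.0069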
In contrast, groupby.sample maintains close to the original proportions even with very small samples (here ~200 rows, or 0.02%):
out2 = df.groupby(['location', 'category']).sample(frac=0.0002)
print(out2['location'].value_counts(normalize=True))
print(out2['category'].value_counts(normalize=True))
len(out2)
CA 0.4
TX 0.3
WI 0.3
Name: location, dtype: float64
A 0.85
B 0.10
C 0.05
Name: category, dtype: float64
200
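As an alternative outside pandas (assuming scikit-learn is available; the 5,000-row target and the combined key are just illustrative), train_test_split can draw an exact-size stratified sample in one call:
from sklearn.model_selection import train_test_split

# combine the two columns so the split is stratified on their joint distribution
strata = df['location'] + '_' + df['category']
sample, _ = train_test_split(df, train_size=5_000, stratify=strata, random_state=0)
print(sample['location'].value_counts(normalize=True))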