I am using Kaggle sample data. As shown below, 40% of the rows have location CA and 47% have category FOODS. What I am trying to achieve is to randomly select data from this data frame while more or less preserving the distribution of values in these two columns. Does Python/pandas have such a capability?
>>> df = pd.read_parquet("~/dimension.parquet")
>>> df.groupby('location')['location'].count().transform(lambda x: x/x.sum())
location
CA 0.4
TX 0.3
WI 0.3
>>> df.groupby('category')['category'].count().transform(lambda x: x/x.sum())
category
FOODS 0.471302
HOBBIES 0.185307
HOUSEHOLD 0.343391
CodePudding user response:
You can select a fraction of each group with groupby.sample:
# selecting 10% of each group
df.groupby(['location', 'category']).sample(frac=0.1)
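If you need the sample to hit an exact total size rather than a per-group fraction, one option (a minimal sketch, not from the original answer; the target of 1,000 rows and the random_state are arbitrary) is to allocate a proportional row count to each group and sample that many rows from it:
# sketch: proportional allocation to an exact total of roughly 1,000 rows
target = 1_000
out = (
    df.groupby(['location', 'category'], group_keys=False)
      .apply(lambda g: g.sample(n=round(len(g) / len(df) * target), random_state=0))
)
# per-group rounding means the final length can be off by a few rows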
But if your data is large and you select a decent number of rows, a plain random sample should naturally preserve the proportions:
df.sample(n=1000)
For example, let's pick 500 (0.05%) or 5,000 (0.5%) rows out of 1M rows with defined frequencies:
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1_000_000
df = pd.DataFrame({'location': np.random.choice(['CA', 'TX', 'WI'], p=[0.4, 0.3, 0.3], size=n),
                   'category': np.random.choice(['A', 'B', 'C'], p=[0.85, 0.1, 0.05], size=n)})
out = df.sample(n=500)
out['location'].value_counts(normalize=True)
CA 0.388
TX 0.312
WI 0.300
Name: location, dtype: float64
out['category'].value_counts(normalize=True)
A 0.822
B 0.126
C 0.052
Name: category, dtype: float64
With df.sample(n=5000):
CA 0.3984
TX 0.3064
WI 0.2952
Name: location, dtype: float64
A 0.8468
B 0.1042
C 0.0490
Name: category, dtype: float64
Frequencies of the original population:
CA 0.399295
WI 0.300520
TX 0.300185
Name: location, dtype: float64
A 0.850125
B 0.099679
C 0.050196
Name: category, dtype: float64
Both samples are fairly representative of the original population, though there is some loss of precision with the smaller sample.
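To put a rough number on that loss of precision (a back-of-the-envelope sketch, not part of the original answer), the standard error of a sampled proportion is sqrt(p*(1-p)/n); for p = 0.4 it is about 0.022 at 500 rows and 0.007 at 5,000 rows:
import math

for n_s in (500, 5_000):
    se = math.sqrt(0.4 * (1 - 0.4) / n_s)   # standard error of the CA proportion
    print(n_s, round(se, 4))                # 500 -> 0.0219, 5000 -> 0.0069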
In contrast, groupby.sample maintains close to the original proportions even with very small samples (here ~200 rows, or 0.02%):
out2 = df.groupby(['location', 'category']).sample(frac=0.0002)
print(out2['location'].value_counts(normalize=True))
print(out2['category'].value_counts(normalize=True))
len(out2)
CA 0.4
TX 0.3
WI 0.3
Name: location, dtype: float64
A 0.85
B 0.10
C 0.05
Name: category, dtype: float64
200
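As an alternative outside pandas (assuming scikit-learn is available; the 5,000-row target and the combined key are just illustrative), train_test_split can draw an exact-size stratified sample in one call:
from sklearn.model_selection import train_test_split

# combine the two columns so the split is stratified on their joint distribution
strata = df['location'] + '_' + df['category']
sample, _ = train_test_split(df, train_size=5_000, stratify=strata, random_state=0)
print(sample['location'].value_counts(normalize=True))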