I have a pandas dataframe of 1000 elements, with value counts shown below. I would like to sample from this dataset in a way that the value counts follow a long-tailed distribution. For example, to maintain the long-tailed distribution, sample4 may only end up with a value count of 400.
a
sample1 750
sample2 746
sample3 699
sample4 652
sample5 622
...
sample996 4
sample997 3
sample998 2
sample999 2
sample1000 1
I tried using this code:
import numpy as np
# Calculate the frequency of each element in column 'a'
freq = df['a'].value_counts()
# Calculate the probability of selecting each element based on its frequency
prob = freq / freq.sum()
# Sample from the df dataframe without replacement
df_sampled = df.sample(n=len(df), replace=False, weights=prob.tolist())
However, I get this error: ValueError: Weights and axis to be sampled must be of same length.
CodePudding user response:
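Your column has duplicated values, so value_counts returns one entry per unique value, not one per row. sample needs one weight for every row, so you have to compute prob for all rows. Use groupby with transform('count') instead of value_counts.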
freq = df.groupby('a')['a'].transform('count')
prob = freq / len(df)
df_sampled = df.sample(n=len(df), replace=False, weights=prob.tolist())
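To see why the lengths differ, here is a minimal self-contained sketch. The toy data, the name row_weights, and the choice of n=40 are assumptions made for illustration; they do not come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'a' with skewed value counts (100 rows total).
values = [f"sample{i}" for i in range(1, 6)]
counts = [50, 30, 12, 6, 2]
df = pd.DataFrame({"a": np.repeat(values, counts)})

# value_counts() yields one entry per unique value, so its length
# does not match the number of rows.
freq_per_value = df["a"].value_counts()
print(len(freq_per_value), len(df))

# transform('count') broadcasts each group's count back to every row,
# giving one weight per row, aligned on the dataframe index.
row_weights = df.groupby("a")["a"].transform("count")

# Draw fewer rows than the dataframe holds (n=40 is an arbitrary choice).
# pandas normalizes the weights, so they do not need to sum to 1.
df_sampled = df.sample(n=40, replace=False, weights=row_weights, random_state=0)
print(df_sampled["a"].value_counts())
```

Note that with replace=False and n=len(df), every row is returned exactly once, only shuffled, so the value counts stay the same as in the original. The weights only change the resulting counts when n is smaller than len(df) or when sampling with replacement.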