How to sample based on long-tail distribution from a pandas dataframe?

I have a pandas dataframe in which column a takes 1000 distinct values, with the value counts shown below. I would like to sample from this dataset so that the value counts of the sample still follow a long-tailed distribution. For example, to maintain the long-tailed shape, sample4 might end up with a value count of only 400.

                           a
 sample1                  750
 sample2                  746
 sample3                  699
 sample4                  652
 sample5                  622
                          ... 
 sample996                  4
 sample997                  3
 sample998                  2
 sample999                  2
 sample1000                 1

I tried using this code:

import numpy as np

# Calculate the frequency of each distinct value in column 'a'
freq = df['a'].value_counts()

# Calculate the probability of selecting each element based on its frequency
prob = freq / freq.sum()

# Sample from df without replacement, weighted by those probabilities
df_sampled = df.sample(n=len(df), replace=False, weights=prob.tolist())

However, this raises ValueError: Weights and axis to be sampled must be of same length.

CodePudding user response:

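Your column contains duplicated values, so value_counts returns one entry per distinct value, while the weights argument of sample needs one entry per row. That length mismatch is what triggers the ValueError. Compute a weight for every row instead, using groupby with transform('count') rather than value_counts.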

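# Per-row frequency: each row gets the count of its own value in column 'a'
freq = df.groupby('a')['a'].transform('count')
# Scale by the number of rows; df.sample normalizes weights that do not sum to 1
prob = freq / len(df)
df_sampled = df.sample(n=len(df), replace=False, weights=prob.tolist())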
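
To see the difference between the two approaches, here is a minimal, self-contained sketch. The toy data (five values with made-up counts), the name row_weights, the sample size of 300, and random_state=0 are all assumptions chosen for illustration and are not from the question. One thing to be aware of: with n=len(df) and replace=False, sample returns every row in shuffled order, so the value counts stay exactly as they were. The sketch therefore draws fewer rows than the frame holds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'a' holds 5 distinct values with long-tailed counts.
counts = {"sample1": 500, "sample2": 250, "sample3": 120, "sample4": 60, "sample5": 20}
df = pd.DataFrame({"a": np.repeat(list(counts.keys()), list(counts.values()))})

# value_counts gives one entry per distinct value, so its length differs
# from the number of rows and it cannot be passed directly as row weights.
print(len(df["a"].value_counts()), len(df))

# transform('count') broadcasts each group's size back onto every row,
# so the result has exactly one weight per row.
row_weights = df.groupby("a")["a"].transform("count")

# Draw fewer rows than the frame holds (300 is an arbitrary choice);
# rows whose value is frequent are more likely to be picked.
df_sampled = df.sample(n=300, replace=False, weights=row_weights, random_state=0)

print(df_sampled["a"].value_counts())
```

Passing the weights as a Series works here because it shares its index with df. How strongly the tail is thinned depends on the weights and on n, so it is worth comparing the value counts before and after sampling to check that the result has the shape you want.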