Is there a Python function for subsampling a large sample set to match the distribution of a variable

I have 2 sample data sets (pandas dataframes):

df_1 = 700 students
df_2 = 200 students

Both dataframes have the same columns:

student_id
height

I want to subset df_1 so that it also has 200 students, with the same height distribution as the students in df_2. I have the mean, std, min, and median of the df_2 students, if those can be used in some way.

CodePudding user response：

Well, I am not sure I understood you completely, but let me suggest two steps.

Subsampling

To create a random subsample from your dataset, you can use the following function, which returns a new dataframe (a copy) built from your initial df; the original is not mutated.

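import numpy as np

seed = 42
np.random.seed(seed)

def subsample(df, size: int):
    # Draw `size` distinct row positions so no student appears twice.
    assert 0 < size < len(df)
    subsample_indexes = np.random.choice(len(df), size=size, replace=False)
    # .copy() returns an independent dataframe; df itself is left untouched.
    return df.iloc[subsample_indexes, :].copy()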
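As a side note, pandas has this built in as DataFrame.sample, which draws rows without replacement by default. A minimal sketch, where the dataframe is a synthetic stand-in for df_1 (the heights are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_1: 700 students with made-up heights.
rng = np.random.default_rng(0)
df_1 = pd.DataFrame({
    "student_id": np.arange(700),
    "height": rng.normal(170, 10, size=700),
})

# 200 distinct rows, reproducible via random_state; df_1 itself is unchanged.
df_sub = df_1.sample(n=200, random_state=42)
```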

Same Distribution?

For this, you can use the subsampling function above: draw several subsamples (e.g. 50), compare each subsample's distribution to that of df_2, and keep the first one that is similar enough. In pseudo-code, it would look like this:

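import pandas as pd

def compare_distributions(df, df_compare, n_subsample = 50):
    # Try up to n_subsample random subsamples and stop at the first one
    # whose distribution is similar to that of df_compare.
    preserve_subsample = False
    n = 1
    while not preserve_subsample and n <= n_subsample:
        df_sub = subsample(df, 200)
        # check if the distributions are similar;
        # here you may conduct a hypothesis test and/or
        # look at some statistics (compare() is a placeholder you must supply)
        preserve_subsample = compare(df_sub, df_compare)
        n += 1
    if not preserve_subsample:
        # no subsample was similar enough: return an empty df
        return pd.DataFrame()
    return df_sub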
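The compare() call is left open above. One possible way to fill it in, as a sketch only, is the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. The helper name ks_statistic, the column argument, and the 0.1 threshold are my own assumptions, not something from the question; if SciPy is available, scipy.stats.ks_2samp computes the same statistic and also returns a p-value.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def compare(df_sub, df_compare, column="height", max_statistic=0.1):
    # max_statistic is an arbitrary illustrative threshold (assumption):
    # smaller values demand a closer match and reject more subsamples.
    return ks_statistic(df_sub[column], df_compare[column]) <= max_statistic
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all, so the threshold trades off match quality against how many subsamples you have to draw.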

CodePudding user response：

You could combine pd.qcut() and pd.cut() with df.sample().

For example, use pd.qcut() on df_2.height to get the bin edges that split it into 20 quantile bins of about 10 students each (pd.cut() with an integer bin count would give equal-width bins, whose counts differ). Then pass those edges to pd.cut() to bin df_1.height, iterate over the bins and, for each bin, use df.sample() to draw 10 students from the subset of df_1 that falls within that height bin. If df_1 has fewer than 10 students in any such bin, you might consider sampling with replacement, or reducing the number of bins from the start.

This will give you a random subset of df_1 with approximately the same height distribution as in df_2.
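The binning idea could be sketched roughly as follows. This is an illustration under a few assumptions of mine: the function name match_height_distribution and its parameters are made up, the bin edges come from pd.qcut (quantile bins, so each bin holds about the same number of df_2 students), bins where df_1 has too few students are sampled with replacement, and bins where df_1 has no students at all are skipped.

```python
import numpy as np
import pandas as pd

def match_height_distribution(df_1, df_2, n_bins=20, seed=0):
    rng = np.random.RandomState(seed)
    # Quantile bin edges taken from the target data (df_2).
    _, edges = pd.qcut(df_2["height"], q=n_bins, retbins=True, duplicates="drop")
    # Integer bin index per student; heights outside df_2's range become NaN.
    target_codes = pd.cut(df_2["height"], bins=edges, labels=False, include_lowest=True)
    source_codes = pd.cut(df_1["height"], bins=edges, labels=False, include_lowest=True)

    parts = []
    for code, n_wanted in target_codes.value_counts().items():
        candidates = df_1[source_codes == code]
        if candidates.empty:
            continue  # df_1 has nobody in this height bin
        # Fall back to sampling with replacement if the bin is too small.
        replace = len(candidates) < n_wanted
        parts.append(candidates.sample(n=int(n_wanted), replace=replace, random_state=rng))
    if not parts:
        return df_1.iloc[0:0].copy()
    return pd.concat(parts)
```

Calling match_height_distribution(df_1, df_2) then returns rows of df_1 only; students of df_1 who are shorter or taller than everyone in df_2 are never picked, and with sparse bins the result can contain repeated students or fewer rows than df_2.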