I have 2 sample data sets (pandas dataframes):
- df_1 = 700 students
- df_2 = 200 students
Each of the dataframes has the same columns:
- student_id
- height
I want to subset df_1 so that it also has 200 students, with the same height distribution as the students in df_2. I have the mean, std, min, and median of the df_2 students if I can use that in some way.
CodePudding user response:
Well, I am not sure if I got you completely right, but let me suggest two steps.
Subsampling
To create a random subsample from your dataset, you can use the following function, which returns a new dataframe (a copy) built from your initial df (the original will not be mutated).
import numpy as np
import pandas as pd

seed = 42
np.random.seed(seed)

def subsample(df: pd.DataFrame, size: int) -> pd.DataFrame:
    # draw row positions without replacement so no student is picked twice
    assert 0 < size < len(df)
    subsample_indexes = np.random.choice(len(df), size, replace=False)
    return df.iloc[subsample_indexes, :].copy()
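For example, assuming your dataframes are named df_1 and df_2 as in the question:

df_1_sub = subsample(df_1, 200)
print(df_1_sub['height'].describe())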
Same Distribution?
For this, I suggest you use the subsampling function above, make some iterations (e.g. 50 subsamples), and compare each subsample's distribution to the distribution of df_2. Pseudo-code would look like this:
def compare_distributions(df, df_compare, n_subsample=50):
    preserve_subsample = False
    n = 1
    # keep drawing subsamples until one matches or we hit the iteration limit
    while not preserve_subsample and n < n_subsample:
        df_sub = subsample(df, 200)
        # check if the distributions are similar;
        # here you may conduct a hypothesis test and/or
        # look at some statistics (compare() is a placeholder
        # you define yourself, see the sketch below)
        preserve_subsample = compare(df_sub, df_compare)
        n += 1
    if not preserve_subsample:
        # no subsample matched, return an empty df
        return pd.DataFrame()
    return df_sub
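As one way to fill in the compare placeholder (just a sketch, assuming scipy is available and that a significance level of 0.05 is acceptable for you), you could run a two-sample Kolmogorov-Smirnov test on the height columns and keep the subsample only if the test does not reject equal distributions:

from scipy.stats import ks_2samp

def compare(df_sub, df_compare, alpha=0.05):
    # two-sample KS test: a small p-value suggests the two
    # height distributions differ
    statistic, p_value = ks_2samp(df_sub['height'], df_compare['height'])
    return p_value > alpha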
CodePudding user response:
You could combine pd.cut() with df.sample().
For example, use pd.cut() to split df_2.height into, say, 20 height bins and keep the bin edges. Then iterate over those bins and, for each bin, use df.sample() to draw as many students from the matching subset of df_1 as df_2 has in that bin (with 20 roughly even bins that is about 10 students per bin). If there are fewer students in df_1 than you need in any such bin, you might consider sampling with replacement, or reducing the number of bins from the start.
This will give you a random subset of df_1 with approximately the same height distribution as in df_2.
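A minimal sketch of this binned sampling, assuming the dataframes are named df_1 and df_2 as in the question and that every height bin of df_1 holds at least as many students as the corresponding bin of df_2, could look like this:

import pandas as pd

n_bins = 20

# bin df_2.height and keep the bin edges so df_1 can be cut the same way
df_2_bins, bin_edges = pd.cut(df_2['height'], bins=n_bins, retbins=True)

# how many students df_2 has in each height bin
target_counts = df_2_bins.value_counts().sort_index()

# assign each df_1 student to the same bins; heights outside the df_2
# range end up as NaN and are simply never sampled
df_1_bins = pd.cut(df_1['height'], bins=bin_edges)

sampled_parts = []
for height_bin, count in target_counts.items():
    in_bin = df_1[df_1_bins == height_bin]
    # draw as many students from this bin as df_2 has in it;
    # use replace=True here if a bin of df_1 is too small
    sampled_parts.append(in_bin.sample(n=count, random_state=42))

df_1_sub = pd.concat(sampled_parts)

df_1_sub then holds 200 students from df_1 whose per-bin height counts match those of df_2.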