Python random sample from dataframe with given characteristics

Time:12-01

I have a dataframe df of patients, identified by subject_id, that includes their gender and age.

I would like to draw a random sample of size n from this dataframe, with the following characteristics:

  • 50% male, 50% female
  • Median age of 40 years

Any idea how I could accomplish that using Python? Thank you!

CodePudding user response:

I think what you want is a little bit more complex than what DataFrame.sample provides out of the box. A random sample satisfying each of your conditions could be generated (respectively) like this:

  1. Filter for women only and randomly sample n/2, then do the same for men, and pool the two samples.
  2. Filter for patients under 40 and randomly sample n/2, then do the same for patients over 40, and combine them. (Note that this puts the median close to 40 but does not guarantee a median of exactly 40.)
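The two steps above could be sketched roughly as follows. The toy data and the helper name sample_halves are made up for illustration, and treating patients aged exactly 40 as part of the older group is an assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject_id": np.arange(1000),
    "gender": rng.choice(["F", "M"], size=1000),
    "age": rng.integers(18, 90, size=1000),
})


def sample_halves(df, mask, n, seed=None):
    """Draw n // 2 rows where mask is True and n // 2 rows where it is False."""
    half = n // 2
    first = df[mask].sample(half, random_state=seed)
    second = df[~mask].sample(half, random_state=seed)
    return pd.concat([first, second])


n = 100
by_gender = sample_halves(df, df["gender"] == "F", n, seed=1)
# Assumption: age == 40 falls into the "not under 40" half.
by_age = sample_halves(df, df["age"] < 40, n, seed=1)

print(by_gender["gender"].value_counts())  # n // 2 rows per gender
print(by_age["age"].median())              # near 40, not necessarily exactly 40
```

With an even n, the median of by_age is the average of the oldest sampled patient under 40 and the youngest sampled patient aged 40 or more, which is why it lands near 40 rather than exactly on it.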

If you want to combine the two constraints, you might need to sample four times, drawing n/4 from each group: women under 40, men under 40, etc. But this is the general idea.

Code for sampling would look like:

# n // 2 is integer division; sample() expects an integer number of rows
df.loc[df.age < 40, 'subject_id'].sample(n // 2)
df.loc[df.gender == 'F', 'subject_id'].sample(n // 2)
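For the combined case, one possible sketch is to draw the same number of rows from each gender and age-group cell and pool the results. The function name sample_balanced, the age_cut parameter, and the toy data are hypothetical; the gender codes 'F' and 'M' and the choice to count age 40 in the older group are assumptions as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
rng = np.random.default_rng(42)
patients = pd.DataFrame({
    "subject_id": np.arange(2000),
    "gender": rng.choice(["F", "M"], size=2000),
    "age": rng.integers(18, 90, size=2000),
})


def sample_balanced(df, n, age_cut=40, seed=None):
    """Draw n // 4 rows from each of the four gender x age-group cells."""
    per_cell = n // 4
    younger = df["age"] < age_cut
    parts = []
    for gender in ("F", "M"):
        for age_mask in (younger, ~younger):
            cell = df[(df["gender"] == gender) & age_mask]
            # sample() raises ValueError if the cell has fewer than per_cell rows.
            parts.append(cell.sample(per_cell, random_state=seed))
    return pd.concat(parts)


balanced = sample_balanced(patients, n=100, seed=0)
print(balanced["gender"].value_counts())
print(balanced["age"].median())
```

This yields an exact 50/50 gender split when n is a multiple of 4, while the median age is only pushed toward the cut-off. If a cell is too small, sampling from it fails, so n is limited by the smallest of the four groups.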