I have a dataframe df with patients (subject_id), including their gender and their age.
I would like to draw a random sample of size n from this dataframe, with the following characteristics:
- 50% male, 50% female
- Median age of 40 years
Any idea how I could accomplish that using python? Thank you!
Answer:
I think what you want is a little bit more complex than what DataFrame.sample provides out of the box. A random sample satisfying each of your conditions could be generated (respectively) like this:
- Filter for women only and randomly sample n/2, then do the same for men, and then pool them.
- Filter for under-40s and randomly sample n/2, then do the same for over-40s, and then combine them. (Though note that this does not guarantee a median of exactly 40.)
If you want to combine the two constraints, you might need to sample 4 times - women under 40, men under 40, etc. But this is the general idea.
Code for sampling would look like:
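# Half of the sample from patients under 40 (n // 2 keeps the count an integer, which sample() requires)
df.loc[df.age < 40, 'subject_id'].sample(n // 2)
# Half of the sample from female patients
df.loc[df.gender == 'F', 'subject_id'].sample(n // 2)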
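To combine both constraints, the four-way split mentioned above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name stratified_sample, the 'F'/'M' gender coding, the choice to put patients aged exactly 40 in the upper group, and the toy dataframe are all made up for the example and are not from the question. It also assumes each of the four groups contains at least n // 4 patients; otherwise sample() raises a ValueError.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, n, seed=None):
    """Draw n // 4 rows from each gender x age-group cell and pool them.

    Hypothetical helper: assumes columns 'gender' ('F'/'M') and 'age'.
    """
    per_cell = n // 4  # integer count per cell; n should be a multiple of 4
    rng = np.random.RandomState(seed)  # one shared generator for all draws
    cells = [
        (df.gender == "F") & (df.age < 40),
        (df.gender == "F") & (df.age >= 40),  # assumption: age 40 counts as "upper"
        (df.gender == "M") & (df.age < 40),
        (df.gender == "M") & (df.age >= 40),
    ]
    parts = [df[mask].sample(n=per_cell, random_state=rng) for mask in cells]
    return pd.concat(parts)


# Toy data for illustration only: 200 patients, ages 20 to 59.
df = pd.DataFrame({
    "subject_id": range(200),
    "gender": ["F", "M"] * 100,
    "age": [20 + (i % 40) for i in range(200)],
})

sample = stratified_sample(df, n=40, seed=0)
print(sample.gender.value_counts())
print(sample.age.median())
```

With equal counts below and at-or-above 40, the sample median falls between the oldest sampled under-40 and the youngest sampled patient aged 40 or more, so it should be close to 40 but, as noted above, is not guaranteed to be exactly 40.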