I am new here, so I don't know how to use this site.
I have time-series data for 37404 ICU patients. Each patient has multiple rows. I want to downsample my dataframe and select only 2932 patients (keeping all rows for each selected patient ID). Can anyone help me? My data looks like this:
HR | SBP | DBP | Sepsis | P_ID |
---|---|---|---|---|
92 | 120 | 80 | 0 | 0 |
98 | 115 | 85 | 0 | 0 |
93 | 125 | 75 | 0 | 1 |
95 | 130 | 90 | 0 | 1 |
102 | 120 | 80 | 0 | 1 |
109 | 115 | 75 | 0 | 2 |
94 | 135 | 100 | 0 | 2 |
97 | 100 | 70 | 0 | 3 |
85 | 120 | 80 | 0 | 4 |
88 | 115 | 75 | 0 | 4 |
93 | 125 | 85 | 0 | 4 |
78 | 130 | 90 | 0 | 5 |
115 | 140 | 110 | 0 | 5 |
102 | 120 | 80 | 0 | 5 |
98 | 140 | 110 | 0 | 5 |
I know I should use some condition on P_ID column, but I am confused.
Thanks for the help.
CodePudding user response:
Use numpy.random.choice to pick random P_ID values, then filter with Series.isin and boolean indexing:
import numpy as np
df2 = df[df['P_ID'].isin(np.random.choice(df['P_ID'].unique(), size=2932, replace=False))]
Alternative:
df2 = df[df['P_ID'].isin(df['P_ID'].drop_duplicates().sample(n=2932))]
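To see the isin filter on something small, here is a minimal sketch on toy data shaped like the question's table. The 6-patient frame, the `n_patients = 3` stand-in for 2932, and the seed value are all assumptions made for illustration. A seeded generator is used only so the same patients are drawn on every run.

```python
import numpy as np
import pandas as pd

# Toy data in the question's layout: 6 patients (P_ID 0..5), several rows each.
df = pd.DataFrame({
    "HR":   [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
})

n_patients = 3  # assumption: stands in for 2932 on the real data

# Seeded generator so the draw is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)
chosen_ids = rng.choice(df["P_ID"].unique(), size=n_patients, replace=False)

# Keep every row whose patient ID was drawn; the original row order is kept.
df2 = df[df["P_ID"].isin(chosen_ids)]

print(sorted(chosen_ids.tolist()))
print(df2)
```

Because the filter works on whole IDs, each selected patient keeps all of their rows, and patients that were not drawn disappear completely.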
EDIT: To also return the selected patients in random order (instead of their original order in df), use:
df1 = df['P_ID'].drop_duplicates().sample(n=2932).to_frame('P_ID')
df2 = df.merge(df1, how='right')
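A small sketch of what the right merge does to the ordering, again on made-up toy data. The frame, the sample size of 3, and `random_state=0` are assumptions for illustration. The `on="P_ID"` argument is spelled out here, although the shared column would be picked up automatically.

```python
import pandas as pd

# Toy data in the question's layout: 6 patients (P_ID 0..5), several rows each.
df = pd.DataFrame({
    "HR":   [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
})

# Draw 3 distinct patient IDs; random_state fixes the draw (value is arbitrary).
sampled = df["P_ID"].drop_duplicates().sample(n=3, random_state=0).to_frame("P_ID")

# A right join follows the key order of `sampled`, so the patients come out
# in the sampled (shuffled) order, each with all of their rows.
df2 = df.merge(sampled, how="right", on="P_ID")

print(sampled["P_ID"].tolist())                 # order in which IDs were drawn
print(df2["P_ID"].drop_duplicates().tolist())   # same order in the result
```

Note that the merge returns a fresh 0..n-1 index, whereas the isin filter keeps the original row labels of df.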