How to downsample a DataFrame in Python based on a condition

I am new here so don't know how to use this site.

I have time-series data for 37404 ICU patients. Each patient has multiple rows. I want to downsample my dataframe and select only 2932 patients, keeping all rows for each selected patient ID. Can anyone help me? My data looks like this:

HR	SBP	DBP	P_ID
92	120	80	0
98	115	85	0
93	125	75	1
95	130	90	1
102	120	80	1
109	115	75	2
94	135	100	2
97	100	70	3
85	120	80	4
88	115	75	4
93	125	85	4
78	130	90	5
115	140	110	5
102	120	80	5
98	140	110	5

I know I should use some condition on the P_ID column, but I am not sure how.

Thanks for the help.

CodePudding user response:

Use numpy.random.choice to pick 2932 random P_ID values without replacement, then keep the matching rows with Series.isin and boolean indexing:

import numpy as np

df2 = df[df['P_ID'].isin(np.random.choice(df['P_ID'].unique(), size=2932, replace=False))]
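As a rough illustration, here is the same idea run on the sample data from the question. The number of patients (3 instead of 2932), the seeded generator, and the names n_patients, rng and chosen_ids are assumptions made only so the sketch is small and repeatable.

```python
import numpy as np
import pandas as pd

# Sample data from the question (HR, SBP, DBP, P_ID).
df = pd.DataFrame({
    "HR":  [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    "SBP": [120, 115, 125, 130, 120, 115, 135, 100, 120, 115, 125, 130, 140, 120, 140],
    "DBP": [80, 85, 75, 90, 80, 75, 100, 70, 80, 75, 85, 90, 110, 80, 110],
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
})

n_patients = 3                    # assumption; the question needs 2932
rng = np.random.default_rng(42)   # assumption: seeded so the draw is repeatable

# Draw patient IDs without replacement, then keep every row of those patients.
chosen_ids = rng.choice(df["P_ID"].unique(), size=n_patients, replace=False)
df2 = df[df["P_ID"].isin(chosen_ids)]

print(sorted(chosen_ids))
print(df2)
```

The rows stay in their original order and keep their original index, so each selected patient's time series remains intact.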

Alternative using pandas only (drop_duplicates leaves one entry per patient, and sample picks 2932 of them):

df2 = df[df['P_ID'].isin(df['P_ID'].drop_duplicates().sample(n=2932))]
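If the selection has to be repeatable, one possible way to package the pandas-only variant is a small helper. The function name sample_patients and its parameters are hypothetical, not part of pandas; random_state is the standard argument of Series.sample.

```python
import pandas as pd

def sample_patients(df, n, id_col="P_ID", seed=None):
    """Keep all rows of n randomly chosen IDs (hypothetical helper)."""
    # One entry per patient, then n of them drawn without replacement.
    ids = df[id_col].drop_duplicates().sample(n=n, random_state=seed)
    return df[df[id_col].isin(ids)]

# Shortened version of the sample data from the question.
df = pd.DataFrame({
    "HR":   [92, 98, 93, 95, 102, 109, 94, 97],
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3],
})

subset = sample_patients(df, n=2, seed=0)
print(subset)
```

Series.sample raises a ValueError when n is larger than the number of distinct patients, which is a useful sanity check before running it with n=2932.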

EDIT: To also return the selected patients in random order, use a right merge:

df1 = df['P_ID'].drop_duplicates().sample(n=2932).to_frame('P_ID')

df2 = df.merge(df1, how='right')
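A minimal sketch of the merge variant on a shortened copy of the sample data follows. The sample size of 2, the random_state value and the explicit on="P_ID" are assumptions added for clarity; without on, merge joins on the columns the two frames share, which here is also P_ID.

```python
import pandas as pd

# Shortened version of the sample data from the question.
df = pd.DataFrame({
    "HR":   [92, 98, 93, 95, 102, 109, 94],
    "P_ID": [0, 0, 1, 1, 1, 2, 2],
})

# One row per sampled patient; random_state is an assumption for repeatability.
df1 = df["P_ID"].drop_duplicates().sample(n=2, random_state=1).to_frame("P_ID")

# Right join on P_ID: only rows belonging to the sampled patients survive.
df2 = df.merge(df1, on="P_ID", how="right")
print(df2)
```

The pandas documentation describes a right join as preserving the key order of the right frame, so the patients should appear in the sampled order. The merged frame gets a fresh integer index, so the original row labels are not kept.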
