How to down sample a dataframe in Python based on condition

Time:11-15

I am new here, so I don't know how to use this site.

I have time series data for 37404 ICU patients. Each patient has multiple rows. I want to down sample my dataframe and select only 2932 patients (keeping all rows for each selected patient ID). Can anyone help me? My data looks like this:

HR SBP DBP Sepsis P_ID
92 120 80 0 0
98 115 85 0 0
93 125 75 0 1
95 130 90 0 1
102 120 80 0 1
109 115 75 0 2
94 135 100 0 2
97 100 70 0 3
85 120 80 0 4
88 115 75 0 4
93 125 85 0 4
78 130 90 0 5
115 140 110 0 5
102 120 80 0 5
98 140 110 0 5

I know I should use some condition on P_ID column, but I am confused.

Thanks for the help.

CodePudding user response:

Use numpy.random.choice to pick random P_ID values without replacement, then filter with Series.isin and boolean indexing:

import numpy as np

df2 = df[df['P_ID'].isin(np.random.choice(df['P_ID'].unique(), size=2932, replace=False))]

Alternative, using pandas only:

df2 = df[df['P_ID'].isin(df['P_ID'].drop_duplicates().sample(n=2932))]
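
Both one-liners can be tried on the sample data from the question. This is only a minimal sketch: it assumes 3 patients instead of 2932 because the sample has just 6, and the seeded generator and random_state arguments are additions for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (15 rows, 6 patients).
df = pd.DataFrame({
    "HR":  [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    "SBP": [120, 115, 125, 130, 120, 115, 135, 100, 120, 115, 125, 130, 140, 120, 140],
    "DBP": [80, 85, 75, 90, 80, 75, 100, 70, 80, 75, 85, 90, 110, 80, 110],
    "Sepsis": [0] * 15,
    "P_ID": [0, 0, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5],
})

n_patients = 3  # assumption for the toy data; the question needs 2932

# Option 1: NumPy picks the IDs (the seeded generator is an added assumption).
rng = np.random.default_rng(0)
ids_np = rng.choice(df["P_ID"].unique(), size=n_patients, replace=False)
df_np = df[df["P_ID"].isin(ids_np)]

# Option 2: pandas picks the IDs; random_state makes the draw repeatable.
ids_pd = df["P_ID"].drop_duplicates().sample(n=n_patients, random_state=0)
df_pd = df[df["P_ID"].isin(ids_pd)]

# Each result holds exactly n_patients patients, with all of their rows.
print(df_np["P_ID"].nunique(), df_pd["P_ID"].nunique())
```

Because the filter is a boolean mask on the original frame, the selected rows stay in their original order and every chosen patient keeps all of their rows.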

EDIT: To also get the selected patients back in random order (rather than in their original positions), use:

df1 = df['P_ID'].drop_duplicates().sample(n=2932).to_frame('P_ID')

df2 = df.merge(df1, how='right')
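
A small sketch of what the right merge does, on toy data. The fixed list of IDs below is a hypothetical stand-in for the sampled IDs so the result is predictable; in real use it would come from sample(n=2932) as above.

```python
import pandas as pd

# Toy frame with three rows for patient 4, two each for patients 0 and 2.
df = pd.DataFrame({
    "HR":   [92, 98, 93, 109, 94, 85, 88, 93],
    "P_ID": [0, 0, 1, 2, 2, 4, 4, 4],
})

# Hypothetical stand-in for the sampled IDs, already in "shuffled" order.
df1 = pd.DataFrame({"P_ID": [4, 0, 2]})

# how='right' keeps only keys from df1 and follows df1's key order.
df2 = df.merge(df1, how="right")

print(df2["P_ID"].tolist())  # [4, 4, 4, 0, 0, 2, 2]
```

The merge joins on the only shared column, P_ID, so patient 1 is dropped and the remaining patients appear in the order given by df1 rather than in their original order.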