In my data frame I have three months of daily data. The number of samples differs from day to day: for example, 1 January has 20K rows and 2 January has 15K.
I want to compute the mean number of rows per day and then sample that many rows from each day. For example, if the mean is 8K, I want 8K random rows from the 1 January data, 8K random rows from 2 January, and so on.
As far as I know, rand() draws random values from the whole data frame, but I need to sample per day. The date is stored in a column of the data frame. Thanks
CodePudding user response:
You can use groupby(...).sample after computing the mean number of records per date:
# Suppose 'date' is the name of your column
sample = df.groupby('date').sample(n=int(df['date'].value_counts().mean()))
# Or
g = df.groupby('date')
sample = g.sample(n=int(g.size().mean()))
Update
Is there any solution for dates whose row count is lower than the mean? For those dates I get this error: Cannot take a larger sample than population when 'replace=False'
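import numpy as np
# Sample with replacement so small dates do not raise, then drop rows
# that were drawn more than once (this assumes df has a unique index)
n = np.floor(df['date'].value_counts().mean()).astype(int)
sample = (df.groupby('date').sample(n, replace=True)
            .loc[lambda x: ~x.index.duplicated()])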
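One side effect of sampling with replacement and then dropping duplicates is that a date can end up with fewer than n distinct rows even when it has more than n rows available. A minimal alternative sketch, not part of the original answer, shuffles the whole frame once and keeps the first n rows of each date, so small dates contribute all of their rows and larger dates contribute exactly n. The toy data, the 'value' column and the random_state are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): three dates with 6, 3 and 9 rows.
df = pd.DataFrame({
    "date": ["2022-01-01"] * 6 + ["2022-01-02"] * 3 + ["2022-01-03"] * 9,
    "value": range(18),
})

# Target sample size: mean number of rows per date, rounded down.
n = int(df.groupby("date").size().mean())  # (6 + 3 + 9) / 3 = 6

# Shuffle all rows, then keep at most n rows per date.
# head(n) returns the whole group when it has fewer than n rows.
sample = df.sample(frac=1, random_state=0).groupby("date").head(n)

# Rows kept per date: 6, 3 and 6.
print(sample.groupby("date").size())
```

If the original row order matters, calling sample.sort_index() afterwards restores it.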