I would like to sample rows by group from a database. But the size of each sample must be proportional to the number of rows in each group.
It concerns a list of projects implemented in different countries and over different years (these are my groups). And I would like to sample projects from the list, proportional to the total number of projects in each group.
The table below shows the count and proportion of projects implemented.
Hence for instance I want to sample 2 projects out of the 10 projects implemented in Burkina Faso in 2016.
I'm trying with the .sample()
function and the .groupby()
function, but I don't know how to use these two together?
CodePudding user response:
If df1
is DataFrame fro picture and df
is original DataFrame use DataFrame.join
:
df = df.join(df1['Percent of Project'],
on=['Initial Financial Year','Area of Intervention'])
Then use GroupBy.apply
with lambda function and DataFrame.sample
:
f = lambda x: x.sample(x['Percent of Project'].iat[0])
df = (df.groupby(['Initial Financial Year','Area of Intervention'], group_keys=False)
.apply(f))