I would like to random select rows proportionate to the number of unique values in column "ID" grouping by column "Team". Further, I would like to only retrieve 8 total rows. I have:
| ID | Team | Color |
| ----- | ----- | ------------ |
| 1 | A | Blue |
| 2 | B | Red |
| 2 | B | Green |
| 3 | A | Blue |
| 6 | C | Red |
| 1 | B | Yellow |
| 2 | B | Green |
| 9 | A | Blue |
| 6 | C | Red |
| 1 | B | Yellow |
| 9 | A | Blue |
| 1 | A | Purple |
Only the proportions are looking at unique values. The rows pulled do not necessarily need to be unique in anyway. Using the above table the proportions would be:
| Team | Unique IDs | Proportion | Number selected |
| ------ | ---------- | ----------- | ---------------- |
| A | 3 | 0.500 | 4 |
| B | 2 | 0.333 | 3 |
| C | 1 | 0.167 | 1 |
So since I want 8 total rows selected proportionately, I should end up with something like the following:
| ID | Team | Color |
| ----- | ----- | ------------ |
| 1 | A | Blue |
| 3 | A | Blue |
| 9 | A | Blue |
| 1 | A | Purple |
| 2 | B | Green |
| 2 | B | Red |
| 1 | B | Yellow |
| 6 | C | Red |
CodePudding user response:
- Calculate unique_counts - the number of unique
'ID'
s in each'Team'
group, - convert unique_counts into nums_selected - the number of elements for each group that have to be selected
- use nums_selected to
.sample
this many elements from each'Team'
group:
n_total = 8
unique_counts = df.groupby('Team')['ID'].agg('nunique')
nums_selected= np.floor(unique_counts / unique_counts.sum() * n_total).astype(int) # rounded down
df.groupby('Team', group_keys=False).apply( # for each 'Team' group:
lambda x: x.sample(n=nums_selected[x.name], # sample this many rows
replace=True) # (with replacement)
)
Note:
The result can contain less elements than n_total because nums_selected are rounded down when converting from
float
toint
. However, you may use any method to do this conversion:np.ceil
,pd.Series.round
, or any other method you choose to your liking.