I have a dataset with multiple rows of data for each ID. There are around 5000 IDs, and each ID can have 1 to 22 rows of data, each row belonging to a different group. I want to sample 1 row from each ID, and I want the sampled data to be equally distributed among the groups.
Here is a dummy df, simplified so that there are 8 IDs and each ID has 1 to 4 rows of data:
id group
1 a
1 b
1 c
1 d
2 a
2 b
3 a
3 b
3 c
3 d
4 a
4 b
4 d
5 a
5 b
5 c
5 d
6 a
6 d
7 a
7 b
7 d
8 a
8 b
8 c
8 d
Since there are 8 IDs and 4 groups, I want the sampled data to have 2 IDs from each group. The number 2 is just because I want an equal distribution among groups, so if there are 20 IDs and 4 groups, I would want the sampled data to have 5 IDs from each group. Also, I want to sample one row from each ID, so all IDs should appear once and only once in the sampled data. Is there a way to do this?
I've tried using the weights argument of pd.DataFrame.sample, with 1/frequency of each group as the weight, hoping that rows in less frequent groups would carry more weight and therefore have a higher chance of being sampled, so that the final sample would be roughly equally distributed among groups. But it didn't work as I expected. I tried different random states, but none of them gave me a sample that was equally distributed among groups. This is the code I used:
import pandas as pd

# Create dummy dataframe:
d = {'id': [1, 1, 1, 1,
            2, 2,
            3, 3, 3, 3,
            4, 4, 4,
            5, 5, 5, 5,
            6, 6,
            7, 7, 7,
            8, 8, 8, 8],
     'group': ['a', 'b', 'c', 'd',
               'a', 'b',
               'a', 'b', 'c', 'd',
               'a', 'b', 'd',
               'a', 'b', 'c', 'd',
               'a', 'd',
               'a', 'b', 'd',
               'a', 'b', 'c', 'd']}
df = pd.DataFrame(data=d)

# Calculate weights: inverse of each group's overall frequency
df['inverted_freq'] = 1. / df.groupby('group')['group'].transform('count')

# Sample one row from each ID
df1 = df.groupby('id').apply(pd.DataFrame.sample, random_state=1, n=1,
                             weights=df.inverted_freq).reset_index(drop=True)
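Weighted sampling only balances the groups in expectation, not exactly. One way to get an exactly balanced sample is to treat this as an assignment problem: shuffle the IDs, greedily assign each ID to its least-filled eligible group, and reshuffle on a dead end. A minimal sketch (the function name `balanced_sample` is hypothetical, and it assumes the number of IDs is divisible by the number of groups, as in the dummy data):

```python
import random
import pandas as pd

def balanced_sample(df, id_col="id", group_col="group", seed=0, tries=1000):
    """Pick one (id, group) row per ID so every group gets the same count."""
    rng = random.Random(seed)
    ids = list(df[id_col].unique())
    groups = sorted(df[group_col].unique())
    if len(ids) % len(groups):
        raise ValueError("number of IDs must be divisible by number of groups")
    cap = len(ids) // len(groups)           # target rows per group
    # map each ID to the set of groups it appears in
    options = df.groupby(id_col)[group_col].apply(set).to_dict()
    for _ in range(tries):                  # retry with a new ID order on failure
        rng.shuffle(ids)
        counts = dict.fromkeys(groups, 0)
        pick = {}
        for i in ids:
            # groups this ID can supply that still have room
            avail = [g for g in options[i] if counts[g] < cap]
            if not avail:
                break                       # dead end; reshuffle and retry
            g = min(avail, key=counts.get)  # least-filled eligible group
            pick[i] = g
            counts[g] += 1
        else:
            return (pd.DataFrame({id_col: list(pick),
                                  group_col: list(pick.values())})
                      .sort_values(id_col, ignore_index=True))
    raise RuntimeError("no balanced assignment found")

# dummy data from the question
d = {'id': [1,1,1,1, 2,2, 3,3,3,3, 4,4,4, 5,5,5,5, 6,6, 7,7,7, 8,8,8,8],
     'group': list('abcdababcdabdabcdadabdabcd')}
df = pd.DataFrame(d)
sampled = balanced_sample(df)
```

The greedy pass can paint itself into a corner (e.g. if every ID that has group c picks another group first), which is why it reshuffles and retries rather than assuming the first pass succeeds.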
My expected output is:
id group
1 d
2 b
3 a
4 d
5 c
6 a
7 b
8 c
or something similar to this, with one row per ID and an equal number of rows per group.
Suggestions in either R or Python would be greatly appreciated. Thanks!
CodePudding user response:
We can use data.table to sample two row indices per group (note that this balances the groups but does not guarantee that each ID appears only once):
library(data.table)
setDT(df)[df[, sample(.I, 2), group]$V1]
CodePudding user response:
In R, you can use dplyr::slice_sample to draw two rows per group:
library(dplyr)
df %>%
group_by(group) %>%
slice_sample(n = 2)
CodePudding user response:
try sampling two rows per ID with pandas' GroupBy.sample:
df.groupby("id").sample(n=2)
id group
0 1 a
2 1 c
4 2 a
5 2 b
7 3 b
9 3 d
12 4 d
10 4 a
13 5 a
16 5 d
18 6 d
17 6 a
21 7 d
19 7 a
23 8 b
24 8 c
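Note that grouping by "id" above draws two rows per ID, as the output shows. The pandas analogue of the R answers, two rows per group, would group by "group" instead. A minimal sketch (like the R answers, this balances the groups but does not guarantee each ID appears only once):

```python
import pandas as pd

# dummy data from the question
d = {'id': [1,1,1,1, 2,2, 3,3,3,3, 4,4,4, 5,5,5,5, 6,6, 7,7,7, 8,8,8,8],
     'group': list('abcdababcdabdabcdadabdabcd')}
df = pd.DataFrame(d)

# two rows per group (GroupBy.sample requires pandas >= 1.1)
per_group = df.groupby('group').sample(n=2, random_state=1)
```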