I have a data frame that looks like this:
user_id | tweet_id | tweet |
---|---|---|
user123 | 7658j | dogs are super |
user245 | 66721 | yes dogs are super |
user245 | 6d343 | yes cats are also super |
<...> | <...> | <...> |
user245 | 541238 | well I developed allergy on cates |
When I check the value counts for each user, I get the following results:
id | count |
---|---|
user245 | 456 |
user123 | 115 |
user427 | 2 |
I want to subset the data so that I keep all rows for ids whose value count is below 100, and keep 100 randomly sampled rows for ids whose value count is above 100. How can I do this?
CodePudding user response:
You can try:
(df.groupby('user_id', group_keys=False)
.apply(lambda g: g.sample(n=min(len(g), 100)))
)
Example (with n=3):
import pandas as pd

df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
(df.groupby('id', group_keys=False)
.apply(lambda g: g.sample(n=min(len(g), 3)))
)
Output (the sampled rows vary between runs because the sample is random):
id col
0 A 0
4 A 4
3 A 3
7 B 7
6 B 6
8 C 8
12 D 12
11 D 11
9 D 9
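If you would rather avoid `groupby.apply`, one alternative is to shuffle all rows once and then keep the first rows of each group with `groupby(...).head(...)`. The sketch below reuses the toy frame from the example. The names `cap` and `capped` are only illustrative, and `random_state=0` is an assumption added to make the result reproducible; drop it if you want a different sample on every run.

```python
import pandas as pd

# Toy data from the example above; a cap of 3 stands in for 100.
df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
cap = 3

# Shuffle every row once, then keep at most `cap` rows per id.
# Groups with fewer than `cap` rows are kept whole.
capped = (df.sample(frac=1, random_state=0)
            .groupby('id')
            .head(cap)
            .sort_index())  # optional: restore the original row order

# Number of rows kept per id
print(capped['id'].value_counts().sort_index())
```

For the original data this would be `df.sample(frac=1).groupby('user_id').head(100)`. Because the shuffle is uniform, the first 100 rows of each group are a random sample of that group without replacement.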