How to randomly sample and keep only n values of repeating IDs?

I have a data frame that looks like this:

user_id	tweet_id	tweet
user123	7658j	dogs are super
user245	66721	yes dogs are super
user245	6d343	yes cats are also super
<...>	<...>	<...>
user245	541238	well I developed allergy on cates

As I check value counts for each user, I have the following results:

id	count
user245	456
user123	115
user427	2

I want to subset the data so that I keep all rows for ids whose value count is at most 100, and keep only 100 randomly sampled rows for ids whose value count is above 100. How can I do this?

CodePudding user response:

You can try:

(df.groupby('user_id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 100)))
)
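If `apply` turns out to be slow on a large frame, one possible variation is to shuffle all rows once and then take the first n rows of each group. This is only a sketch: the helper name `cap_rows_per_group` and its `seed` parameter are made up for illustration, and the small frame just stands in for your data.

```python
import pandas as pd

def cap_rows_per_group(df, key, n, seed=None):
    """Keep at most n randomly chosen rows per value of `key` (hypothetical helper)."""
    # Shuffle every row once; within each group the order is now random.
    shuffled = df.sample(frac=1, random_state=seed)
    # The first n rows of each shuffled group form a random sample without replacement.
    # Groups with fewer than n rows are kept whole.
    capped = shuffled.groupby(key).head(n)
    # Put the surviving rows back in their original order.
    return capped.sort_index()

# Toy data standing in for the real frame (assumed column names).
df = pd.DataFrame({'user_id': list('AAAAAABBCDDDD'), 'col': range(13)})
result = cap_rows_per_group(df, 'user_id', n=3, seed=0)
print(result['user_id'].value_counts().sort_index())
```

Which rows survive depends on the seed, but the per-id counts do not: ids with more than n rows are cut down to n, the rest are untouched.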

Example (with n=3):

import pandas as pd

df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
(df.groupby('id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 3)))
)

Output: