How to randomly sample and keep only n values of repeating IDs?

Time:08-24

I have a data frame that looks like this:

user_id  tweet_id  tweet
user123  7658j     dogs are super
user245  66721     yes dogs are super
user245  6d343     yes cats are also super
<...>    <...>     <...>
user245  541238    well I developed allergy on cates

As I check value counts for each user, I have the following results:

id count
user245 456
user123 115
user427 2

I want to subset the data so that I keep all rows for ids whose value count is below 100, and keep 100 randomly sampled rows for ids whose value count is above 100. How can I do that?

CodePudding user response:

You can try:

# min(len(g), 100) keeps small groups whole and caps larger ones at 100 rows
(df.groupby('user_id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 100)))
)

Example (with n=3):

import pandas as pd

df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
# Sample at most 3 rows per id; groups with fewer rows are kept whole
(df.groupby('id', group_keys=False)
   .apply(lambda g: g.sample(n=min(len(g), 3)))
)

Possible output (the sample is random, so the selected rows vary between runs):

   id  col
0   A    0
4   A    4
3   A    3
7   B    7
6   B    6
8   C    8
12  D   12
11  D   11
9   D    9
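
As a possible alternative to the per-group lambda, the same result can be sketched by shuffling the whole frame once and then taking the first n rows of each group. This is not part of the answer above; the helper name cap_rows_per_group and its seed argument are made up for illustration, and passing a seed is only an assumption about wanting reproducible samples.

```python
import pandas as pd

def cap_rows_per_group(df, key, n, seed=None):
    """Keep at most n randomly chosen rows per value of `key` (hypothetical helper)."""
    # Shuffle all rows once. Within each group the shuffled order is a random
    # permutation, so the first n rows form a sample without replacement.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(n) keeps groups with fewer than n rows whole.
    capped = shuffled.groupby(key).head(n)
    # Restore the original row order.
    return capped.sort_index()

df = pd.DataFrame({'id': list('AAAAAABBCDDDD'), 'col': range(13)})
result = cap_rows_per_group(df, 'id', n=3, seed=0)
print(result['id'].value_counts().sort_index())
```

With the question's data this would be called as cap_rows_per_group(df, 'user_id', n=100). Which rows are picked depends on the seed, but the per-id counts do not: ids with more than n rows are cut to n, and the rest are left untouched.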