Sample from dataframe with conditions

I have a large dataset and I want to sample from it conditionally. What I need is a new dataframe with almost the same number of rows for each value (0 and 1) of a boolean column.

What I have:

df['target'].value_counts()

0 = 4000
1 = 120000

What I need:

new_df['target'].value_counts()

0 = 4000
1 = 6000

I know I can use df.sample, but I don't know how to add the condition.

Thanks

CodePudding user response：

Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:

df.groupby('target').sample(4000)

Demo:

import pandas as pd
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})

df.groupby('x').sample(5)
x
8   0
6   0
7   0
2   0
9   0
18  1
33  1
24  1
32  1
15  1
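
For reference, here is a self-contained sketch of the equal-size case that checks the result with value_counts. The random_state value and the names balanced and counts are assumptions added for illustration; passing random_state only makes the draw repeatable.

```python
import pandas as pd

# Toy frame matching the demo: 10 zeros followed by 25 ones.
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})

# Draw 5 rows from every group; random_state makes the draw repeatable.
balanced = df.groupby('x').sample(n=5, random_state=0)

# Count rows per group value to confirm the sample is balanced.
counts = balanced['x'].value_counts().sort_index()
print(counts)
```

Each group must contain at least n rows, otherwise sample raises an error unless replace=True is passed.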

If you need to sample conditionally based on the group value, you can do:

df.groupby('target', group_keys=False).apply(
  lambda g: g.sample(4000 if g.name == 0 else 6000)
)

Demo:

df.groupby('x', group_keys=False).apply(
  lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7   0
8   0
2   0
1   0
18  1
12  1
17  1
22  1
30  1
28  1
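
As a possible alternative, the per-group sizes can be kept in a dictionary and the groups sampled in a plain loop. This is only a sketch: the sizes mapping and the random_state value are hypothetical, and the loop is used because it does not rely on groupby.apply, whose handling of the grouping column has changed in newer pandas releases.

```python
import pandas as pd

# Toy frame matching the demo: 10 zeros followed by 25 ones.
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})

# Hypothetical mapping: group value -> number of rows to draw.
sizes = {0: 4, 1: 6}

# Iterating over a groupby yields (group value, sub-frame) pairs.
parts = [
    group.sample(n=sizes[value], random_state=0)
    for value, group in df.groupby('x')
]

# Stitch the sampled pieces back into one dataframe.
new_df = pd.concat(parts)
print(new_df['x'].value_counts().sort_index())
```

With the real data, the mapping would be {0: 4000, 1: 6000} and the column would be 'target'.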

CodePudding user response：

Assuming the following input and using the values 4/6 instead of 4000/6000:

df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})

You could group by your target and sample at most N rows per group:

df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))

example output:

If you want the same size for every group, you can simply use df.groupby('target').sample(n=4).
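
To tie the capped approach back to the question, here is a rough sketch on a scaled-down version of the imbalance (40 zeros and 1200 ones instead of 4000 and 120000). The cap of 60, the random_state value and the final shuffle are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

# Scaled-down stand-in for the question's data: 40 zeros, 1200 ones.
df = pd.DataFrame({'target': [0] * 40 + [1] * 1200})

# Assumed upper bound on rows kept per group.
cap = 60

# Keep every row of a small group, and at most `cap` rows of a large one.
parts = [
    group.sample(n=min(len(group), cap), random_state=42)
    for _, group in df.groupby('target')
]

# Concatenate, then shuffle so the two classes are interleaved.
new_df = pd.concat(parts).sample(frac=1, random_state=42)
print(new_df['target'].value_counts().sort_index())
```

The minority class is kept whole (40 rows) while the majority class is cut down to the cap (60 rows), which mirrors the 4000/6000 split asked for.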