Sample Pandas Dataframe with equal number based on binary column

Time:04-12

I have a dataframe with a data column, and a value column, as in the example below. The value column is always binary, 0 or 1:

data,value
173,1
1378,0
926,0
643,0
1279,0
472,0
706,0
1345,0
1167,1
1401,1
1236,0
447,1
1204,1
398,0
714,0
734,0
1732,0
98,0
1696,0
160,0
1611,0
274,1
562,0
625,0
1028,0
1766,0
511,0
1691,0
898,1

I need to sample the dataset so that I end up with an equal number of rows for both values. If there are fewer 1s than 0s, the number of 1s should set the sample size; if there are fewer 0s, the number of 0s should.

Any clues on how to do this? I'm working in a Jupyter notebook with Python 3.6 (I cannot upgrade to a newer version).

CodePudding user response:

Sample data

import pandas as pd

data = [173,926,634,706,398]
value = [1,0,0,1,0]

df = pd.DataFrame({"data": data, "value": value})

print(df)

#    data  value
# 0   173      1
# 1   926      0
# 2   634      0
# 3   706      1
# 4   398      0

Split into two DataFrames, one per class

ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]

print(ones)
print()
print()
print(zeros)

#    data  value
# 0   173      1
# 3   706      1


#    data  value
# 1   926      0
# 2   634      0
# 4   398      0

Truncate as required

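Find the smaller of the two groups and truncate the larger one to the same length. Note that this keeps the first n rows of the larger group; it is not a random sample.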

if len(ones) <= len(zeros):
  zeros = zeros.iloc[:len(ones), :]
else:
  ones = ones.iloc[:len(zeros), :]

print(ones)
print()
print()
print(zeros)

#    data  value
# 0   173      1
# 3   706      1
#
#
#    data  value
# 1   926      0
# 2   634      0

CodePudding user response:

Group your dataframe by value, then sample as many rows from each group as the smallest group contains.

grouped = df.groupby('value')
smallest = grouped.size().min()  # row count of the smallest class, as a single integer
try:  # pandas >= 1.1.0
    print(grouped.sample(n=smallest))
except AttributeError:  # pandas < 1.1.0 has no GroupBy.sample
    print(grouped.apply(lambda g: g.sample(n=smallest)))

Output:

    data  value
25  1766      0
3    643      0
10  1236      0
1   1378      0
14   714      0
6    706      0
24  1028      0
8   1167      1
9   1401      1
0    173      1
12  1204      1
11   447      1
28   898      1
21   274      1
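If you would rather not depend on the pandas version at all, one possible variant is to loop over the groups and call DataFrame.sample on each. The sketch below uses the data from the question; the helper name balance_by_class and the seed 42 are hypothetical, chosen only for illustration.

```python
import pandas as pd

# The question's data, copied from the CSV in the question.
data = [173, 1378, 926, 643, 1279, 472, 706, 1345, 1167, 1401,
        1236, 447, 1204, 398, 714, 734, 1732, 98, 1696, 160,
        1611, 274, 562, 625, 1028, 1766, 511, 1691, 898]
value = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
         0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 0, 0, 0, 1]
df = pd.DataFrame({"data": data, "value": value})


def balance_by_class(frame, column, seed=None):
    """Downsample every class in `column` to the size of the smallest class."""
    # Row count of the rarest class.
    n = frame[column].value_counts().min()
    # Draw n rows from each group; the original index labels are kept.
    parts = [group.sample(n=n, random_state=seed)
             for _, group in frame.groupby(column)]
    return pd.concat(parts)


balanced = balance_by_class(df, "value", seed=42)
print(balanced["value"].value_counts())
```

With this data there are 7 ones and 22 zeros, so the result has 14 rows, 7 per class. Passing a seed makes the selection repeatable between notebook runs.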

CodePudding user response:

This should do it (it needs pandas 1.1.0 or later, where GroupBy.sample was added):

df.groupby('value').sample(df.groupby('value').size().min())
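A slightly longer sketch of the same call, with a fixed seed and a final shuffle, since the sampled rows typically come back grouped by class. The seed value 1 and the names balanced and shuffled are assumptions for the example. As far as I know, pandas 1.1.x still installs on Python 3.6.1 and later, so this should be usable without upgrading Python.

```python
import pandas as pd

df = pd.DataFrame({
    "data": [173, 926, 634, 706, 398],
    "value": [1, 0, 0, 1, 0],
})

# Number of rows in the smallest class.
n = df.groupby("value").size().min()

# Requires pandas >= 1.1.0; random_state=1 is an arbitrary seed.
balanced = df.groupby("value").sample(n=n, random_state=1)

# Optional: shuffle the rows so the classes are interleaved.
shuffled = balanced.sample(frac=1, random_state=1)
print(shuffled)
```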