I have a dataframe with a data column and a value column, as in the example below. The value column is always binary, 0 or 1:
data,value
173,1
1378,0
926,0
643,0
1279,0
472,0
706,0
1345,0
1167,1
1401,1
1236,0
447,1
1204,1
398,0
714,0
734,0
1732,0
98,0
1696,0
160,0
1611,0
274,1
562,0
625,0
1028,0
1766,0
511,0
1691,0
898,1
I need to sample the dataset so that I end up with an equal number of rows for each value. So, if class 1 has fewer rows, its count should be the reference; if class 0 has fewer rows, its count should be used instead.
Any clues on how to do this? I'm working in a Jupyter notebook with Python 3.6 (I cannot upgrade).
CodePudding user response:
Sample data
import pandas as pd
data = [173,926,634,706,398]
value = [1,0,0,1,0]
df = pd.DataFrame({"data": data, "value": value})
print(df)
# data value
# 0 173 1
# 1 926 0
# 2 634 0
# 3 706 1
# 4 398 0
Filter to two DFs
ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
# data value
# 1 926 0
# 2 634 0
# 4 398 0
Truncate as required
Find the smaller group, then truncate the larger one to the same length (take the first n rows)
if len(ones) <= len(zeros):
    zeros = zeros.iloc[:len(ones), :]
else:
    ones = ones.iloc[:len(zeros), :]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
#
#
# data value
# 1 926 0
# 2 634 0
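The truncation above always keeps the first rows of the larger class. If a random subset is wanted instead, as the word "sample" in the question suggests, a minimal sketch could use DataFrame.sample. The helper name undersample_two_frames and its seed parameter are hypothetical, introduced only for this illustration, and the sketch assumes the label column holds only 0 and 1.

```python
import pandas as pd


def undersample_two_frames(df, label="value", seed=0):
    """Randomly undersample the larger class (hypothetical helper).

    Assumes `label` contains only the values 0 and 1.
    """
    ones = df[df[label] == 1]
    zeros = df[df[label] == 0]
    n = min(len(ones), len(zeros))  # size of the smaller class
    # DataFrame.sample draws n rows without replacement;
    # random_state makes the draw repeatable.
    balanced = pd.concat([
        ones.sample(n=n, random_state=seed),
        zeros.sample(n=n, random_state=seed),
    ])
    return balanced.sort_index()  # restore the original row order


df = pd.DataFrame({"data": [173, 926, 634, 706, 398],
                   "value": [1, 0, 0, 1, 0]})
balanced = undersample_two_frames(df)
print(balanced["value"].value_counts())
```

Both classes end up with the same number of rows, and which rows of the larger class survive depends on the seed rather than on their position in the dataframe.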
CodePudding user response:
Group your dataframe by values, and then take a sample of the smallest count from each group.
grouped = df.groupby(['value'])
smallest = grouped.size().min()  # number of rows in the rarest class
try:  # pandas 1.1.0 and later
    print(grouped.sample(smallest))
except AttributeError:  # before pandas 1.1.0, GroupBy has no sample()
    print(grouped.apply(lambda g: g.sample(smallest)))
Output:
data value
25 1766 0
3 643 0
10 1236 0
1 1378 0
14 714 0
6 706 0
24 1028 0
8 1167 1
9 1401 1
0 173 1
12 1204 1
11 447 1
28 898 1
21 274 1
CodePudding user response:
This should do it (GroupBy.sample requires pandas 1.1.0 or later):
df.groupby('value').sample(df.groupby('value').size().min())
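To check that the result really is balanced, the one-liner can be followed by value_counts. The sketch below uses the first ten rows of the question's data and assumes pandas 1.1.0 or later; the random_state value is an arbitrary seed added here only to make the draw repeatable.

```python
import pandas as pd

# First ten rows of the question's data: three 1s and seven 0s
df = pd.DataFrame({
    "data":  [173, 1378, 926, 643, 1279, 472, 706, 1345, 1167, 1401],
    "value": [1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Size of the rarest class
n = int(df.groupby("value").size().min())

# random_state=42 is an arbitrary seed chosen for this illustration
balanced = df.groupby("value").sample(n=n, random_state=42)

# Each class should now appear exactly n times
counts = balanced["value"].value_counts()
print(counts)
```

Leaving out random_state gives a different subset of the larger class on each run, which may or may not be what is wanted in a notebook that gets re-executed.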