Home > OS >  Evenly distribute samples within a dataframe
Evenly distribute samples within a dataframe

Time:03-04

I have a dataframe with 2 columns:

  1. employee_num (from 0 to 4999)
  2. risk, which only has 3 values - high, med and low. They are randomly distributed between the dataframe.

I need to add a third column, called checker, which has 5 values - 1,2,4,6,8. Those checkers need to be evenly distributed between the risk values. There can be a difference of 1, since the risk levels not necessarily divided by 5.

For example, if my df is 5000 rows, and the risk is distributed: high 2537 low 1470 med 993

Then high would by distributed: 1:507, 2:507, 4:507, 6:508, 8:508 >> 2537 overall.

I tried the following:

workers = pd.DataFrame()
workers['employee_num'] = np.arange(5000)
risk = ['high', 'med', 'low']
workers['risk'] = np.random.choice(risk, size=len(workers), p=[0.5,0.2,0.3])
workers.groupby('risk').size()
worker = workers.copy()


df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame() 

And then:

def func(df1, df2, df3, df4, df5):
    grp = [1,2,4,6,8]
    for i, df in enumerate([df1, df2, df3, df4, df5]):
        df = worker.groupby('risk').sample(frac=(len(worker)/len(grp))/len(worker), replace=False, random_state=0)
        print(1, worker.shape)
        print(2, df.shape)
        worker.drop(df.index.values, axis=0, inplace=True)
        print(3, worker.shape)
        df['checker'] = grp[i]
        yield df

But after i concat it back together, i'm left with just part of the data i started with:

df1, df2, df3, df4, df5 = func(df1, df2, df3, df4, df5)
worker = pd.concat([df1, df2, df3, df4, df5],axis=0)

I would be most grateful for a solution to this (been sitting on this for hours). Thank you!

CodePudding user response:

This is what I've come up with, I hope it helps:

df['checker'] = pd.Series()

for i in risk:
    workers.checker[workers.risk==i] = pd.qcut(workers.employee_num[workers.risk==i],5,labels=[1,2,4,6,8])

CodePudding user response:

We can groupby by risk and then cycle through 0..4 in each group mapping to the 'checker' values:

cm = {0:1,1:2,2:4,3:6,4:8}
workers['checker'] = (workers.groupby('risk').cumcount()%5).map(cm)

workers looks like this:

     employee_num  risk      checker
--  --------------  ------  ---------
 0               0  high            1
 1               1  low             1
 2               2  med             1
 3               3  high            2
 4               4  high            4
 5               5  low             2
 6               6  high            6
 7               7  med             2
 8               8  high            8
 9               9  med             4
10              10  med             6
11              11  med             8
12              12  low             4
13              13  med             1
14              14  high            1
15              15  high            2
16              16  med             2
17              17  high            4
18              18  med             4
19              19  med             6
...

let's look at the statistics:

workers.groupby(['risk','checker']).size()

output:

risk  checker
high  1          510
      2          510
      4          510
      6          509
      8          509
low   1          296
      2          296
      4          296
      6          295
      8          295
med   1          195
      2          195
      4          195
      6          195
      8          194
  • Related