I have a dataframe with 2 columns:
- employee_num (from 0 to 4999)
- risk, which only has 3 values - high, med and low. They are randomly distributed between the dataframe.
I need to add a third column, called checker, which has 5 values - 1,2,4,6,8. Those checkers need to be evenly distributed between the risk values. There can be a difference of 1, since the risk levels not necessarily divided by 5.
For example, if my df is 5000 rows, and the risk is distributed: high 2537 low 1470 med 993
Then high would by distributed: 1:507, 2:507, 4:507, 6:508, 8:508 >> 2537 overall.
I tried the following:
workers = pd.DataFrame()
workers['employee_num'] = np.arange(5000)
risk = ['high', 'med', 'low']
workers['risk'] = np.random.choice(risk, size=len(workers), p=[0.5,0.2,0.3])
workers.groupby('risk').size()
worker = workers.copy()
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
df5 = pd.DataFrame()
And then:
def func(df1, df2, df3, df4, df5):
grp = [1,2,4,6,8]
for i, df in enumerate([df1, df2, df3, df4, df5]):
df = worker.groupby('risk').sample(frac=(len(worker)/len(grp))/len(worker), replace=False, random_state=0)
print(1, worker.shape)
print(2, df.shape)
worker.drop(df.index.values, axis=0, inplace=True)
print(3, worker.shape)
df['checker'] = grp[i]
yield df
But after i concat it back together, i'm left with just part of the data i started with:
df1, df2, df3, df4, df5 = func(df1, df2, df3, df4, df5)
worker = pd.concat([df1, df2, df3, df4, df5],axis=0)
I would be most grateful for a solution to this (been sitting on this for hours). Thank you!
CodePudding user response:
This is what I've come up with, I hope it helps:
df['checker'] = pd.Series()
for i in risk:
workers.checker[workers.risk==i] = pd.qcut(workers.employee_num[workers.risk==i],5,labels=[1,2,4,6,8])
CodePudding user response:
We can groupby
by risk and then cycle through 0..4 in each group mapping to the 'checker' values:
cm = {0:1,1:2,2:4,3:6,4:8}
workers['checker'] = (workers.groupby('risk').cumcount()%5).map(cm)
workers
looks like this:
employee_num risk checker
-- -------------- ------ ---------
0 0 high 1
1 1 low 1
2 2 med 1
3 3 high 2
4 4 high 4
5 5 low 2
6 6 high 6
7 7 med 2
8 8 high 8
9 9 med 4
10 10 med 6
11 11 med 8
12 12 low 4
13 13 med 1
14 14 high 1
15 15 high 2
16 16 med 2
17 17 high 4
18 18 med 4
19 19 med 6
...
let's look at the statistics:
workers.groupby(['risk','checker']).size()
output:
risk checker
high 1 510
2 510
4 510
6 509
8 509
low 1 296
2 296
4 296
6 295
8 295
med 1 195
2 195
4 195
6 195
8 194