How to generate weights for pandas dataframe column?

Time:03-11

I have the following pandas DataFrame df:

col1   col2
0.2    0
0.1    1
0.6    1
0.3    1
0.5    0
0.2    0
0.3    1
0.5    1
0.7    1
0.1    1

I need to generate a new column col3 based on col2 values. The logic should be the following:

  • Each run of n consecutive 1s in col2 should get linearly decreasing weights from 1 down to 1/n (that is, 1, (n-1)/n, ..., 1/n). Rows where col2 is 0 should get 0.

This is the expected result:

col1   col2   col3
0.2    0      0.0
0.1    1      1.0
0.6    1      0.66
0.3    1      0.33 
0.5    0      0.0
0.2    0      0.0
0.3    1      1.0
0.5    1      0.75
0.7    1      0.50
0.1    1      0.25
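
To make the example reproducible, here is a minimal sketch that rebuilds the frame above and spells out the rule with a plain Python loop. The helper name `run_weights` is made up for illustration, and the rule it encodes (weight 1 - k/n for the k-th row, counting from 0, of a run of n ones) is inferred from the expected result rather than stated in the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "col1": [0.2, 0.1, 0.6, 0.3, 0.5, 0.2, 0.3, 0.5, 0.7, 0.1],
    "col2": [0, 1, 1, 1, 0, 0, 1, 1, 1, 1],
})

def run_weights(flags):
    """Hypothetical reference helper: 1 - k/n for the k-th row of each run of n ones."""
    flags = list(flags)
    out = []
    i = 0
    while i < len(flags):
        if flags[i] != 1:
            out.append(0.0)          # rows outside a run get weight 0
            i += 1
            continue
        j = i
        while j < len(flags) and flags[j] == 1:
            j += 1                   # find the end of the current run
        n = j - i
        out.extend(1 - k / n for k in range(n))
        i = j
    return out

expected = run_weights(df["col2"])
```

Up to rounding, `expected` matches the col3 column shown above, so it can serve as a baseline for checking any vectorized answer.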

CodePudding user response:

Here's one approach:

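# Start a new group at every 0, so each run of 1s shares a label with the 0 just before it
groups = df['col2'].eq(0).cumsum()
g = df['col2'].eq(1).groupby(groups)
# Running fraction of 1s seen so far within each group (0/0 for a lone 0 becomes 0)
df['col3'] = g.cumsum().div(g.transform('sum')).fillna(0)
# Drop each group's leading 0 row and reverse the rest, so weights run from 1 down to 1/n
df.loc[df['col2'] == 1, 'col3'] = df['col3'].groupby(groups).apply(lambda x: x.iloc[1:][::-1]).to_numpy()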

Output:

   col1  col2      col3
0   0.2     0  0.000000
1   0.1     1  1.000000
2   0.6     1  0.666667
3   0.3     1  0.333333
4   0.5     0  0.000000
5   0.2     0  0.000000
6   0.3     1  1.000000
7   0.5     1  0.750000
8   0.7     1  0.500000
9   0.1     1  0.250000
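
The last line of the code above drops the first row of every group, which works here because every run of 1s is preceded by a 0. As an alternative sketch (not the answerer's code), the descending weights can be computed directly with `cumcount(ascending=False)`, which does not depend on that leading 0. The helper name `descending_weights` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"col2": [0, 1, 1, 1, 0, 0, 1, 1, 1, 1]})

def descending_weights(col):
    """Hypothetical helper: (n - k) / n for the k-th (0-based) row of each run of n ones."""
    ones = col.eq(1)
    run_id = (~ones).cumsum()                    # constant within each run of 1s
    g = ones[ones].groupby(run_id[ones])         # group only the rows that are 1
    # Reverse position within the run (n-1 ... 0), shifted to n ... 1, divided by n
    w = (g.cumcount(ascending=False) + 1) / g.transform("size")
    return w.reindex(col.index, fill_value=0.0)  # every other row gets 0

df["col3"] = descending_weights(df["col2"])
```

Because only the rows equal to 1 are grouped, a frame that starts with a run of 1s is handled the same way as any other run.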

CodePudding user response:

Use:

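# 1 - k/n for the k-th (0-based) row in a run of n ones
weights = lambda x: 1 - (x / x.size).cumsum().shift(fill_value=0)
# group_keys=False keeps the original row index on pandas 2.0 and later, so reindex aligns
df['col3'] = df.groupby(df['col2'].eq(0).cumsum().mask(df['col2'].eq(0)), group_keys=False)['col2'] \
               .apply(weights).reindex(df.index, fill_value=0)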
print(df)

# Output
   col1  col2      col3
0   0.2     0  0.000000
1   0.1     1  1.000000
2   0.6     1  0.666667
3   0.3     1  0.333333
4   0.5     0  0.000000
5   0.2     0  0.000000
6   0.3     1  1.000000
7   0.5     1  0.750000
8   0.7     1  0.500000
9   0.1     1  0.250000

How the groups are formed:

>>> df.assign(group=df['col2'].eq(0).cumsum().mask(df['col2'].eq(0)))
   col1  col2  group
0   0.2     0    NaN
1   0.1     1    1.0  # First group: 3 consecutive 1s
2   0.6     1    1.0
3   0.3     1    1.0
4   0.5     0    NaN
5   0.2     0    NaN
6   0.3     1    3.0  # Second group: 4 consecutive 1s
7   0.5     1    3.0
8   0.7     1    3.0
9   0.1     1    3.0
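
The same key can be built one step at a time, which may make it easier to see why the labels are 1.0 and 3.0. This is only an illustrative sketch; the intermediate names `is_zero`, `counter`, and `key` are not from the answer.

```python
import pandas as pd

col2 = pd.Series([0, 1, 1, 1, 0, 0, 1, 1, 1, 1], name="col2")

is_zero = col2.eq(0)
counter = is_zero.cumsum()     # increases by 1 at every 0, stays flat across a run of 1s
key = counter.mask(is_zero)    # replace the label on the 0 rows themselves with NaN

# groupby skips NaN keys by default (dropna=True), so only the runs of 1s form groups
sizes = col2.groupby(key).size().to_dict()
print(sizes)  # {1.0: 3, 3.0: 4}
```

Label 2 never shows up as a group because the counter passes through it on consecutive 0 rows, and those rows are masked out.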

CodePudding user response:

Quick and Dirty

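mask = df.col2.ne(1)                  # True where col2 is not 1
grps = mask.cumsum().mask(mask, 0)    # label each run of 1s; all other rows share label 0
gb = grps.groupby(grps)
# 1 - (position in run / run length), then force the non-1 rows to 0
df.assign(col3=(1 - gb.cumcount() / gb.transform('size')).mask(mask, 0))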

   col1  col2      col3
0   0.2     0  0.000000
1   0.1     1  1.000000
2   0.6     1  0.666667
3   0.3     1  0.333333
4   0.5     0  0.000000
5   0.2     0  0.000000
6   0.3     1  1.000000
7   0.5     1  0.750000
8   0.7     1  0.500000
9   0.1     1  0.250000

Or the same thing with the little-known groupby pipe:

mask = df.col2.ne(1)
grps = mask.cumsum().mask(mask, 0)
func = lambda g: g.cumcount() / g.transform('size')
# Mask after subtracting; masking first would turn the col2 != 1 rows into 1 - 0 = 1
df.assign(col3=(1 - grps.groupby(grps).pipe(func)).mask(mask, 0))
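
Since `df.assign` returns a new frame instead of modifying `df`, the result has to be captured. A minimal sketch of the pipe variant wrapped in a function, with the mask applied after the subtraction so that rows where col2 is not 1 come out as 0; the function name `add_run_weights` and its `flag` and `out` parameters are hypothetical.

```python
import pandas as pd

def add_run_weights(df, flag="col2", out="col3"):
    """Hypothetical wrapper around the cumcount / size approach shown above."""
    mask = df[flag].ne(1)
    grps = mask.cumsum().mask(mask, 0)
    # Position within each group divided by the group's size
    frac = grps.groupby(grps).pipe(lambda g: g.cumcount() / g.transform("size"))
    # Subtract first, then mask, so rows where flag != 1 end up 0 rather than 1 - 0
    return df.assign(**{out: (1 - frac).mask(mask, 0)})

df = pd.DataFrame({
    "col1": [0.2, 0.1, 0.6, 0.3, 0.5, 0.2, 0.3, 0.5, 0.7, 0.1],
    "col2": [0, 1, 1, 1, 0, 0, 1, 1, 1, 1],
})
result = add_run_weights(df)
print(result)
```

The original `df` is left untouched, so the call can be repeated or chained with other `assign` steps.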