Home > Net >  Iterate through pandas df rows to calculate a new variable depending on conditions
Iterate through pandas df rows to calculate a new variable depending on conditions

Time:07-28

I have a dataframe that looks like this:

dict1 = {'category': {0: 0.0, 1: 1.0, 2: 0.0, 3: 1.0, 4: 0.0},
 'Id': {0: 24108, 1: 24307, 2: 24307, 3: 24411, 4: 24411},
 'count': {0: 3, 1: 2, 2: 33, 3: 98, 4: 33},
 'weight': {0: 0.5, 1: 0.2, 2: 0.7, 3: 1.2, 4: 0.39}}

df1 = pd.DataFrame(dict1)

    category    Id  count   weight
0   0.0      24108    3     0.50
1   1.0      24307    2     0.20
2   0.0      24307   33     0.70
3   1.0      24411   98     1.20
4   0.0      24411   33     0.39

There are about 1000 such entries. Some only have 1 category tag, as is the example for Id 24108. Other Ids have two entries because there are two category tags, such as Id 24307 and 24411.

I want to generate a new column called val that follows one of 3 rules, depending on certain conditions.

If an Id only has 1 tag associated with it, such as Id 24108, then the value in the new column val should be a simple product between count and weight.

If an Id has 2 tags, such as Id 24307 or 24411, then the script should look at which one of the two entries has a larger count value first.

For Id 24307, the count for category 0 is higher than the count for category 1, therefore the val column should be 2 * (0.2 1) next to category 1 and 33 * (1-0.7) next to category 0

For Id 24411, the count for category 0 is lower than the count for category 1, therefore the val column should be 98 * (1-1.2) next to category 1 and 33 * (0.3 1) next to category 0.

The expected output should be like this:

    category    Id  count   weight  val
0   0.0      24108    3     0.50   1.50
1   1.0      24307    2     0.20   2.40
2   0.0      24307   33     0.70   9.90
3   1.0      24411   98     1.20 -19.60
4   0.0      24411   33     0.39  42.90

CodePudding user response:

You can use boolean arithmetics and numpy.select:

g = df1.groupby('Id')

# number of categories
n = g['category'].transform('nunique')
# max count
m = g['count'].transform('max')

# selection based on conditions
df1['val'] = np.select(
          [n.eq(1), df1['count'].eq(m)], # case if unique, case if max
          # formula for unique, formula for max
          [df1['count']*df1['weight'], df1['count']*(1-df1['weight'])],
          # default value (i.e. not unique and non-max count)
          df1['count']*(1 df1['weight'])
         )

output:

   category     Id  count  weight    val
0       0.0  24108      3    0.50   1.50
1       1.0  24307      2    0.20   2.40
2       0.0  24307     33    0.70   9.90
3       1.0  24411     98    1.20 -19.60
4       0.0  24411     33    0.39  45.87

If you really have only 2 categories, then you can simplify the arithmetic to consider the non max as min. It's a bit quite hacky but should work fine:

g = df1.groupby('Id')

# 0 if unique category else 1
n = g['category'].transform('nunique').gt(1).astype(int)
# -1 if count is min of group else 1
m = (df1['count'] != g['count'].transform('min'))*2-1

df1['val'] = df1['count'] * (n-m*df1['weight'])

CodePudding user response:

You can try this:

gbct = df1.groupby('Id')['count'].transform
df1['val'] = df1['count'] * (gbct('size') - 1   
                             (3 - 2*gbct('rank', 'first')) * df1['weight'])

Assumptions

  • There are only 1-row or 2-row groups per Id.
  • That's it (e.g.: count can be the same for two rows of an Id --in which case the first row will get the count * (1 weight) value and the other will get count * (1 - weight)).

Explanation

Consider this:

>>> df1.assign(
    a=gbct('size') - 1,
    s=3 - 2*gbct('rank', 'first'),
    val=df1['count'] * (gbct('size') - 1   (3 - 2*gbct('rank', 'first')) * df1['weight'])
)
   category     Id  count  weight  a    s    val
0       0.0  24108      3    0.50  0  1.0   1.50
1       1.0  24307      2    0.20  1  1.0   2.40
2       0.0  24307     33    0.70  1 -1.0   9.90
3       1.0  24411     98    1.20  1 -1.0 -19.60
4       0.0  24411     33    0.39  1  1.0  45.87

In the above, a is 1 for 2-row Ids and 0 for single row ones. s is a sign for weight. The formula for val is then simply:

val = count * (a   s * weight)

Note what happens if an Id has the same count for each row:

>>> tmp = df1.copy()
>>> tmp.loc[3, 'count'] = 33
>>> gbct = tmp.groupby('Id')['count'].transform
>>> tmp.assign(
...     a=gbct('size') - 1,
...     s=3 - 2*gbct('rank', 'first'),
...     val=tmp['count'] * (gbct('size') - 1  
...                         (3 - 2*gbct('rank', 'first')) * tmp['weight'])
... )
   category     Id  count  weight  a    s    val
0       0.0  24108      3    0.50  0  1.0   1.50
1       1.0  24307      2    0.20  1  1.0   2.40
2       0.0  24307     33    0.70  1 -1.0   9.90
3       1.0  24411     33    1.20  1  1.0  72.60
4       0.0  24411     33    0.39  1 -1.0  20.13
  • Related