I have a dataframe that looks like this:
import pandas as pd

dict1 = {'category': {0: 0.0, 1: 1.0, 2: 0.0, 3: 1.0, 4: 0.0},
         'Id': {0: 24108, 1: 24307, 2: 24307, 3: 24411, 4: 24411},
         'count': {0: 3, 1: 2, 2: 33, 3: 98, 4: 33},
         'weight': {0: 0.5, 1: 0.2, 2: 0.7, 3: 1.2, 4: 0.39}}
df1 = pd.DataFrame(dict1)
   category     Id  count  weight
0       0.0  24108      3    0.50
1       1.0  24307      2    0.20
2       0.0  24307     33    0.70
3       1.0  24411     98    1.20
4       0.0  24411     33    0.39
There are about 1000 such entries. Some Ids only have 1 category tag, as in the example of Id 24108. Other Ids have two entries because there are two category tags, such as Ids 24307 and 24411.

I want to generate a new column called val that follows one of 3 rules, depending on certain conditions.

If an Id only has 1 tag associated with it, such as Id 24108, then the value in the new column val should be a simple product of count and weight.

If an Id has 2 tags, such as Id 24307 or 24411, then the script should first look at which of the two entries has the larger count value.

For Id 24307, the count for category 0 is higher than the count for category 1, therefore the val column should be 2 * (0.2 + 1) next to category 1 and 33 * (1 - 0.7) next to category 0.

For Id 24411, the count for category 0 is lower than the count for category 1, therefore the val column should be 98 * (1 - 1.2) next to category 1 and 33 * (0.3 + 1) next to category 0.
The expected output should be like this:
   category     Id  count  weight    val
0       0.0  24108      3    0.50   1.50
1       1.0  24307      2    0.20   2.40
2       0.0  24307     33    0.70   9.90
3       1.0  24411     98    1.20 -19.60
4       0.0  24411     33    0.39  42.90
CodePudding user response:
You can use boolean arithmetic and numpy.select:
import numpy as np

g = df1.groupby('Id')
# number of categories
n = g['category'].transform('nunique')
# max count
m = g['count'].transform('max')
# selection based on conditions
df1['val'] = np.select(
    [n.eq(1), df1['count'].eq(m)],  # case if unique, case if max
    # formula for unique, formula for max
    [df1['count']*df1['weight'], df1['count']*(1-df1['weight'])],
    # default value (i.e. not unique and non-max count)
    df1['count']*(1+df1['weight'])
)
output:
   category     Id  count  weight    val
0       0.0  24108      3    0.50   1.50
1       1.0  24307      2    0.20   2.40
2       0.0  24307     33    0.70   9.90
3       1.0  24411     98    1.20 -19.60
4       0.0  24411     33    0.39  45.87
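The order of the condition list matters here: a single-entry row is also its group's max, but np.select picks the first condition that matches, so the unique case wins. A standalone toy example of that behaviour:

```python
import numpy as np

x = np.array([1, 5, 10])
# conditions overlap (1 < 3 and 1 < 8); the first match wins, the default fills the rest
out = np.select([x < 3, x < 8], ['small', 'medium'], default='large')
```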
If you really have only 2 categories, then you can simplify the arithmetic by treating the non-max count as the min. It's a bit hacky but should work fine:
g = df1.groupby('Id')
# 0 if unique category else 1
n = g['category'].transform('nunique').gt(1).astype(int)
# -1 if count is min of group else 1
m = (df1['count'] != g['count'].transform('min'))*2-1
df1['val'] = df1['count'] * (n-m*df1['weight'])
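A quick sanity check (my own snippet) that this compact formula reproduces the values above on the sample frame:

```python
import pandas as pd

dict1 = {'category': {0: 0.0, 1: 1.0, 2: 0.0, 3: 1.0, 4: 0.0},
         'Id': {0: 24108, 1: 24307, 2: 24307, 3: 24411, 4: 24411},
         'count': {0: 3, 1: 2, 2: 33, 3: 98, 4: 33},
         'weight': {0: 0.5, 1: 0.2, 2: 0.7, 3: 1.2, 4: 0.39}}
df1 = pd.DataFrame(dict1)

g = df1.groupby('Id')
n = g['category'].transform('nunique').gt(1).astype(int)   # 0 if single-category Id, else 1
m = (df1['count'] != g['count'].transform('min')) * 2 - 1  # -1 if count is the group min, else 1
df1['val'] = df1['count'] * (n - m * df1['weight'])
```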
CodePudding user response:
You can try this:
gbct = df1.groupby('Id')['count'].transform
df1['val'] = df1['count'] * (gbct('size') - 1 +
                             (3 - 2*gbct('rank', 'first')) * df1['weight'])
Assumptions

- There are only 1-row or 2-row groups per Id.
- That's it (e.g.: count can be the same for two rows of an Id -- in which case the first row will get the count * (1 + weight) value and the other will get count * (1 - weight)).
Explanation
Consider this:
>>> df1.assign(
...     a=gbct('size') - 1,
...     s=3 - 2*gbct('rank', 'first'),
...     val=df1['count'] * (gbct('size') - 1 +
...         (3 - 2*gbct('rank', 'first')) * df1['weight'])
... )
   category     Id  count  weight  a    s    val
0       0.0  24108      3    0.50  0  1.0   1.50
1       1.0  24307      2    0.20  1  1.0   2.40
2       0.0  24307     33    0.70  1 -1.0   9.90
3       1.0  24411     98    1.20  1 -1.0 -19.60
4       0.0  24411     33    0.39  1  1.0  45.87
In the above, a is 1 for 2-row Ids and 0 for single-row ones, and s is a sign for weight. The formula for val is then simply:
val = count * (a + s * weight)
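To see where that sign comes from: within a 2-row group, rank(method='first') returns 1.0 for the first/smaller count and 2.0 for the other, so 3 - 2*rank maps them to +1 and -1 (a standalone illustration):

```python
import pandas as pd

s = pd.Series([2, 33])         # counts of one 2-row Id
rank = s.rank(method='first')  # [1.0, 2.0]
sign = 3 - 2 * rank            # [1.0, -1.0]: +weight for the smaller count, -weight for the larger
```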
Note what happens if an Id has the same count for each row:
>>> tmp = df1.copy()
>>> tmp.loc[3, 'count'] = 33
>>> gbct = tmp.groupby('Id')['count'].transform
>>> tmp.assign(
... a=gbct('size') - 1,
... s=3 - 2*gbct('rank', 'first'),
...     val=tmp['count'] * (gbct('size') - 1 +
...         (3 - 2*gbct('rank', 'first')) * tmp['weight'])
... )
   category     Id  count  weight  a    s    val
0       0.0  24108      3    0.50  0  1.0   1.50
1       1.0  24307      2    0.20  1  1.0   2.40
2       0.0  24307     33    0.70  1 -1.0   9.90
3       1.0  24411     33    1.20  1  1.0  72.60
4       0.0  24411     33    0.39  1 -1.0  20.13