Below is my dataframe:
df = pd.DataFrame({"ID" : [1, 1, 2, 2, 2, 3, 3],
"length" : [0.7, 0.7, 0.8, 0.6, 0.6, 0.9, 0.9],
"comment" : ["typed", "handwritten", "typed", "typed", "handwritten", "handwritten", "handwritten"]})
df
ID length comment
0 1 0.7 typed
1 1 0.7 handwritten
2 2 0.8 typed
3 2 0.6 typed
4 2 0.6 handwritten
5 3 0.9 handwritten
6 3 0.9 handwritten
I want to be able to do the following:
For any group of ID, if the length are the same but the comments are different, use the "typed" formula (5 x length) for the calculated length of that group of ID, otherwise use the formula that apply to each comment to get the calculated length. typed = 5 x length, handwritten = 7*length.
Required Output will be as below:
ID length comment Calculated Length
0 1 0.7 typed 5*length
1 1 0.7 handwritten 5*length
2 2 0.8 typed 5*length
3 2 0.6 typed 5*length
4 2 0.6 handwritten 7*length
5 3 0.9 handwritten 7*length
6 3 0.9 handwritten 7*length
Thank you.
CodePudding user response:
Find the IDs
that satisfy the special condition using groupby
. Using the IDs
and the comment
, compute the Calculated length
using np.where
as follows
>>> grp_ids = df.groupby("ID")[["length", "comment"]].nunique()
>>> grp_ids
length comment
ID
1 1 2
2 2 2
3 1 1
>>> idx = grp_ids.index[(grp_ids["length"] == 1) & (grp_ids["comment"] != 1)]
>>> idx
Int64Index([1], dtype='int64', name='ID')
>>> df["Calculated length"] = np.where(
df["ID"].isin(idx) | (df["comment"] == "typed"),
df["length"] * 5,
df["length"] * 7
)
>>> df
ID length comment Calculated length
0 1 0.7 typed 3.5
1 1 0.7 handwritten 3.5
2 2 0.8 typed 4.0
3 2 0.6 typed 3.0
4 2 0.6 handwritten 4.2
5 3 0.9 handwritten 6.3
6 3 0.9 handwritten 6.3
CodePudding user response:
use np.where
if comment column exist only typed or handwritten.
import numpy as np
cond1 = df['comment'] == 'typed'
df.assign(Calculated_Length=np.where(cond1, df['length'] * 5, df['length'] * 7))
output:
ID length comment Calculated_Length
0 1 0.7 typed 3.5
1 1 0.7 handwritten 4.9
2 2 0.8 typed 4.0
3 2 0.6 typed 3.0
4 2 0.6 handwritten 4.2
5 3 0.9 handwritten 6.3
6 3 0.9 handwritten 6.3
edit after comment
cond1 = df['comment'] == 'typed'
cond2 = df.groupby('ID')['length'].transform(lambda x: (x.max() == x.min()) & (df.loc[x.index, 'comment'].eq('typed').sum() > 0))
df.assign(Caculated_Length=np.where((cond1 | cond2), df['length']*5, df['length']*7))
output:
ID length comment Caculated_Length
0 1 0.7 typed 3.5
1 1 0.7 handwritten 3.5
2 2 0.8 typed 4.0
3 2 0.6 typed 3.0
4 2 0.6 handwritten 4.2
5 3 0.9 handwritten 6.3
6 3 0.9 handwritten 6.3