Create a column based on conditions and calculation-CodePudding

Below is my dataframe:

df = pd.DataFrame({"ID" : [1, 1, 2, 2, 2, 3, 3],
                  "length" : [0.7, 0.7, 0.8, 0.6, 0.6, 0.9, 0.9],
                  "comment" : ["typed", "handwritten", "typed", "typed", "handwritten", "handwritten", "handwritten"]})
df

    ID  length  comment
0   1   0.7     typed
1   1   0.7     handwritten
2   2   0.8     typed
3   2   0.6     typed
4   2   0.6     handwritten
5   3   0.9     handwritten
6   3   0.9     handwritten

I want to be able to do the following:

For any group of ID, if the length are the same but the comments are different, use the "typed" formula (5 x length) for the calculated length of that group of ID, otherwise use the formula that apply to each comment to get the calculated length. typed = 5 x length, handwritten = 7*length.

Required Output will be as below:

    ID  length  comment         Calculated Length
0   1   0.7     typed           5*length
1   1   0.7     handwritten     5*length
2   2   0.8     typed           5*length
3   2   0.6     typed           5*length
4   2   0.6     handwritten     7*length
5   3   0.9     handwritten     7*length
6   3   0.9     handwritten     7*length

Thank you.

CodePudding user response：

Find the IDs that satisfy the special condition using groupby. Using the IDs and the comment, compute the Calculated length using np.where as follows

>>> grp_ids = df.groupby("ID")[["length", "comment"]].nunique()
>>> grp_ids
    length  comment
ID
1        1        2
2        2        2
3        1        1
>>> idx = grp_ids.index[(grp_ids["length"] == 1) & (grp_ids["comment"] != 1)]
>>> idx
Int64Index([1], dtype='int64', name='ID')
>>> df["Calculated length"] = np.where(
        df["ID"].isin(idx) | (df["comment"] == "typed"),
        df["length"] * 5,
        df["length"] * 7
    )
>>> df
   ID  length      comment  Calculated length
0   1     0.7        typed                3.5
1   1     0.7  handwritten                3.5
2   2     0.8        typed                4.0
3   2     0.6        typed                3.0
4   2     0.6  handwritten                4.2
5   3     0.9  handwritten                6.3
6   3     0.9  handwritten                6.3

CodePudding user response：

use np.where if comment column exist only typed or handwritten.

import numpy as np
cond1 = df['comment'] == 'typed'
df.assign(Calculated_Length=np.where(cond1, df['length'] * 5, df['length'] * 7))

output:

    ID  length  comment     Calculated_Length
0   1   0.7     typed       3.5
1   1   0.7     handwritten 4.9
2   2   0.8     typed       4.0
3   2   0.6     typed       3.0
4   2   0.6     handwritten 4.2
5   3   0.9     handwritten 6.3
6   3   0.9     handwritten 6.3

edit after comment

cond1 = df['comment'] == 'typed'
cond2 = df.groupby('ID')['length'].transform(lambda x: (x.max() == x.min()) & (df.loc[x.index, 'comment'].eq('typed').sum() > 0))
df.assign(Caculated_Length=np.where((cond1 | cond2), df['length']*5, df['length']*7))

output:

    ID  length  comment     Caculated_Length
0   1   0.7     typed       3.5
1   1   0.7     handwritten 3.5
2   2   0.8     typed       4.0
3   2   0.6     typed       3.0
4   2   0.6     handwritten 4.2
5   3   0.9     handwritten 6.3
6   3   0.9     handwritten 6.3