Substract the highest value within row containing a pattern against the highest value of row not con-CodePudding

I have a dataframe such as :

Groups Name               Value
G1     BLOC_Homo_sapiens  100
G1     BLOC_Chimpenzee    99
G1     BLOC_Bonobo        80
G1     Canis_lupus        20
G1     Danio_rerio        10
G2     BLOC_Homo_sapiens  30
G2     BLOC_Bonobo        29
G2     Mus_musculus       28
G2     Cules_pupiens      26
G3     BLOC_Gorrilla      300
G3     Cimex_lectularius  10
G3     Bombus_terrestris  9

And I would like to add a new column called "diff_length" for each Groups where I substract the highest Value of the Name containing the pattern "BLOC" against the highest Value of the Name which does not contain the pattern "BLOC".

For the Groups1 for instance, the highest Value with the BLOC is 100, and the highest Value without BLOC is 20. So the result is 100-20 = 80.

I should then get :

Groups Name               Value diff_length
G1     BLOC_Homo_sapiens  100   80
G1     BLOC_Chimpenzee    99    80
G1     BLOC_Bonobo        80    80
G1     Canis_lupus        20    80
G1     Danio_rerio        10    80
G2     BLOC_Homo_sapiens  30    2
G2     BLOC_Bonobo        29    2
G2     Mus_musculus       28    2 
G2     Cules_pupiens      26    2
G3     BLOC_Gorrilla      300   290
G3     Cimex_lectularius  10    290
G3     Bombus_terrestris  9     290

CodePudding user response：

You can use:

m = df['Name'].str.contains('BLOC')

df['diff_length'] = (df.groupby('Groups')['Value']
                       .transform(lambda d: d.where(m).max() - d.mask(m).max())
                    )

NB. this assumes unique indices.

Output:

   Groups               Name  Value  diff_length
0      G1  BLOC_Homo_sapiens    100         80.0
1      G1    BLOC_Chimpenzee     99         80.0
2      G1        BLOC_Bonobo     80         80.0
3      G1        Canis_lupus     20         80.0
4      G1        Danio_rerio     10         80.0
5      G2  BLOC_Homo_sapiens     30          2.0
6      G2        BLOC_Bonobo     29          2.0
7      G2       Mus_musculus     28          2.0
8      G2      Cules_pupiens     26          2.0
9      G3      BLOC_Gorrilla    300        290.0
10     G3  Cimex_lectularius     10        290.0
11     G3  Bombus_terrestris      9        290.0

Alternative syntax:

m = df['Name'].str.contains('BLOC')

df['diff_length'] = (
  df['Value'].where(m).groupby(df['Groups']).transform('max')
 -df['Value'].mask(m).groupby(df['Groups']).transform('max')
)

CodePudding user response：

here is one way to do it

# identify rows that contains the block
m1=df['Name'].str.contains('BLOC')

# groupby on Groups and the rows that has BLOC and ones that don't
# take the max for each (TRUE/FALSE) in a group, and take diff
df2=df.groupby(['Groups', m1 ] )['Value'].max().diff().reset_index()

# create a dictionary
d=dict(df2[df2['Name'].eq(True)][['Groups','Value']].values)

# map difference back to the df
df['diff_length'] = df['Groups'].map(d)
df

    Groups  Name          Value      diff_length
0   G1  BLOC_Homo_sapiens   100      80.0
1   G1  BLOC_Chimpenzee      99      80.0
2   G1  BLOC_Bonobo          80      80.0
3   G1  Canis_lupus          20      80.0
4   G1  Danio_rerio          10      80.0
5   G2  BLOC_Homo_sapiens    30       2.0
6   G2  BLOC_Bonobo          29       2.0
7   G2  Mus_musculus         28       2.0
8   G2  Cules_pupiens        26       2.0
9   G3  BLOC_Gorrilla       300     290.0
10  G3  Cimex_lectularius    10     290.0
11  G3  Bombus_terrestris     9     290.0