Home > database >  Pandas: Transform with custom maximum function
Pandas: Transform with custom maximum function

Time:01-03

I know that I can use transform for transforming every element in a group in a dataframe into the minimum value. This is done with something like

df.groupby(level=0).transform('min')

My problem is, that all of my cells are strings, in fact tuplelike strings with floats inside, like "5.48$\pm$69.1". The minimum function here would transform it by string, which is incorrect.

Is there a good way of using a custom transform function only dealing with the first part of it?

An example input is:

df = pd.DataFrame({'0.001': {('Periodic', 'Klinger'): '0.3$\\pm$0.05',
  ('Periodic', 'Malte'): '0.26$\\pm$0.06',
  ('Periodic', 'Merkelig'): '0.22$\\pm$0.12',
  ('Periodic', 'Dings'): '0.18$\\pm$0.06',
  ('Periodic', 'Elf'): '0.28$\\pm$0.11',
  ('Periodic', 'Rar'): '0.2$\\pm$0.1',
  ('Periodic', 'Merd'): '0.12$\\pm$0.14',
  ('Sequential', 'Klinger'): '0.15$\\pm$0.14',
  ('Sequential', 'Malte'): '0.1$\\pm$0.1',
  ('Sequential', 'Merkelig'): '0.26$\\pm$0.09',
  ('Sequential', 'Dings'): '0.17$\\pm$0.16',
  ('Sequential', 'Elf'): '0.15$\\pm$0.12',
  ('Sequential', 'Rar'): '0.12$\\pm$0.1',
  ('Sequential', 'Merd'): '0.21$\\pm$0.13'},
 '0.01': {('Periodic', 'Klinger'): '1.75$\\pm$1.27',
  ('Periodic', 'Malte'): '1.19$\\pm$1.51',
  ('Periodic', 'Merkelig'): '2.31$\\pm$0.54',
  ('Periodic', 'Dings'): '2.47$\\pm$0.37',
  ('Periodic', 'Elf'): '2.3$\\pm$1.3',
  ('Periodic', 'Rar'): '1.65$\\pm$0.59',
  ('Periodic', 'Merd'): '1.07$\\pm$1.68',
  ('Sequential', 'Klinger'): '1.14$\\pm$0.25',
  ('Sequential', 'Malte'): '2.99$\\pm$1.36',
  ('Sequential', 'Merkelig'): '2.85$\\pm$1.06',
  ('Sequential', 'Dings'): '2.61$\\pm$0.79',
  ('Sequential', 'Elf'): '1.62$\\pm$1.47',
  ('Sequential', 'Rar'): '1.29$\\pm$0.74',
  ('Sequential', 'Merd'): '2.88$\\pm$0.89'},
 '0.1': {('Periodic', 'Klinger'): '18.75$\\pm$12.96',
  ('Periodic', 'Malte'): '15.9$\\pm$9.8',
  ('Periodic', 'Merkelig'): '36.47$\\pm$1.42',
  ('Periodic', 'Dings'): '16.13$\\pm$13.24',
  ('Periodic', 'Elf'): '26.36$\\pm$11.08',
  ('Periodic', 'Rar'): '11.26$\\pm$12.32',
  ('Periodic', 'Merd'): '17.55$\\pm$10.78',
  ('Sequential', 'Klinger'): '36.26$\\pm$3.19',
  ('Sequential', 'Malte'): '20.2$\\pm$14.42',
  ('Sequential', 'Merkelig'): '18.62$\\pm$15.79',
  ('Sequential', 'Dings'): '5.64$\\pm$7.28',
  ('Sequential', 'Elf'): '25.55$\\pm$12.74',
  ('Sequential', 'Rar'): '19.65$\\pm$16.98',
  ('Sequential', 'Merd'): '14.53$\\pm$2.54'}})

There are three columns, 0.1, 0.01 and 0.001. There is a multiindex consisting of two values and I want the minimum values within every column for each multiindex-first-value.

Everything is done by

df.groupby(level=0).transform('min')

but the minimum function is wrong due to the format of the values

CodePudding user response:

Is this any helpful for your use case ?

import re

def min_float(x):
    floats = [float(re.search(r'([\d\.] )', val).group(1)) for val in x]
    return min(floats)

df.groupby(level=0).transform(min_float)

CodePudding user response:

Use Series.str.extract for values before $, convert to floats, so possible use min:

f = lambda x: x.str.extract('([\d\.] )\$', expand=False).astype(float).min()
df = df.groupby(level=0).transform(f)
print (df)
                     0.001  0.01    0.1
Periodic   Klinger    0.12  1.07  11.26
           Malte      0.12  1.07  11.26
           Merkelig   0.12  1.07  11.26
           Dings      0.12  1.07  11.26
           Elf        0.12  1.07  11.26
           Rar        0.12  1.07  11.26
           Merd       0.12  1.07  11.26
Sequential Klinger    0.10  1.14   5.64
           Malte      0.10  1.14   5.64
           Merkelig   0.10  1.14   5.64
           Dings      0.10  1.14   5.64
           Elf        0.10  1.14   5.64
           Rar        0.10  1.14   5.64
           Merd       0.10  1.14   5.64

If need also original values is possible use Series.sort_values with key parameter for sortying by floats and then select first value by position:

f = lambda x: x.sort_values(key = lambda y: y.str.extract('([\d\.] )\$', expand=False).astype(float)).iat[0]
df = df.groupby(level=0).transform(f)
print (df)
                             0.001           0.01              0.1
Periodic   Klinger   0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
           Malte     0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
           Merkelig  0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
           Dings     0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
           Elf       0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
           Rar       0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
           Merd      0.12$\pm$0.14  1.07$\pm$1.68  11.26$\pm$12.32
Sequential Klinger     0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
           Malte       0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
           Merkelig    0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
           Dings       0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
           Elf         0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
           Rar         0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
           Merd        0.1$\pm$0.1  1.14$\pm$0.25    5.64$\pm$7.28
  • Related