Python Pandas apply a function on non empty cells

Time:11-07

I want to apply a function to the non-empty cells of a column that contains either integer numbers or empty cells. I checked the column's data type, and it is object.

This is part of the DataFrame:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
    {'Seam_Height': {0: 72.0, 1: 108.0, 2: nan, 3: nan, 4: 84.0, 5: 96.0,
                     6: nan, 7: 108.0, 8: 120.0, 9: nan, 10: 120.0, 11: nan,
                     12: 120.0, 13: 107.0},
     'mining_method': {0: 'Longwall', 1: 'Longwall', 2: 'Longwall',
                       3: 'Longwall', 4: 'Longwall', 5: 'Longwall',
                       6: 'Longwall', 7: 'Longwall', 8: 'Longwall',
                       9: 'Longwall', 10: 'Longwall', 11: 'Longwall',
                       12: 'Longwall', 13: 'Longwall'},
     'employee_num ': {0: 508.0, 1: 161.0, 2: nan, 3: nan, 4: 547.0, 5: 354.0,
                       6: 456.0, 7: nan, 8: 515.0, 9: 515.0, 10: nan, 11: 515.0,
                       12: 515.0, 13: 235.0}}
)

    Seam_Height mining_method  employee_num 
0          72.0      Longwall          508.0
1         108.0      Longwall          161.0
2           NaN      Longwall            NaN
3           NaN      Longwall            NaN
4          84.0      Longwall          547.0
5          96.0      Longwall          354.0
6           NaN      Longwall          456.0
7         108.0      Longwall            NaN
8         120.0      Longwall          515.0
9           NaN      Longwall          515.0
10        120.0      Longwall            NaN
11          NaN      Longwall          515.0
12        120.0      Longwall          515.0
13        107.0      Longwall          235.0

This is the simple function I use to classify the seam height by thickness:

def seam_thickness_class_func(var):
    if var < 43:
        return "V_low"
    if var < 60:
        return "Low"
    if var < 72:
        return "Medium"
    else:
        return "High"



df['Seam_class'] = df.apply(lambda x: seam_thickness_class_func(x["Seam_Height"]) if(pd.notnull(x[0])) else " ", axis = 1) 

The function should be applied if the cell contains a number; if the cell is empty, it returns " ".

I get this error message when I apply the function:

TypeError: '<' not supported between instances of 'str' and 'int'

Answer:

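The error indicates that at least some cells in the column hold strings (hence the object dtype), and a string cannot be compared with an int. Let's convert the column with to_numeric and then use pd.cut instead of apply. pd.cut is specifically designed to: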

Bin values into discrete intervals.

We can bin the values:

  1. (-∞, 43) with label V_low
  2. [43, 60) with Low
  3. [60, 72) with Medium
  4. [72, ∞) with High

Note that right=False makes the upper bound of each bin non-inclusive, which matches the strictly-less-than comparisons in the function shown.

import numpy as np

# -np.inf is used for the lower bound because np.NINF was removed in NumPy 2.0
df['Seam_class'] = pd.cut(
    pd.to_numeric(df['Seam_Height'], errors='coerce'),
    bins=[-np.inf, 43, 60, 72, np.inf],
    labels=['V_low', 'Low', 'Medium', 'High'],
    right=False
)

df:

   Seam_Height mining_method  employee_num  Seam_class
0         72.0      Longwall          508.0       High
1        108.0      Longwall          161.0       High
2                   Longwall            NaN        NaN
3          NaN      Longwall            NaN        NaN
4         84.0      Longwall          547.0       High
5         96.0      Longwall          354.0       High
6          NaN      Longwall          456.0        NaN
7        108.0      Longwall            NaN       High
8        120.0      Longwall          515.0       High
9          NaN      Longwall          515.0        NaN
10       120.0      Longwall            NaN       High
11         NaN      Longwall          515.0        NaN
12       120.0      Longwall          515.0       High
13       107.0      Longwall          235.0       High
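As a quick check that right=False reproduces the function's strict < comparisons, here is a minimal sketch on boundary values. The sample heights are made up for illustration and are not taken from the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical boundary values chosen to sit on and just below each bin edge
heights = pd.Series([42.9, 43, 59.9, 60, 71.9, 72])

classes = pd.cut(
    heights,
    bins=[-np.inf, 43, 60, 72, np.inf],
    labels=['V_low', 'Low', 'Medium', 'High'],
    right=False,  # bins are closed on the left and open on the right
)

print(classes.tolist())
# ['V_low', 'Low', 'Low', 'Medium', 'Medium', 'High']
```

A value exactly on an edge (43, 60, 72) falls into the higher class, just as it would fail the corresponding `var < edge` test in the function.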

To replace the missing values with ' ', we can additionally call add_categories (the result of pd.cut is categorical, so ' ' must be a category before it can be used as a fill value) and then fillna:

import numpy as np

df['Seam_class'] = pd.cut(
    pd.to_numeric(df['Seam_Height'], errors='coerce'),
    bins=[-np.inf, 43, 60, 72, np.inf],
    labels=['V_low', 'Low', 'Medium', 'High'],
    right=False
).cat.add_categories(' ').fillna(' ')

df:

    Seam_Height mining_method  employee_num  Seam_class
0          72.0      Longwall          508.0       High
1         108.0      Longwall          161.0       High
2           NaN      Longwall            NaN           
3           NaN      Longwall            NaN           
4          84.0      Longwall          547.0       High
5          96.0      Longwall          354.0       High
6           NaN      Longwall          456.0           
7         108.0      Longwall            NaN       High
8         120.0      Longwall          515.0       High
9           NaN      Longwall          515.0           
10        120.0      Longwall            NaN       High
11          NaN      Longwall          515.0           
12        120.0      Longwall          515.0       High
13        107.0      Longwall          235.0       High
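The add_categories step matters because a categorical Series only accepts fill values that are already among its categories. A small sketch on a made-up Series (not the question's data); the exact exception type raised by the bare fillna may differ between pandas versions, so both are caught here:

```python
import pandas as pd

# Hypothetical categorical Series with one missing value
s = pd.Series(['High', None, 'Low'], dtype='category')

try:
    s.fillna(' ')  # ' ' is not a category yet
    bare_fillna_failed = False
except (TypeError, ValueError):
    bare_fillna_failed = True

# Registering ' ' as a category first makes the fill valid
filled = s.cat.add_categories(' ').fillna(' ')
print(filled.tolist())
# ['High', ' ', 'Low']
```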

If we want to fix the apply version instead, we should use Series.apply on the column after converting it with to_numeric, so that the function only ever sees numbers or NaN, and handle the null check inside the function itself:

def seam_thickness_class_func(var):
    # Test isnull here
    if pd.isnull(var):
        return ' '
    if var < 43:
        return "V_low"
    if var < 60:
        return "Low"
    if var < 72:
        return "Medium"
    return "High"


df['Seam_class'] = pd.to_numeric(
    df['Seam_Height'], errors='coerce'
).apply(seam_thickness_class_func)

df:

    Seam_Height mining_method  employee_num  Seam_class
0          72.0      Longwall          508.0       High
1         108.0      Longwall          161.0       High
2           NaN      Longwall            NaN           
3           NaN      Longwall            NaN           
4          84.0      Longwall          547.0       High
5          96.0      Longwall          354.0       High
6           NaN      Longwall          456.0           
7         108.0      Longwall            NaN       High
8         120.0      Longwall          515.0       High
9           NaN      Longwall          515.0           
10        120.0      Longwall            NaN       High
11          NaN      Longwall          515.0           
12        120.0      Longwall          515.0       High
13        107.0      Longwall          235.0       High
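For reference, here is a sketch of how the original error can arise and what to_numeric with errors='coerce' does about it. The Series below is a made-up object column; the assumption that the real column holds empty or blank strings comes from the error message, not from the data shown in the question:

```python
import pandas as pd

# Hypothetical object column mixing a number, blank strings, a numeric string, and None
raw = pd.Series([72, '', ' ', '108', None], dtype=object)

try:
    raw[1] < 43  # comparing the empty string with an int
except TypeError as exc:
    message = str(exc)
    print(message)

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.isna().tolist())
# [False, True, True, False, True]
```

After the conversion every cell is either a float or NaN, so both pd.cut and the apply version with the isnull check work without type errors.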