Home > OS >  Using apply and lambda function on Pandas dataframe column that contains missing values
Using apply and lambda function on Pandas dataframe column that contains missing values

Time:12-12

This is a follow up on this question: How to create new column based on substrings in other column in a pandas dataframe?

The dataframe has the following structure

df = pd.DataFrame({
    'Other input': ['Text A', 'Text B', 'Text C', 'Text D', 'Text E'],
    'Substance': ['(NPK) 20/10/6', NaN, '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, NaN, 0.6, 0.8, .9]
})

    Other Input  substance               value
0   Text A       (NPK) 20/10/6           0.2
1   Text B       NaN                     NaN
2   Text C       46%N / O%P2O5 (Urea)    0.6
3   Text D       46%N / O%P2O5 (Urea)    0.8
4   Text E       (NPK) DAP Diammonphosphat; 18/46/0          0.9

It was created by merging two df's with a left join and it turns out that I have rows without substance and value. I need to replace the substance with a Short Name and before having missing values in the dataset, the following code worked:

test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')

How can I make this work with NaN (or 0 if that is easier)? Is there something equivalent to na_action=None that apparently works with applymap?

CodePudding user response:

If you want to skip rows containing NaN, just add a call to dropna() before you apply(). That will create a new temporary copy of the dataframe with all rows containing NaN in any columns removed.

test['Short Name'] = test.dropna()['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')

Output:

>>> test
  Other input                           Substance  value     Te
0      Text A                       (NPK) 20/10/6    0.2  (NPK)
1      Text B                                 NaN    NaN    NaN
2      Text C                46%N / O%P2O5 (Urea)    0.6   Urea
3      Text D                46%N / O%P2O5 (Urea)    0.8   Urea
4      Text E  (NPK) DAP Diammonphosphat; 18/46/0    0.9    DAP

This will work, because assigning Series objects to DataFrame objects uses their indexes, and if you inspect the return value of the apply() call after adding dropna():

>>> test.dropna()['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
0    (NPK)
2     Urea
3     Urea
4      DAP
Name: Substance, dtype: object

Notice how it skips from 0 to 2. That's because the row at index 1 was removed, but the indexes weren't updated (which we want in this case).

CodePudding user response:

You can do:

df = df.assign(
    short_name = df.Substance.apply(
        lambda x:
            do_this_if_x_is_not_NaN(x) if x is not np.nan
            else do_this_if_x_is_NaN(x)))

with functions:

def do_this_if_x_is_not_NaN(x):
    return 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)'

def do_this_if_x_is_NaN(x):
    return np.nan # keeping the NaN, or whatever you want to return if x is NaN

df = df.assign(col_name = ...) is just an other way of expressing df['col_name'] = ....

Your df will become:

  Other input                           Substance  value short_name
0      Text A                       (NPK) 20/10/6    0.2      (NPK)
1      Text B                                 NaN    NaN        NaN
2      Text C                46%N / O%P2O5 (Urea)    0.6       Urea
3      Text D                46%N / O%P2O5 (Urea)    0.8       Urea
4      Text E  (NPK) DAP Diammonphosphat; 18/46/0    0.9        DAP
  • Related