This is a follow up on this question: How to create new column based on substrings in other column in a pandas dataframe?
The dataframe has the following structure
df = pd.DataFrame({
'Other input': ['Text A', 'Text B', 'Text C', 'Text D', 'Text E'],
'Substance': ['(NPK) 20/10/6', NaN, '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
'value': [0.2, NaN, 0.6, 0.8, .9]
})
Other Input substance value
0 Text A (NPK) 20/10/6 0.2
1 Text B NaN NaN
2 Text C 46%N / O%P2O5 (Urea) 0.6
3 Text D 46%N / O%P2O5 (Urea) 0.8
4 Text E (NPK) DAP Diammonphosphat; 18/46/0 0.9
It was created by merging two df's with a left join and it turns out that I have rows without substance and value. I need to replace the substance with a Short Name and before having missing values in the dataset, the following code worked:
test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
How can I make this work with NaN (or 0 if that is easier)? Is there something equivalent to na_action=None
that apparently works with applymap?
CodePudding user response:
If you want to skip rows containing NaN, just add a call to dropna()
before you apply()
. That will create a new temporary copy of the dataframe with all rows containing NaN in any columns removed.
test['Short Name'] = test.dropna()['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
Output:
>>> test
Other input Substance value Te
0 Text A (NPK) 20/10/6 0.2 (NPK)
1 Text B NaN NaN NaN
2 Text C 46%N / O%P2O5 (Urea) 0.6 Urea
3 Text D 46%N / O%P2O5 (Urea) 0.8 Urea
4 Text E (NPK) DAP Diammonphosphat; 18/46/0 0.9 DAP
This will work, because assigning Series
objects to DataFrame
objects uses their indexes, and if you inspect the return value of the apply()
call after adding dropna()
:
>>> test.dropna()['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
0 (NPK)
2 Urea
3 Urea
4 DAP
Name: Substance, dtype: object
Notice how it skips from 0 to 2. That's because the row at index 1 was removed, but the indexes weren't updated (which we want in this case).
CodePudding user response:
You can do:
df = df.assign(
short_name = df.Substance.apply(
lambda x:
do_this_if_x_is_not_NaN(x) if x is not np.nan
else do_this_if_x_is_NaN(x)))
with functions:
def do_this_if_x_is_not_NaN(x):
return 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)'
def do_this_if_x_is_NaN(x):
return np.nan # keeping the NaN, or whatever you want to return if x is NaN
df = df.assign(col_name = ...)
is just an other way of expressing df['col_name'] = ...
.
Your df
will become:
Other input Substance value short_name
0 Text A (NPK) 20/10/6 0.2 (NPK)
1 Text B NaN NaN NaN
2 Text C 46%N / O%P2O5 (Urea) 0.6 Urea
3 Text D 46%N / O%P2O5 (Urea) 0.8 Urea
4 Text E (NPK) DAP Diammonphosphat; 18/46/0 0.9 DAP