I have a dataframe of the following structure:
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30', '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.8, 0.9]
})
Substance value
0 (NPK) 20/10/6 0.2
1 (NPK) Guayacan 10/20/30 0.4
2 46%N / O%P2O5 (Urea) 0.6
3 46%N / O%P2O5 (Urea) 0.8
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9
Now I want to create a new column with a short name for each substance:
test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if
any(i in x for i in 'Urea') else '(NPK)')
There are two issues with the last line of code. First of all, the output looks like this:
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 Urea
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 (NPK)
So the second entry was also labeled with Urea although it should be NPK.
Furthermore, my actual data also produces the following warning, which, annoyingly, I can't reproduce with the dummy data despite using the original substance names.
/var/folders/tf/hzv31v4x42q4_mnw4n8ldhsm0000gn/T/ipykernel_10743/136042259.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Note: Since I have further substances, I will have to add more conditions to the if/else expression.
Edit: The substance names need to be mapped to the following list of short names:
- Urea if Substance includes Urea
- Calcium ammonium nitrate (CAN) if Substance includes CAN
- Di-ammonium phosphate (DAP) if Substance includes DAP
- Other complex NK, NPK fertilizer for all other cases
Expected output for the sample data would be
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 (NPK)
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 (NPK)
Edit2: I would then like to add a statement such that I receive the following output:
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 (NPK)
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 DAP
CodePudding user response:
Try this:
df['Short Name'] = df['Substance'].str.extract(r'\((.+?)\)')
Output:
>>> df
Substance value Short Name
0 (NPK) 20/10/6 0.2 NPK
1 (NPK) Guayacan 10/20/30 0.4 NPK
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) 20/10/6 0.9 NPK
CodePudding user response:
A plain substring check works for me:
df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else '(NPK)')
>>> df
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 (NPK)
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) 20/10/6 0.9 (NPK)
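As for why the lambda in the question mislabels the second row: iterating over the string 'Urea' yields its individual characters, so the any(...) test is true as soon as the substance contains a single one of the letters 'U', 'r', 'e' or 'a'. The sketch below illustrates this on one of the sample names; the variable names are only for illustration.

```python
name = '(NPK) Guayacan 10/20/30'

# Iterating over a string yields its characters one at a time,
# so this collects every letter of 'Urea' that occurs in the name.
chars_found = [c for c in 'Urea' if c in name]

# Character-wise test from the question: true if any single letter matches.
charwise = any(c in name for c in 'Urea')

# Substring test: true only if the whole word 'Urea' occurs.
substring = 'Urea' in name

print(chars_found, charwise, substring)
```

Here only the letter 'a' (from 'Guayacan') matches, which is enough for any(...) to return True, while the substring test correctly returns False.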
Alternatively, with a regex:
import re
short = re.compile(r"urea", re.I)  # re.I makes the match case-insensitive
df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if short.search(x) else '(NPK)')
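Regarding the SettingWithCopyWarning: the question does not show how `test` was created, so this is only a guess, but the warning typically appears when a column is assigned on a DataFrame that was itself obtained by slicing or filtering another one. Under that assumption, taking an explicit copy usually avoids it. A minimal sketch, where `full` and the filter condition are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the original, unfiltered data.
full = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '46%N / O%P2O5 (Urea)'],
    'value': [0.2, 0.6],
})

# Hypothetical filtering step. The explicit .copy() makes `test` an
# independent DataFrame, so adding a column does not act on a slice of `full`.
test = full[full['value'] > 0.1].copy()
test['Short Name'] = test['Substance'].apply(
    lambda x: 'Urea' if 'Urea' in x else '(NPK)'
)
```

The dummy data is built directly with pd.DataFrame(...), which would explain why the warning does not show up there.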
CodePudding user response:
Not the neatest solution but at least a solution:
test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
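Since more substances are expected, the chained conditional could also be replaced by a keyword lookup, so that adding a substance means adding one dictionary entry. This is only a sketch: the names SHORT_NAMES and short_name are made up here, the 'CAN' label is an assumption (the sample data contains no CAN row), and the labels follow the expected output from Edit2.

```python
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30',
                  '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)',
                  '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.8, 0.9],
})

# Keyword -> short name. Keywords are checked in insertion order and the
# first one found in the substance name wins.
SHORT_NAMES = {
    'Urea': 'Urea',
    'CAN': 'CAN',   # assumed label; no CAN row exists in the sample data
    'DAP': 'DAP',
}

def short_name(substance, default='(NPK)'):
    """Return the label of the first keyword contained in `substance`."""
    for keyword, label in SHORT_NAMES.items():
        if keyword in substance:  # case-sensitive substring test
            return label
    return default

df['Short Name'] = df['Substance'].apply(short_name)
print(df)
```

The test is deliberately case-sensitive: 'Guayacan' contains 'can' in lower case, so a case-insensitive match would wrongly label that row as CAN. The dictionary values can be swapped for the longer names from the list in the question if those are preferred.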