I have a large dataframe of temperature measurements in which some of the values are missing. The values are in two separate columns: one has the actual measurements (TEMP), while the other has only estimated temperatures (TEMP_ESTIMATED).
I'm trying to create a new column that combines these two values, so that the new column holds the actual measurement if it exists (is not NaN) and the estimated value otherwise. Example of the dataframe and how I would want it to look after the for-loop.
I have tried many different ways to do this but none of them have worked so far. I'm still new to programming so I apologize if there are some obvious mistakes, just trying to learn more!
Here is my latest attempt, which did not add the values to the new column (I have already imported pandas, and all the temperature data is stored in a DataFrame named data):
for i in range(len(data)):
    if data.at[i, 'TEMP'] == 'NaN':
        data.at[i, 'TEMP_ALL'] = data.at[i, 'TEMP_ESTIMATED']
    else:
        data.at[i, 'TEMP_ALL'] = data.at[i, 'TEMP']
I would greatly appreciate any feedback on this, or any alternative ways to achieve the desired result. Thank you!
CodePudding user response:
You can try using np.where:
import pandas as pd
import numpy as np
# Small sample frame: the first three measurements are missing (NaN)
df = pd.DataFrame(data={'DATE': ['20100101', '20100102', '20100103', '20100104', '20100105'],
                        'TEMP': [np.nan, np.nan, np.nan, 15, 20],
                        'TEMP_ESTIMATED': [10, 15, 16, 17, 22]})
df = df.rename_axis('index')
# Take the estimate where TEMP is NaN, otherwise keep the measurement
df['TEMP_ALL'] = np.where(np.isnan(df.TEMP), df.TEMP_ESTIMATED, df.TEMP)
index | DATE | TEMP | TEMP_ESTIMATED | TEMP_ALL |
---|---|---|---|---|
0 | 20100101 | nan | 10 | 10 |
1 | 20100102 | nan | 15 | 15 |
2 | 20100103 | nan | 16 | 16 |
3 | 20100104 | 15 | 17 | 15 |
4 | 20100105 | 20 | 22 | 20 |
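As for why the loop in the question probably never took the first branch: assuming TEMP holds real float NaN values (not strings), comparing with == 'NaN' is always False, and NaN is not even equal to itself. A minimal sketch of the difference, where the variable names are only for illustration:

```python
import numpy as np
import pandas as pd

value = np.nan  # a missing measurement, as pandas stores it in a float column

string_match = (value == 'NaN')  # a float never equals a string
self_match = (value == value)    # NaN does not equal itself either
detected = pd.isna(value)        # pd.isna is the reliable check for missing values

print(string_match, self_match, detected)
```

So inside a loop the condition would need to be something like pd.isna(data.at[i, 'TEMP']), although the vectorized np.where version avoids the loop entirely.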
If your NaN values are strings, try:
df['TEMP_ALL'] = np.where(df.TEMP == 'NaN', df.TEMP_ESTIMATED, df.TEMP)
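Another option that might be worth trying is Series.fillna, which fills missing values from another column aligned on the index. The sketch below assumes TEMP may contain a mix of real NaN values and the string 'NaN', so it first converts the column with pd.to_numeric(errors='coerce'); the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real NaN, one 'NaN' string, two real measurements
df = pd.DataFrame({'TEMP': [np.nan, 'NaN', 15, 20],
                   'TEMP_ESTIMATED': [10, 15, 17, 22]})

# Convert to numbers; anything that is not a valid number becomes NaN
temp = pd.to_numeric(df['TEMP'], errors='coerce')

# Keep the measurement where present, otherwise use the estimate from the same row
df['TEMP_ALL'] = temp.fillna(df['TEMP_ESTIMATED'])

print(df)
```

If the column is already numeric, df['TEMP'].fillna(df['TEMP_ESTIMATED']) on its own should be enough.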