I was having trouble with NaN values: asking for df['column'] was only showing NaN. I've narrowed it down to this specific part of the code, and I think it has something to do with the way I have mapped the data. Does anyone have any idea?
My code is below:
df['country_code'] = df['country_code'].replace(['?'], )
(There were some '?' values, so I wanted to make these empty so that I could later replace them with the mean once I'd converted everything to integers.)
country_code_map = {'AUS': 1, 'USA': 2, 'CAN': 3, 'BGD': 4, 'BRZ': 5, 'JP': 6, 'ID': 7, 'HR': 8, 'CH': 9, 'FRA': 10, 'FIN': 11}
df['country_code'] = df['country_code'].map(country_code_map)
df['country_code'] = pd.to_numeric(df['country_code'])
df['country_code'] = df['country_code'].replace([''], df['country_code'].mean)
Let me know if any extra info is required.
CodePudding user response:
I've created df['country_code'] in the following way; you should have something similar:
import pandas as pd
d = {'country_code': ["?", "BRZ", "USA"]}
df = pd.DataFrame(data=d)
print(df)
Output:
country_code
0 ?
1 BRZ
2 USA
Now if I execute your code, this is what I get:
country_code
0 NaN
1 5.0
2 2.0
You're getting a NaN value in the output instead of a mean over the column for the following reason.
Let's take a look at this line:
df['country_code'] = df['country_code'].replace(['?'], )
print(df)
Output:
country_code
0 NaN
1 5.0
2 2.0
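Here the ? entries don't end up as empty strings: by the time the mapping has run, those positions hold NaN (map returns NaN for any value that has no key in the dictionary).
So when you get to the last line, you're trying to replace empty strings '', but what you actually have are NaNs. Note also that mean is missing its parentheses on that line, so it is never called. What you should use instead is fillna, to fill the NaNs, like this: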
df['country_code'] = df['country_code'].replace(['?'], )
country_code_map = {'AUS': 1, 'USA': 2, 'CAN': 3, 'BGD': 4, 'BRZ': 5, 'JP': 6, 'ID': 7, 'HR': 8, 'CH': 9, 'FRA': 10, 'FIN': 11}
df['country_code'] = df['country_code'].map(country_code_map)
df['country_code'] = pd.to_numeric(df['country_code'])
df['country_code'] = df['country_code'].fillna(df['country_code'].mean())
Output:
country_code
0 3.5
1 5.0
2 2.0
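As a small self-contained check of the explanation above, here is a minimal sketch. The three-row Series and the shortened two-entry map are made up for illustration, not taken from your data. It shows that map gives NaN for a value with no key in the dictionary, that replacing '' leaves that NaN untouched, and that fillna fills it:

```python
import pandas as pd

# Hypothetical three-row column, mirroring the toy example above
codes = pd.Series(["?", "BRZ", "USA"])

# Shortened map for illustration; '?' is deliberately not a key
small_map = {"USA": 2, "BRZ": 5}

# '?' has no entry in the map, so it becomes NaN
mapped = codes.map(small_map)

# Replacing empty strings does nothing here: the gap is NaN, not ''
after_replace = mapped.replace([""], mapped.mean())

# fillna targets NaN directly; mean() skips NaN, so it is (5 + 2) / 2
after_fillna = mapped.fillna(mapped.mean())

print(mapped.isna().tolist())         # [True, False, False]
print(after_replace.isna().tolist())  # [True, False, False]
print(after_fillna.tolist())          # [3.5, 5.0, 2.0]
```

If you would rather not fill with a mean of arbitrary codes, the same fillna call accepts any other scalar.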
CodePudding user response:
So I realised the issue was in my mapping and in converting to an integer. The conversion isn't needed as a separate step: it happens automatically once I have mapped the data.
Therefore the code should look like this:
country_code_map = {'AUS': 1, 'USA': 2, 'CAN': 3, 'BGD': 4, 'BRZ': 5, 'JP': 6, 'ID': 7, 'HR': 8, 'CH': 9, 'FRA': 10, 'FIN': 11}
df['country_code'] = df['country_code'].map(country_code_map)
Then I can check the mean without getting the NaN values as I was before:
df['country_code'].mean()
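To double-check that the mapped column is already numeric, something like the sketch below should work. The toy DataFrame and the shortened map are assumptions for illustration only:

```python
import pandas as pd

# Toy data, assumed for illustration; '?' is not a key in the map
df = pd.DataFrame({"country_code": ["?", "BRZ", "USA"]})
country_code_map = {"USA": 2, "BRZ": 5}  # shortened map

df["country_code"] = df["country_code"].map(country_code_map)

# The mapped column is numeric already, so pd.to_numeric adds nothing here
is_numeric = pd.api.types.is_numeric_dtype(df["country_code"])

# mean() skips NaN by default (skipna=True)
column_mean = df["country_code"].mean()

print(is_numeric, column_mean)
```

Note that mean needs its parentheses: without them the expression refers to the method itself rather than calling it. Any unmapped rows are still NaN in the column after this; mean() just ignores them.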