I have data with missing values in the salary1 column, and I am trying to replace them with values from the range between the column's minimum and maximum. I tried the code below but got an error. Could someone please help me? Thanks in advance.
Code:
a = df['salary1'].max()
b = df['salary1'].min()
df['salary1'] = df['salary1'].replace(np.nan, range(b,a))
Error:
TypeError: 'float' object cannot be interpreted as an integer
Data:
        salary1  Experience
0           NaN           2
1      100000.0           3
2           NaN           4
3           NaN           4
4           NaN           1
...         ...         ...
12884       NaN           1
12885       NaN           3
12886  150000.0           2
12887       NaN           2
12888       NaN           4
CodePudding user response:
range returns a sequence, not a single random value. Use np.random.rand instead, and scale its output from [0, 1) to the salary range.
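A minimal sketch of that suggestion, assuming a uniform draw between the column's minimum and maximum is acceptable. The sample frame and the names lo, hi and mask are illustrative, not taken from the question:

```python
import numpy as np
import pandas as pd

# Illustrative sample data, not the asker's real frame.
df = pd.DataFrame({"salary1": [np.nan, 100000.0, np.nan, 150000.0, np.nan]})

# min() and max() skip NaN by default.
lo, hi = df["salary1"].min(), df["salary1"].max()
mask = df["salary1"].isna()

# np.random.rand(n) returns n floats in [0, 1); scale them into [lo, hi).
df.loc[mask, "salary1"] = lo + (hi - lo) * np.random.rand(int(mask.sum()))
```

Each missing row gets its own draw, and the existing salaries are left untouched.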
CodePudding user response:
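range is the wrong function to use. list(range(a, b)) returns the sequence of integers from a up to (but not including) b; for example, list(range(5, 10)) returns [5, 6, 7, 8, 9]. It also requires integer arguments, while the min and max of a column containing NaN are floats, which is what raises the TypeError.
Try numpy.random.randint instead: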
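import numpy as np
import pandas as pd
# Compute the bounds once and cast to int; randint excludes the upper bound.
lo, hi = int(df['salary1'].min()), int(df['salary1'].max())
df['salary1'] = df['salary1'].apply(lambda x: x if pd.notnull(x) else np.random.randint(lo, hi))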
Example:
>>> df
salary1
0 100000.0
1 NaN
2 150000.0
3 NaN
4 175000.0
5 NaN
>>> df['salary1'].apply(lambda x: x if pd.notnull(x) else np.random.randint(df['salary1'].min(), df['salary1'].max()))
0 100000.0
1 119091.0
2 150000.0
3 171438.0
4 175000.0
5 114396.0
Name: salary1, dtype: float64
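For a large frame (the question has about 12,900 rows), the same idea can be sketched without apply by drawing all the random values in one call. This is an alternative under the same assumptions, not a replacement for the answer above; the names lo, hi, mask and fill are illustrative:

```python
import numpy as np
import pandas as pd

# Same sample data as the example above.
df = pd.DataFrame({"salary1": [100000.0, np.nan, 150000.0, np.nan, 175000.0, np.nan]})

lo, hi = int(df["salary1"].min()), int(df["salary1"].max())
mask = df["salary1"].isna()

# One independent integer per missing row; randint excludes hi.
fill = np.random.randint(lo, hi, size=int(mask.sum()))
df.loc[mask, "salary1"] = fill
```

The column stays float64 because the integers are assigned into an existing float column.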
CodePudding user response:
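The code below draws one random integer in the range [b, a] (minimum to maximum, both inclusive) and uses that same value for every missing entry. random.randint needs integer arguments with the smaller one first:
import random
df['salary1'] = df['salary1'].replace(np.nan, random.randint(int(b), int(a)))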
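A small sketch of what a scalar replacement does, assuming a and b are the maximum and minimum as in the question. The sample frame and the names value and filled are illustrative:

```python
import random

import numpy as np
import pandas as pd

# Illustrative sample data, not the asker's real frame.
df = pd.DataFrame({"salary1": [np.nan, 100000.0, np.nan, 150000.0, np.nan]})

a = df["salary1"].max()
b = df["salary1"].min()

# random.randint needs integers with low <= high and includes both endpoints.
value = random.randint(int(b), int(a))

# The scalar is drawn once, so every NaN receives the same number.
filled = df["salary1"].replace(np.nan, value)
```

If each missing row should get a different value, the draw has to happen per row rather than once.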