This is a very naive question but after referring to multiple articles, I am raising this concern. I have a column in the dataset where the column has numeric/blank/null
values.
I have data like below:
fund_value
Null
123
-10
I wrote a method to handle it but it doesn't work and keeps on giving me the error:
def values(x):
if x:
if int(x) > 0:
return 'Positive'
elif int(x) < 0:
return 'Negative'
else:
return 'Zero'
df2 = pd.read_csv('/home/siddhesh/Downloads/s2s_results.csv') # Assuming it as query results
df2 = df2.astype(str)
df2['fund_value'] = df2.fund_value.apply(values)
Error
:
Traceback (most recent call last):
File "/home/../Downloads/pyspark/src/sample/actual_dataset_testin.py", line 31, in <module>
df2['fund_value'] = df2.fund_value.apply(values)
File "/home/../.local/lib/python3.8/site-packages/pandas/core/series.py", line 4357, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "/home/../.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1043, in apply
return self.apply_standard()
File "/home/../.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1099, in apply_standard
mapped = lib.map_infer(
File "pandas/_libs/lib.pyx", line 2859, in pandas._libs.lib.map_infer
File "/home/../Downloads/pyspark/src/sample/actual_dataset_testin.py", line 16, in values
if int(x) > 0:
ValueError: invalid literal for int() with base 10: 'nan'
I even tried if x==""
or if not x:
but nothing worked.
Expected Output
:
fund_value
Zero
Positive
Negative
CodePudding user response:
Considering df to be:
In [1278]: df = pd.DataFrame({'fund_value': [np.nan, 123, '', 10]})
In [1279]: df
Out[1279]:
fund_value
0 NaN
1 123
2
3 10
Use numpy.select
with pd.to_numeric
:
In [1246]: import numpy as np
In [1283]: df['fund_value'] = pd.to_numeric(df.fund_value, errors='coerce')
In [1247]: conds = [df.fund_value.gt(0), df.fund_value.lt(0)]
In [1250]: choices = ['Positive', 'Negative']
In [1261]: df['fund_value'] = np.select(conds, choices, default='Zero')
In [1288]: df
Out[1288]:
fund_value
0 Zero
1 Positive
2 Zero
3 Positive
CodePudding user response:
You are facing a problem of NaN
support with int
. That is something that does not work...
Your solution: Fill your "missing" values using pd.fillna()
. Fill those values with something (e.g., with 0
), or remove them. Just read the values as float
, which has native NaN
support, then fill or remove those NaN
Background: The fact that you first cast the column to str
, but then in your check function convert it back to int
, which gives you a NaN
error, looks like a workaround... Here is what causes the problem: Reading directly as int
won't work, as int
does not understand NaN
--> see Int
with capital I
Exampe: Assume you have a 'dirty int' input, that includes NaN
, like this
df = pd.DataFrame({'fund_value': [None, 123, 10]})
fund_value
0 NaN
1 123.0
2 10.0
Pandas will do you the courtesy of converting this to float
, given all values are numeric, and fills the "gaps" (None
or np.nan
) with NaN
. You get something to screen, but in fact it is a column of float
, not int
.
Option 1: How to "convert" NaN
values to '0' integer (for your case distinction between 'positive' or 'negative')
df.fillna(0).astype('int')
Option 2: Directly cast a column with NaN
values to Int
:
df.astype('Int32')
You can then work with either one of the datasets, which truly contain integers (option 1 with assuming all NaN
==0, second one with true <NA>
, not floats
)
CodePudding user response:
Your df2.astype(str)
turns everything into string, and when you apply values(...)
to the contents of the column which are all string, the first if
-check will only return you a False
if it's an empty string which is not the case for str(np.nan)
. Converting np.nan
into string gives you a non-empty string 'nan'
.
'nan'
will pass your first if
-check, and then in the second if
-check, it finds itself not convertible into an int
and python returns you an error.
To take care of that,
x = df['fund_value'].replace('', np.nan).astype(float)
(x > 0).map({True: 'Positive'}).fillna( (x < 0).map({True: 'Negative'}) ).fillna('Zero')