How to handle empty string type data in pandas python

This may be a naive question, but I could not resolve it after reading several articles. I have a dataset column that contains numeric, blank, and null values, like this:

fund_value

Null
123
-10

I wrote a function to handle it, but it keeps raising an error:

def values(x):
    if x:
        if int(x) > 0:
            return 'Positive'
        elif int(x) < 0:
            return 'Negative'
    else:
        return 'Zero'

df2 = pd.read_csv('/home/siddhesh/Downloads/s2s_results.csv')  # Assuming it as query results
df2 = df2.astype(str)
df2['fund_value'] = df2.fund_value.apply(values)

Error:

Traceback (most recent call last):
  File "/home/../Downloads/pyspark/src/sample/actual_dataset_testin.py", line 31, in <module>
    df2['fund_value'] = df2.fund_value.apply(values)
  File "/home/../.local/lib/python3.8/site-packages/pandas/core/series.py", line 4357, in apply
    return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
  File "/home/../.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1043, in apply
    return self.apply_standard()
  File "/home/../.local/lib/python3.8/site-packages/pandas/core/apply.py", line 1099, in apply_standard
    mapped = lib.map_infer(
  File "pandas/_libs/lib.pyx", line 2859, in pandas._libs.lib.map_infer
  File "/home/../Downloads/pyspark/src/sample/actual_dataset_testin.py", line 16, in values
    if int(x) > 0:
ValueError: invalid literal for int() with base 10: 'nan'

I also tried the checks if x == "" and if not x, but neither worked.

Expected Output:

fund_value  
Zero
Positive
Negative

CodePudding user response：

Considering df to be:

In [1278]: df = pd.DataFrame({'fund_value': [np.nan, 123, '', 10]})

In [1279]: df
Out[1279]: 
  fund_value
0        NaN
1        123
2           
3         10

Use numpy.select with pd.to_numeric:

In [1246]: import numpy as np

In [1283]: df['fund_value'] = pd.to_numeric(df.fund_value, errors='coerce')

In [1247]: conds = [df.fund_value.gt(0), df.fund_value.lt(0)]
In [1250]: choices = ['Positive', 'Negative']

In [1261]: df['fund_value'] = np.select(conds, choices, default='Zero')

In [1288]: df
Out[1288]: 
  fund_value
0       Zero
1   Positive
2       Zero
3   Positive
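
For reference, the same approach can be written as a standalone script. This is only a minimal sketch: the sample values (blank, null, the text 'Null', 123, -10) are assumed from the question rather than read from the CSV, and the classify helper and the label column are hypothetical names.

```python
import numpy as np
import pandas as pd


def classify(series):
    """Label each value 'Positive', 'Negative', or 'Zero' (hypothetical helper)."""
    # Blanks and non-numeric text become NaN instead of raising an error.
    numeric = pd.to_numeric(series, errors="coerce")
    conds = [numeric.gt(0), numeric.lt(0)]
    choices = ["Positive", "Negative"]
    # NaN fails both comparisons, so it falls through to the default.
    return np.select(conds, choices, default="Zero")


# Assumed sample data modeled on the question, not the real CSV.
df = pd.DataFrame({"fund_value": ["", np.nan, "Null", 123, -10]})
df["label"] = classify(df["fund_value"])
print(df)
```

Writing the result to a separate column keeps the original values available for inspection; assign back to fund_value instead if you want to overwrite them as in the session above.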

CodePudding user response：

The problem is NaN support in integers: neither Python's int() nor NumPy's integer dtypes can represent NaN.

Your solution: Fill the missing values using DataFrame.fillna() (e.g., with 0), or remove them with dropna(). Read the values as float, which has native NaN support, and then fill or remove the NaN values.

Background: Casting the column to str first and then converting it back to int inside your check function looks like a workaround, and it is what produces the ValueError on 'nan'. The underlying cause is that reading the column directly as int won't work, because the plain int dtype cannot hold NaN. Pandas' nullable integer dtypes, spelled with a capital I (e.g., 'Int64'), can.

Example: Assume you have a 'dirty int' input that includes NaN, like this:

df = pd.DataFrame({'fund_value': [None, 123, 10]})
   fund_value
0         NaN
1       123.0
2        10.0

Pandas does you the courtesy of converting this to float, since all values are numeric, and fills the gaps (None or np.nan) with NaN. The values print fine, but the column is actually float, not int.

Option 1: Convert NaN values to the integer 0 (enough for your distinction between 'positive' and 'negative'):

df.fillna(0).astype('int')

Option 2: Directly cast a column with NaN values to Int:

df.astype('Int32')

You can then work with either dataset, both of which truly contain integers (Option 1 assumes every NaN is 0; Option 2 keeps true <NA> values instead of floats).
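
A short sketch of both options side by side may help. It assumes a small frame like the one above, with -10 swapped in from the question's data; the variable names filled and nullable are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"fund_value": [None, 123, -10]})
print(df["fund_value"].dtype)  # float64, because None is stored as NaN

# Option 1: treat missing values as 0, then cast to a plain integer dtype.
filled = df["fund_value"].fillna(0).astype("int64")

# Option 2: keep missing values as <NA> using the nullable integer dtype.
nullable = df["fund_value"].astype("Int64")

print(filled.tolist())
print(nullable.isna().tolist())
```

Option 1 loses the difference between a real 0 and a missing value, which is fine if both should be labeled 'Zero'. Option 2 keeps that difference, so later comparisons need to account for <NA>.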

CodePudding user response：

Your df2.astype(str) turns everything into strings. When you apply values(...) to the column, the first if check is False only for an empty string, which is not the case for str(np.nan): converting np.nan to a string gives the non-empty string 'nan'.

'nan' passes your first if check, and then in the second if check int() cannot convert it, so Python raises a ValueError.
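
The behaviour described above can be checked in isolation. This is a small sketch with illustrative variable names (text, error_message) and no DataFrame involved:

```python
import numpy as np

text = str(np.nan)
print(repr(text))   # the three-character string 'nan'
print(bool(text))   # True: any non-empty string is truthy
print(bool(""))     # False: only the empty string fails the check

try:
    int(text)
except ValueError as exc:
    # Same failure as in the question's traceback.
    error_message = str(exc)
    print(error_message)
```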

To take care of that, convert the column to float and build the labels from comparisons:

x = df['fund_value'].replace('', np.nan).astype(float)

(x > 0).map({True: 'Positive'}).fillna( (x < 0).map({True: 'Negative'}) ).fillna('Zero')
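
If you would rather keep the per-row apply from the question, one possible rewrite of values is sketched below. It is not the code from the question: it parses with float() instead of int(), and it assumes that blanks, nulls, and non-numeric text such as 'Null' should all be labeled 'Zero'. The sample frame is made up for illustration.

```python
import math

import pandas as pd


def values(x):
    """Classify one raw cell; anything that is not a usable number counts as 'Zero'."""
    try:
        number = float(x)  # handles '123', '-10', 'nan', and real numbers
    except (TypeError, ValueError):
        # None raises TypeError; '' and 'Null' raise ValueError.
        return "Zero"
    if math.isnan(number) or number == 0:
        return "Zero"
    return "Positive" if number > 0 else "Negative"


# Assumed sample data; strings mimic the effect of df2.astype(str).
df2 = pd.DataFrame({"fund_value": ["", "nan", "Null", "123", "-10"]})
df2["fund_value"] = df2["fund_value"].apply(values)
print(df2)
```

Because float('nan') parses successfully, the explicit math.isnan check is what keeps the 'nan' strings produced by astype(str) from being labeled 'Negative'.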