Home > Blockchain >  How to deal with string integers in a dataframe when using the pandas query method?
How to deal with string integers in a dataframe when using the pandas query method?

Time:08-13

import numpy as np
import pandas as pd
numbers = ["1", "2", "3", "4", "5", "6", "Missing Value", "7"]
df = pd.DataFrame(numbers, columns=["Numbers"])
new_df = df.query("Numbers > 4")
print(new_df)

When you run the above code, you get the following error message: TypeError: '>' not supported between instances of 'str' and 'int'. This is because the numbers in the dataframe are all strings, but in the query, we are checking if the number string in the column is greater than the integer 4.

How would you deal with a situation like this given that you are required to use the query method and don't want to change all the numbers to integers. I.e. something like this:

new_df = df.query("int(Numbers) > 4")

which doesn't work by the way. Also, it would need to handle occassional strings that can't be converted to an integer like "Missing Value".

CodePudding user response:

We have to_numeric

df['Numbers'] = pd.to_numeric(df['Numbers'], errors = 'coerce')
new_df = df.query("Numbers > 4")
new_df
Out[10]: 
   Numbers
4      5.0
5      6.0
7      7.0

CodePudding user response:

import numpy as np
import pandas as pd

numbers = ["1", "2", "3", "4", "5", "6", "Missing Value", "7"]
df = pd.DataFrame(numbers, columns=["Numbers"])
df.Numbers = df.Numbers.apply(
    lambda n: np.NaN if n == 'Missing Value' else float(n)
)
new_df = df.query("Numbers > 4")
print(new_df)

prints

index Numbers
4 5.0
5 6.0
7 7.0

CodePudding user response:

You're going to have issues if you keep the column as str. For example:

>>> '14' > '4'
False

If you really want to, you can specify a string '4' in your query:

>>> df.query("Numbers > '4'")
         Numbers
4              5
5              6
6  Missing Value
7              7

However, I think you should just accept that Missing Value should be NaN and follow BENY's answer.


If you're not forced to use query due to some school requirement or something... there are other ways as well, where you can do the casting without actually applying it to the column:

>>> out = df[pd.to_numeric(df.Numbers, errors='coerce').gt(4)]
>>> out
  Numbers
4       5
5       6
7       7

# Your "Numbers" are still strings:
>> out.dtypes
Numbers    object
dtype: object
  • Related