pandas return dataframe rows that are not numbers between 0 and 1

I have a dataframe:

  Region     Value       ID
    Asia       N/A    XYAZZ
    Asia       '0'    ZSJDS
    EMEA       '2'    KKKKK
    USME    '0.50'    KLJSA
    APAC    NODATA    JKUUS
    APAC    '1.00'    JKUUC
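
For anyone who wants to reproduce this, the frame can be built roughly as follows. This is only a sketch: it assumes every entry in 'Value' is a plain string (the quotes in the table just mark string-numbers) and that N/A is stored as the literal string 'N/A' rather than as a missing value.

```python
import pandas as pd

# Sample data from the table above; all 'Value' entries are assumed to be strings.
df = pd.DataFrame(
    {
        "Region": ["Asia", "Asia", "EMEA", "USME", "APAC", "APAC"],
        "Value": ["N/A", "0", "2", "0.50", "NODATA", "1.00"],
        "ID": ["XYAZZ", "ZSJDS", "KKKKK", "KLJSA", "JKUUS", "JKUUC"],
    }
)
print(df)
```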

The 'Value' column holds strings: each may be a number stored as a string, N/A, or any other string. I need code that returns every row whose value is NOT in the range 0 < x <= 1, like this:

  Region     Value       ID
    Asia       N/A    XYAZZ
    Asia       '0'    ZSJDS
    EMEA       '2'    KKKKK
    APAC    NODATA    JKUUS

I tried something like this:

  data_types_dict = {'Value': float}
  df = df.astype(data_types_dict, errors='ignore')
  df = df[(df['Value'] <= 0) | (df['Value'] > 1)]

but it only works when the column contains nothing but numbers and string-numbers; it fails when converting NODATA to float. I thought I could first filter out all rows whose strings are not convertible to float, then perform the above operation on the rest, then concatenate the two results, like this:

  df1 = df[df['Value'].StringsNonFloat]  # StringsNonFloat is a made-up placeholder
  df2 = df[(df['Value'] <= 0) | (df['Value'] > 1)]
  final_df = pd.concat([df1, df2])

But is this Pythonic? Is there a function that does this?
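
The two-step idea above could be sketched with real pandas calls roughly like this. It assumes pd.to_numeric with errors='coerce' as the stand-in for the "strings that are not floats" test, and that 'Value' holds plain strings; the names numeric, non_numeric and out_of_range are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["Asia", "Asia", "EMEA", "USME", "APAC", "APAC"],
        "Value": ["N/A", "0", "2", "0.50", "NODATA", "1.00"],
        "ID": ["XYAZZ", "ZSJDS", "KKKKK", "KLJSA", "JKUUS", "JKUUC"],
    }
)

# Strings that cannot be parsed become NaN instead of raising.
numeric = pd.to_numeric(df["Value"], errors="coerce")

# Step 1: rows whose 'Value' is not convertible to a number.
non_numeric = df[numeric.isna()]

# Step 2: rows whose numeric 'Value' lies outside 0 < x <= 1.
# Comparisons with NaN are False, so step 1 rows are not repeated here.
out_of_range = df[(numeric <= 0) | (numeric > 1)]

# Step 3: combine both subsets and restore the original row order.
final_df = pd.concat([non_numeric, out_of_range]).sort_index()
print(final_df)
```

The original 'Value' strings are left untouched, because the numeric Series is used only to build the masks.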

CodePudding user response：

# Parse once; strings that are not numbers become NaN
num = pd.to_numeric(df['Value'], errors='coerce')
# Keep rows that are not numbers or that fall outside 0 < x <= 1
df = df[~(num.notnull() & num.between(0, 1, inclusive='right'))]

print(df)

output:

  Region   Value     ID
0   Asia     NaN  XYAZZ
1   Asia       0  ZSJDS
2   EMEA       2  KKKKK
4   APAC  NODATA  JKUUS
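
To see why the negated mask keeps both the non-numeric rows and the out-of-range numbers, here is a small check on a made-up Series. It assumes a pandas version in which between accepts the string form inclusive='right' (1.3 or later, as far as I know).

```python
import numpy as np
import pandas as pd

# Made-up values: a missing value, both boundaries, one inside, one outside.
values = pd.Series([np.nan, 0.0, 0.5, 1.0, 2.0])

# inclusive='right' tests 0 < x <= 1; NaN never satisfies the comparison.
in_range = values.between(0, 1, inclusive="right")
print(in_range.tolist())

# Negating the mask selects everything outside the range, NaN included.
outside = values[~in_range]
print(outside.index.tolist())
```

Since NaN already yields False in between, the explicit notnull() check is a safeguard rather than a requirement.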

CodePudding user response：

I believe you may be looking for something like this:

df = df[(df["Value"].astype(float) > 1) | (df["Value"].astype(float) <= 0)]

This keeps the rows where the value, cast to float, is greater than 1 or less than or equal to 0.

This is very similar to your code but using .astype(float) instead.
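
One caveat, shown here as a small sketch on a made-up Series: astype(float) raises as soon as it meets a string that is not a number, such as NODATA, so the line above only works on columns that are fully convertible.

```python
import pandas as pd

# Made-up values mixing string-numbers with one non-numeric string.
values = pd.Series(["0", "2", "0.50", "NODATA"])

try:
    values.astype(float)
    failed = False
except ValueError as exc:
    # 'NODATA' cannot be parsed as a float, so the whole conversion fails.
    failed = True
    print(f"astype(float) failed: {exc}")
```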

Edit (following comments)

I would do the following:

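Start by converting each value to float where possible (isdigit() alone would reject decimals such as '0.50', so one decimal point is stripped before the check):

df["Value"] = [float(v) if isinstance(v, str) and v.replace(".", "", 1).isdigit() else v for v in df["Value"]]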

Then subset by only those that are float (note that this drops non-numeric rows such as NODATA):

df = df[df["Value"].apply(lambda x: isinstance(x,float))]

Then subset again to those that are outside the range 0 < x <= 1:

df = df[(df["Value"] > 1) | (df["Value"] <= 0)]
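
If the non-numeric rows should be kept as well, as the expected output in the question suggests, a variant of the same step-by-step idea is to test each value with a small helper. This is only a sketch: in_unit_range is a hypothetical name, and the sample frame assumes 'Value' holds plain strings.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Region": ["Asia", "Asia", "EMEA", "USME", "APAC", "APAC"],
        "Value": ["N/A", "0", "2", "0.50", "NODATA", "1.00"],
        "ID": ["XYAZZ", "ZSJDS", "KKKKK", "KLJSA", "JKUUS", "JKUUC"],
    }
)


def in_unit_range(value):
    """Return True only if value parses as a number with 0 < x <= 1."""
    try:
        return 0 < float(value) <= 1
    except (TypeError, ValueError):
        # Not convertible (e.g. 'NODATA', 'N/A', None): treat as out of range.
        return False


# Keep every row whose 'Value' is NOT a number in (0, 1].
result = df[~df["Value"].apply(in_unit_range)]
print(result)
```

This runs Python code per row, so on large frames a vectorized conversion is likely faster, but the try/except makes the handling of odd strings explicit.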