Dealing with filtering a column with both numbers and strings using the panda query method-CodePudding

def filter_dataframe(dataframe, column, numbers, strings=None):
    number_query = f"({column} >= {numbers[0]} and {column} <= {numbers[1]})"
    if strings is not None:
        single_string_query_list = []
        for string in strings:
            single_string_query = f"({column} == '{string}')"
            single_string_query_list.append(single_string_query)
        if len(strings) == 1:
            string_query = f"{' or '.join(single_string_query_list)}"
        else:
            string_query = f"({' or '.join(single_string_query_list)})"
        
        full_query = number_query   " or "   string_query
    else:
         full_query = number_query
    dataframe = dataframe.query(full_query)
    return dataframe

I am writing a function filter_dataframe that allows you to filter a specific dataframe by the values in a specific column. In particular, a column that contains both strings a numbers. The numbers parameter should be a list in the form [min value, max value] and the strings parameter should be a list of the string values you want. For example, let's say that I want to filter the below dataframe based on the values inside the "Numbers" column

             Numbers
0            1
1            2
2            3
3            4
4            5
5            6
6            "example 1"
7            7
8            "example 2"
9            "example 3"
10           8

which is created by the below code

numbers = [1, 2, 3, 4, 5, 6, "example 1", 7, "example 2", "example 3", 8]
df = pd.DataFrame(numbers, columns=["Numbers"])

and I want only the rows that have a number between 1 and 6 inclusive or contain the string "example 1" or "example 2". Assuming that filter_dataframe works, then we would execute

filter_dataframe(dataframe = df, column = "Numbers", numbers = [1,6], strings = ["example 1", "example 2"])

Essentially what happens inside filter_dataframe is it generates a pandas query (which is stored in the local variable full_query) which looks something like "(Numbers >= 1 and Numbers <= 6) or ((Numbers == 'example 1') or (Numbers == 'example 2'))".

Obviously, because there are both strings and integers in the column, we will run into the error TypeError: '>=' not supported between instances of 'str' and 'int'. I know this function isn't greatly written, but are there any ideas on how work around this error?

CodePudding user response：

Create a temp column, with numeric type. Set it equal to your query column, or NaN where you find a string.

Now your .query(...) is a disjunct between a fast range query on the numeric column, and str set membership on the original column.

CodePudding user response：

You don't have to use query, boolean indexing should be easier:

def filter_dataframe(dataframe, column, numbers, strings=None):
    if strings is None:
        strings = []

    return dataframe[pd.to_numeric(dataframe[column], errors='coerce')
                       .between(*numbers)
                    | dataframe[columns].isin(strings)]